GabrieLinux ht://CheckING

© 1999-2004 - Comune di Prato - Italia
Some portions © 1995, 2000 - The ht://Dig Group
Some portions © 2008 - Devise.IT
Distributed under GNU GPL



ht://Check Features

ht://Check is made up of two logical parts: a "spider", which starts checking URLs from a specific one or from a list of them, and an "analyser", which takes the results of the first part and shows summaries (either on the console or through the PHP interface served by a web server).

The "Spider" or "Crawler"

- HTTP/1.1 compliant, with support for persistent connections and cookies
- HTTP Basic authentication supported
- HTTP proxy support (basic authentication included)
- Crawl customisable through many configuration attributes, which let the user
limit the digging by URL pattern matching and by distance ("hops") from the first URL
- MySQL databases created directly by the spider
- MySQL connections through user or general option files, as defined by the
database system (/etc/my.cnf or ~/.my.cnf)

JavaScript is not supported, and neither are protocols other than HTTP, such as HTTPS, FTP, NNTP and local files.
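The pattern and hop limits listed above could work roughly like the sketch below. It is only an illustration in Python, not the spider's actual code: classify_url and its parameters are hypothetical names, the labels are borrowed from the Status column of the Schedule table shown later in this page, and the mapping from each rule to a label is an assumption.

```python
import re
from urllib.parse import urlsplit


def classify_url(url, hop_count, limit_pattern, max_hops):
    """Return a Schedule.Status-like label for a discovered URL.

    Hypothetical helper: the rule-to-label mapping is assumed, not taken
    from the ht://Check sources.
    """
    scheme = urlsplit(url).scheme.lower()
    if scheme == "mailto":
        return "EMail"
    if scheme == "javascript":
        return "Javascript"
    if scheme == "file":
        return "FileProtocol"
    if scheme != "http":
        # HTTPS, FTP, NNTP and so on are not crawled
        return "NotValidService"
    if hop_count > max_hops:
        # too far from the starting URL
        return "MaxHopCount"
    if re.search(limit_pattern, url):
        # inside the crawl limits: fetch the document and follow its links
        return "ToBeRetrieved"
    # outside the limits: assumed to be checked for existence only
    return "CheckIfExists"


if __name__ == "__main__":
    limit = r"^http://www\.example\.org/"
    for candidate, hops in [
        ("http://www.example.org/docs/", 1),
        ("http://other.example.com/", 1),
        ("https://www.example.org/", 1),
        ("http://www.example.org/very/deep/page.html", 9),
    ]:
        print(candidate, classify_url(candidate, hops, limit, max_hops=3))
```

The host names and the limit of 3 hops are made-up values for the demonstration.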

The "Analyser"

A preface: since all the data gathered by a crawl are stored in a MySQL database, it is easy to get the information you want by querying the database. The spider is included in the 'htcheck' application, which shows a small text report of its own when the crawl ends. Afterwards you can always retrieve information from that database by building your own interface (in PHP or Perl, for instance) or by using the default one, written in PHP.
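As an illustration of querying the crawl data directly, here is a minimal sketch that lists broken links together with the page they were found on. It runs against an in-memory SQLite database so that it is self-contained; the two tables are cut-down stand-ins for the Url and Link tables shown later in this page, the rows are made up, and broken_links is a hypothetical helper name. Against a real ht://Check database the same SELECT would be sent through a MySQL client instead.

```python
import sqlite3

# Cut-down stand-ins for the Url and Link tables (SQLite types, few columns).
SCHEMA = """
CREATE TABLE Url (
  IDUrl INTEGER PRIMARY KEY,
  Url TEXT NOT NULL,
  StatusCode INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE Link (
  IDUrlSrc INTEGER NOT NULL,
  IDUrlDest INTEGER NOT NULL,
  TagPosition INTEGER NOT NULL DEFAULT 0,
  AttrPosition INTEGER NOT NULL DEFAULT 0,
  LinkResult TEXT NOT NULL DEFAULT 'NotChecked',
  PRIMARY KEY (IDUrlSrc, IDUrlDest, TagPosition, AttrPosition)
);
"""

# Made-up sample rows, only for the demonstration.
SAMPLE = """
INSERT INTO Url VALUES (1, 'http://www.example.org/', 200);
INSERT INTO Url VALUES (2, 'http://www.example.org/about.html', 200);
INSERT INTO Url VALUES (3, 'http://www.example.org/missing.html', 404);
INSERT INTO Link VALUES (1, 2, 5, 0, 'OK');
INSERT INTO Link VALUES (1, 3, 9, 0, 'Broken');
INSERT INTO Link VALUES (2, 3, 4, 0, 'Broken');
"""

BROKEN_LINKS = """
SELECT src.Url, dest.Url, dest.StatusCode
FROM Link
JOIN Url AS src ON src.IDUrl = Link.IDUrlSrc
JOIN Url AS dest ON dest.IDUrl = Link.IDUrlDest
WHERE Link.LinkResult = 'Broken'
ORDER BY src.Url
"""


def broken_links(conn):
    """Return (source URL, destination URL, HTTP status) for every broken link."""
    return conn.execute(BROKEN_LINKS).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA + SAMPLE)

if __name__ == "__main__":
    for row in broken_links(conn):
        print(row)
```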

I also believe that ht://Check builds a data source that can be used for Web structure mining, revealing knowledge about the relationships within and between documents. Web usage mining tools can also find interesting information in an ht://Check database and use it as an auxiliary data source, for instance to build a sort of site map.
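To give an idea of what structure mining on this data might look like, the sketch below treats (IDUrlSrc, IDUrlDest) pairs, as they could be read from the Link table, as a directed graph. It then computes how many links away each document is from a starting one, which is the backbone of a simple site map. The pairs are made up, and build_graph and depths_from are hypothetical helper names, not part of ht://Check.

```python
from collections import defaultdict, deque


def build_graph(link_rows):
    """Turn (IDUrlSrc, IDUrlDest) pairs into outgoing and incoming adjacency sets."""
    outgoing = defaultdict(set)
    incoming = defaultdict(set)
    for src, dest in link_rows:
        outgoing[src].add(dest)
        incoming[dest].add(src)
    return outgoing, incoming


def depths_from(outgoing, start):
    """Breadth-first search: fewest links needed to reach each URL from start."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in sorted(outgoing.get(node, ())):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return depth


# Made-up (IDUrlSrc, IDUrlDest) pairs, as a SELECT on the Link table might return.
links = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 1)]
outgoing, incoming = build_graph(links)

if __name__ == "__main__":
    print("depth from URL 1:", depths_from(outgoing, 1))
    print("incoming link counts:", {url: len(srcs) for url, srcs in incoming.items()})
```

Documents that never show up in the depth map cannot be reached from the starting URL, and documents with many incoming links are natural candidates for the top levels of a site map.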

'htcheck' (the console application) gives you a summary of broken links, broken anchors, servers seen and content-types encountered.
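A summary of this kind can also be recomputed from the database with two GROUP BY queries. The sketch below is not how 'htcheck' itself builds its report. It uses an in-memory SQLite database with cut-down stand-ins for the Url and Link tables, made-up rows, and two hypothetical helpers, link_results and content_types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Url (
  IDUrl INTEGER PRIMARY KEY,
  ContentType TEXT NOT NULL DEFAULT ''
);
CREATE TABLE Link (
  IDUrlSrc INTEGER NOT NULL,
  IDUrlDest INTEGER NOT NULL,
  LinkResult TEXT NOT NULL DEFAULT 'NotChecked'
);
-- made-up sample rows
INSERT INTO Url VALUES (1, 'text/html');
INSERT INTO Url VALUES (2, 'text/html');
INSERT INTO Url VALUES (3, 'image/png');
INSERT INTO Link VALUES (1, 2, 'OK');
INSERT INTO Link VALUES (1, 3, 'OK');
INSERT INTO Link VALUES (2, 1, 'AnchorNotFound');
INSERT INTO Link VALUES (2, 3, 'Broken');
""")


def link_results(conn):
    """Number of links for each LinkResult value."""
    sql = "SELECT LinkResult, COUNT(*) FROM Link GROUP BY LinkResult"
    return dict(conn.execute(sql).fetchall())


def content_types(conn):
    """Number of documents for each content-type."""
    sql = "SELECT ContentType, COUNT(*) FROM Url GROUP BY ContentType"
    return dict(conn.execute(sql).fetchall())


if __name__ == "__main__":
    print("links by result:", link_results(conn))
    print("documents by content-type:", content_types(conn))
```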

The PHP interface provides:
- Queries regarding URLs, by choosing among many discriminating criteria such as
pattern matching, status code, content-type and size.
- Queries regarding links, with pattern matching on both source and destination
URLs (also with regular expressions), on the results (broken, ok, anchor not found,
redirected) and on their type (normal, direct, redirection).
- Info regarding specific URLs (outgoing and incoming links, last modification
date and time, etc.)
- Info regarding specific links (broken or ok) and the HTML instruction that
issued them
- Statistics on the documents retrieved
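The URL queries in the first item of the list above boil down to a SELECT with optional filters. The following sketch is not the code of the PHP interface, only an illustration of the idea in Python. It uses an in-memory SQLite stand-in for the Url table with made-up rows, and find_urls with its parameters is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Url (
  IDUrl INTEGER PRIMARY KEY,
  Url TEXT NOT NULL,
  StatusCode INTEGER NOT NULL DEFAULT 0,
  ContentType TEXT NOT NULL DEFAULT '',
  Size INTEGER NOT NULL DEFAULT 0
)
""")
# Made-up sample rows: IDUrl, Url, StatusCode, ContentType, Size
conn.executemany(
    "INSERT INTO Url VALUES (?, ?, ?, ?, ?)",
    [
        (1, "http://www.example.org/", 200, "text/html", 5120),
        (2, "http://www.example.org/docs/guide.html", 200, "text/html", 20480),
        (3, "http://www.example.org/docs/old.html", 404, "text/html", 512),
        (4, "http://www.example.org/logo.png", 200, "image/png", 3072),
    ],
)


def find_urls(conn, pattern="%", status_code=None, content_type=None, min_size=0):
    """Filter URLs by LIKE pattern, status code, content-type and minimum size."""
    sql = "SELECT Url, StatusCode, ContentType, Size FROM Url WHERE Url LIKE ? AND Size >= ?"
    params = [pattern, min_size]
    if status_code is not None:
        sql += " AND StatusCode = ?"
        params.append(status_code)
    if content_type is not None:
        sql += " AND ContentType = ?"
        params.append(content_type)
    sql += " ORDER BY Url"
    # values are always passed as bound parameters, never pasted into the SQL text
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    print(find_urls(conn, pattern="%/docs/%"))
    print(find_urls(conn, status_code=404))
    print(find_urls(conn, content_type="text/html", min_size=10000))
```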

The database schema

Here is a very *skimpy* entity-relationship diagram of every ht://Check database created by the spider.

[Figure: ht://Check database entity-relationship diagram]

The tables as dumped by the 'mysqldump' program

Here follows the structure of the tables of a typical ht://Check database, as dumped by the mysqldump program. Please refer to the MySQL documentation for further information. And if you have any useful advice or suggestions regarding the database (and, of course, everything else), please send me an e-mail! :-)

#
# Table structure for table `Cookies`
#

CREATE TABLE Cookies (
  IDCookie mediumint(8) unsigned NOT NULL default '0',
  Name varchar(255) NOT NULL default '',
  Value text NOT NULL,
  Path varchar(255) NOT NULL default '',
  Domain varchar(255) NOT NULL default '',
  MaxAge mediumint(9) NOT NULL default '-1',
  Version tinyint(4) NOT NULL default '0',
  SrcUrl varchar(255) NOT NULL default '',
  Expires datetime NOT NULL default '0000-00-00 00:00:00',
  Secure tinyint(4) NOT NULL default '0',
  DomainValid tinyint(4) NOT NULL default '0',
  PRIMARY KEY  (IDCookie)
);

#
# Table structure for table `HtmlAttribute`
#

CREATE TABLE HtmlAttribute (
  IDUrl mediumint(8) unsigned NOT NULL default '0',
  TagPosition smallint(5) unsigned NOT NULL default '0',
  AttrPosition tinyint(3) unsigned NOT NULL default '0',
  Attribute varchar(32) NOT NULL default '',
  Content varchar(255) default NULL,
  PRIMARY KEY  (IDUrl,TagPosition,AttrPosition),
  KEY Idx_Attribute (Attribute(8))
);

#
# Table structure for table `HtmlStatement`
#

CREATE TABLE HtmlStatement (
  IDUrl mediumint(8) unsigned NOT NULL default '0',
  TagPosition smallint(5) unsigned NOT NULL default '0',
  Row mediumint(8) unsigned NOT NULL default '0',
  Tag varchar(32) NOT NULL default '',
  Statement varchar(255) default NULL,
  PRIMARY KEY  (IDUrl,TagPosition),
  KEY Idx_Tag (Tag(4))
);

#
# Table structure for table `Link`
#

CREATE TABLE Link (
  IDUrlSrc mediumint(8) unsigned NOT NULL default '0',
  IDUrlDest mediumint(8) unsigned NOT NULL default '0',
  TagPosition smallint(5) unsigned NOT NULL default '0',
  AttrPosition tinyint(3) unsigned NOT NULL default '0',
  Anchor varchar(255) binary NOT NULL default '',
  LinkType enum('Normal','Direct','Redirection') NOT NULL default 'Normal',
  LinkResult enum('NotChecked','NotRetrieved','OK','Broken','AnchorNotFound','Redirected','NotAuthorized','EMail','Javascript','BadEncoded') NOT NULL default 'NotChecked',
  LinkDomain enum('SameServer','Internal','External') default NULL,
  PRIMARY KEY  (IDUrlSrc,IDUrlDest,TagPosition,AttrPosition),
  KEY Idx_IDUrlDest (IDUrlDest),
  KEY Idx_Anchor (Anchor(8)),
  KEY Idx_LinkType (LinkType),
  KEY Idx_LinkResult (LinkResult)
);

#
# Table structure for table `Schedule`
#

CREATE TABLE Schedule (
  IDUrl mediumint(8) unsigned NOT NULL default '0',
  IDServer smallint(5) unsigned NOT NULL default '0',
  Url varchar(255) binary NOT NULL default '',
  Status enum('ToBeRetrieved','Retrieved','CheckIfExists','Checked','BadQueryString','BadExtension','MaxHopCount','FileProtocol','EMail','Javascript','NotValidService','Malformed') NOT NULL default 'ToBeRetrieved',
  Domain enum('Internal','External') default NULL,
  CreationTime datetime NOT NULL default '0000-00-00 00:00:00',
  IDReferer mediumint(8) unsigned NOT NULL default '0',
  HopCount tinyint(3) unsigned NOT NULL default '0',
  PRIMARY KEY  (IDUrl),
  KEY Idx_IDServer (IDServer),
  KEY Idx_Url (Url(64)),
  KEY Idx_Status (Status)
);

#
# Table structure for table `Server`
#

CREATE TABLE Server (
  IDServer smallint(5) unsigned NOT NULL default '0',
  Server varchar(255) NOT NULL default '',
  IPAddress varchar(15) default NULL,
  Port smallint(5) unsigned NOT NULL default '0',
  HttpServer varchar(255) NOT NULL default '',
  HttpVersion varchar(255) NOT NULL default '',
  PersistentConnection tinyint(1) unsigned NOT NULL default '0',
  Requests smallint(5) unsigned NOT NULL default '0',
  PRIMARY KEY  (IDServer),
  KEY Idx_Server (Server(24)),
  KEY Idx_Requests (Requests)
);

#
# Table structure for table `Url`
#

CREATE TABLE Url (
  IDUrl mediumint(8) unsigned NOT NULL default '0',
  IDServer smallint(5) unsigned NOT NULL default '0',
  Url varchar(255) binary NOT NULL default '',
  ContentType varchar(32) NOT NULL default '',
  ConnStatus enum('OK','NoHeader','NoHost','NoPort','NoConnection','ConnectionDown','ServiceNotValid','OtherError') NOT NULL default 'OK',
  ContentLanguage varchar(16) NOT NULL default '',
  TransferEncoding varchar(32) NOT NULL default '',
  LastModified datetime NOT NULL default '0000-00-00 00:00:00',
  LastAccess datetime NOT NULL default '0000-00-00 00:00:00',
  Size int(11) NOT NULL default '0',
  StatusCode smallint(6) NOT NULL default '0',
  ReasonPhrase varchar(32) NOT NULL default '',
  Location varchar(255) binary NOT NULL default '',
  Title varchar(255) NOT NULL default '',
  Contents mediumtext,
  SizeAdd int(11) NOT NULL default '0',
  PRIMARY KEY  (IDUrl),
  KEY Idx_IDServer (IDServer),
  KEY Idx_Url (Url(64)),
  KEY Idx_ContentType (ContentType(16)),
  KEY Idx_StatusCode (StatusCode)
);

#
# Table structure for table `htCheck`
#

CREATE TABLE htCheck (
  StartTime datetime NOT NULL default '0000-00-00 00:00:00',
  EndTime datetime NOT NULL default '0000-00-00 00:00:00',
  ScheduledUrls mediumint(8) unsigned NOT NULL default '0',
  TotUrls mediumint(8) unsigned NOT NULL default '0',
  RetrievedUrls mediumint(8) unsigned NOT NULL default '0',
  TCPConnections mediumint(8) unsigned NOT NULL default '0',
  ServerChanges mediumint(8) unsigned NOT NULL default '0',
  HTTPRequests mediumint(8) unsigned NOT NULL default '0',
  HTTPSeconds mediumint(8) unsigned NOT NULL default '0',
  HTTPBytes bigint(20) unsigned NOT NULL default '0',
  User varchar(255) NOT NULL default '',
  PRIMARY KEY  (StartTime,EndTime)
);
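The tables above share some column names, which suggests how they can be joined. As a sketch, the example below looks up the HTML statement that issued each broken link by matching Link.IDUrlSrc and Link.TagPosition against HtmlStatement.IDUrl and HtmlStatement.TagPosition. That join condition is an assumption inferred from the column names, not something stated in this page. The tables are cut-down SQLite stand-ins, the rows (including the content of the Statement column) are made up, and statements_for_broken_links is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Url (
  IDUrl INTEGER PRIMARY KEY,
  Url TEXT NOT NULL
);
CREATE TABLE HtmlStatement (
  IDUrl INTEGER NOT NULL,
  TagPosition INTEGER NOT NULL,
  "Row" INTEGER NOT NULL DEFAULT 0,
  Tag TEXT NOT NULL DEFAULT '',
  Statement TEXT,
  PRIMARY KEY (IDUrl, TagPosition)
);
CREATE TABLE Link (
  IDUrlSrc INTEGER NOT NULL,
  IDUrlDest INTEGER NOT NULL,
  TagPosition INTEGER NOT NULL DEFAULT 0,
  AttrPosition INTEGER NOT NULL DEFAULT 0,
  LinkResult TEXT NOT NULL DEFAULT 'NotChecked',
  PRIMARY KEY (IDUrlSrc, IDUrlDest, TagPosition, AttrPosition)
);
""")

# Made-up sample rows, only for the demonstration.
conn.executemany("INSERT INTO Url VALUES (?, ?)", [
    (1, "http://www.example.org/"),
    (2, "http://www.example.org/missing.html"),
])
conn.executemany("INSERT INTO HtmlStatement VALUES (?, ?, ?, ?, ?)", [
    (1, 7, 12, "a", '<a href="missing.html">'),
])
conn.executemany("INSERT INTO Link VALUES (?, ?, ?, ?, ?)", [
    (1, 2, 7, 0, "Broken"),
])

STATEMENTS_FOR_BROKEN_LINKS = """
SELECT src.Url, st."Row", st.Statement, dest.Url
FROM Link
JOIN HtmlStatement AS st
  ON st.IDUrl = Link.IDUrlSrc AND st.TagPosition = Link.TagPosition
JOIN Url AS src ON src.IDUrl = Link.IDUrlSrc
JOIN Url AS dest ON dest.IDUrl = Link.IDUrlDest
WHERE Link.LinkResult = 'Broken'
ORDER BY src.Url, st."Row"
"""


def statements_for_broken_links(conn):
    """Return (source URL, source row, HTML statement, destination URL) tuples."""
    return conn.execute(STATEMENTS_FOR_BROKEN_LINKS).fetchall()


if __name__ == "__main__":
    for row in statements_for_broken_links(conn):
        print(row)
```

The column name Row is quoted in the sketch because it is also an SQL keyword in some database systems.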



ht://Check - More than a link checker - http://htcheck.sourceforge.net/
Maintainer: Gabriele Bartolini - angusgb@users.sourceforge.net
