Importing Web Content
VS-R-1 Importing Web Content
Default traversal algorithm of the web robot
1. Put only the start link into the queue
2. Get the oldest entry from the queue (for the given target) and mark it as 'pending'
3. Retrieve the link
4. Parse the page and extract the URLs
5. Apply the "Traverse" and "Do not traverse" rules
6. Insert the new entries into the queue
7. Mark the queue entry as 'retrieved'
8. Repeat from step 2 if there are more 'waiting' entries; otherwise finish
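The steps above can be modelled in a few lines. This is only an in-memory sketch in Python, not the robot's actual implementation: the function name `crawl` and the callbacks `fetch`, `extract_links` and `allowed` are hypothetical stand-ins for retrieval, parsing and the traversal rules.

```python
def crawl(start_url, fetch, extract_links, allowed):
    """In-memory model of the default traversal; returns URLs in retrieval order."""
    # Step 1: the queue holds only the start link. A dict keeps insertion
    # order, which stands in for "oldest entry first".
    queue = {start_url: "waiting"}
    retrieved = []
    while True:
        # Step 2: take the oldest 'waiting' entry and mark it 'pending'.
        url = next((u for u, s in queue.items() if s == "waiting"), None)
        if url is None:
            break  # Step 8: no 'waiting' entries remain, so finish.
        queue[url] = "pending"
        page = fetch(url)                   # Step 3: retrieve the link
        for link in extract_links(page):    # Step 4: extract the URLs
            # Steps 5 and 6: apply the rules, then queue unseen links.
            if allowed(link) and link not in queue:
                queue[link] = "waiting"
        queue[url] = "retrieved"            # Step 7
        retrieved.append(url)
    return retrieved


# Hypothetical four-page site: each path maps to the links found on it.
site = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": [], "/c": ["/"]}
order = crawl("/", fetch=lambda u: site[u],
              extract_links=lambda page: page, allowed=lambda link: True)
print(order)  # ['/', '/a', '/b', '/c']
```

Because the oldest waiting entry is always taken first, the pages come back in breadth-first order, and a link that is already queued is never queued twice.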
- Traverse - a semicolon-separated string of path masks (the wildcard characters of the SQL LIKE predicate apply). A link that matches any of these masks is added to the queue for processing, but only if it does not also match a "Do not traverse" mask.
- Do not traverse - the same semicolon-separated format, but the masks have the opposite meaning. A link that matches any of these masks is skipped.
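The interaction of the two rules can be illustrated with a small Python sketch. It is not the robot's code: the helper names are hypothetical, and only the two standard LIKE wildcards are modelled (`%` for any run of characters, `_` for a single character). Whether an empty "Traverse" string means "follow everything" is not stated here; in this sketch an empty string matches nothing.

```python
import re


def like_to_regex(mask):
    """Translate a LIKE-style mask into a compiled regular expression."""
    parts = []
    for ch in mask:
        if ch == "%":
            parts.append(".*")      # any run of characters, including none
        elif ch == "_":
            parts.append(".")       # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def split_masks(rule):
    """Split a semicolon-separated rule string, dropping empty masks."""
    return [m.strip() for m in rule.split(";") if m.strip()]


def should_traverse(link, traverse, do_not_traverse):
    """True if link matches a Traverse mask and no Do-not-traverse mask."""
    follow = any(like_to_regex(m).fullmatch(link)
                 for m in split_masks(traverse))
    skip = any(like_to_regex(m).fullmatch(link)
               for m in split_masks(do_not_traverse))
    return follow and not skip


print(should_traverse("/docs/a.html", "/docs/%;/img/%", "%.pdf"))  # True
print(should_traverse("/docs/a.pdf", "/docs/%;/img/%", "%.pdf"))   # False
```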
Tables of the Web Robot
-- Target site definition
create table WS.WS.VFS_SITE (
VS_DESCR varchar, -- human readable description
VS_HOST varchar, -- target host name
VS_URL varchar, -- start URL path
VS_INX varchar (5), -- reserved
VS_OWN integer, -- local WebDAV owner of the retrieved content
VS_ROOT varchar, -- local WebDAV collection destination
VS_NEWER datetime, -- update only newer than (this date)
VS_DEL varchar, -- delete local resource if removal on target detected
VS_FOLLOW varchar, -- follow these links
VS_NFOLLOW varchar, -- do not follow these links
VS_SRC varchar, -- retrieve images
VS_OPTIONS varchar, -- uid/pwd, stored here if the target requires authentication
VS_METHOD varchar, -- use WebDAV or traditional HTTP method
VS_OTHER varchar, -- follow links to other sites
primary key (VS_HOST, VS_ROOT));
create index VS_HOST_ROOT on WS.WS.VFS_SITE (VS_HOST, VS_URL, VS_ROOT)
;
-- Queue table
create table WS.WS.VFS_QUEUE (
VQ_HOST varchar, -- target host name
VQ_TS datetime, -- date and time of adding into the queue
VQ_URL varchar, -- target URL path
VQ_ROOT varchar, -- local WebDAV folder destination
VQ_STAT varchar (15), -- status
VQ_OTHER varchar, -- retrieved from other site
primary key (VQ_HOST, VQ_URL, VQ_ROOT));
create index VQ_HOST_ROOT_STAT on WS.WS.VFS_QUEUE (VQ_HOST, VQ_ROOT, VQ_STAT);
create index VQ_HOST_ROOT on WS.WS.VFS_QUEUE (VQ_HOST, VQ_ROOT);
create index VQ_HOST_TIME on WS.WS.VFS_QUEUE (VQ_HOST, VQ_ROOT, VQ_TS);
create index VQ_TS on WS.WS.VFS_QUEUE (VQ_TS)
;
-- Retrieved URLs table
create table WS.WS.VFS_URL (
VU_HOST varchar, -- target host name
VU_URL varchar, -- retrieved URL path
VU_ROOT varchar, -- local WebDAV folder destination
VU_CHKSUM varchar, -- content checksum
VU_ETAG varchar, -- target's ETag
VU_CPTIME datetime, -- retrieval date and time
VU_OTHER varchar, -- retrieved from other site
primary key (VU_HOST, VU_URL, VU_ROOT));
create index VU_HOST_ROOT on WS.WS.VFS_URL (VU_HOST, VU_ROOT)
;
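The role of the queue table's primary key and timestamp can be tried out in isolation. The sketch below uses Python's built-in sqlite3 module with a simplified copy of WS.WS.VFS_QUEUE, so it is only a model: SQLite types and the `insert or ignore` syntax are SQLite's, not Virtuoso's, and the helper names and the `/DAV/foo/` collection are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified copy of the queue table (SQLite column types, same key).
conn.execute("""
    create table VFS_QUEUE (
        VQ_HOST text, VQ_TS text, VQ_URL text, VQ_ROOT text,
        VQ_STAT text, VQ_OTHER text,
        primary key (VQ_HOST, VQ_URL, VQ_ROOT))
""")


def enqueue(host, url, root, ts):
    """Add a link as 'waiting'; return False if that key is already queued."""
    cur = conn.execute(
        "insert or ignore into VFS_QUEUE "
        "(VQ_HOST, VQ_TS, VQ_URL, VQ_ROOT, VQ_STAT) "
        "values (?, ?, ?, ?, 'waiting')",
        (host, ts, url, root))
    return cur.rowcount == 1


def oldest_waiting(host, root):
    """Return the URL of the oldest 'waiting' entry for a target, or None."""
    row = conn.execute(
        "select VQ_URL from VFS_QUEUE "
        "where VQ_HOST = ? and VQ_ROOT = ? and VQ_STAT = 'waiting' "
        "order by VQ_TS limit 1",
        (host, root)).fetchone()
    return row[0] if row else None
```

Calling `enqueue` twice with the same host, URL and collection leaves a single row, since the primary key (VQ_HOST, VQ_URL, VQ_ROOT) identifies a queue entry; `oldest_waiting` shows the kind of lookup that the (VQ_HOST, VQ_ROOT, VQ_TS) index is shaped for.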
Hooks for Parameterizing the Web Robot
A custom queue hook can be used to extract the next entry from the robot's queue according to a custom algorithm. The prototype of the hook is:
create procedure
DB.DBA.VFS_HOOK (in host varchar, in collection varchar, out url varchar, in data any)
{
-- choose an entry from queue (may use the user-specific data passed as 'data' variable)
-- mark as 'pending'
-- set the 'url' from the chosen entry
-- return 1 on success or 0 if no more entries
}
;
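The contract described in the prototype's comments can be modelled outside the database. The following Python sketch is not Virtuoso PL and is not the shipped hook: the queue is a plain list of dicts, the `out url` parameter is modelled as a second return value, and the selection policy (shallowest path first, with `data` used as an optional preferred URL prefix) is an assumption chosen purely for illustration.

```python
def vfs_hook(queue, host, collection, data=None):
    """Model of a custom queue hook.

    Returns (1, url) after marking the chosen entry 'pending',
    or (0, None) when no 'waiting' entry is left for the target.
    """
    candidates = [e for e in queue
                  if e["host"] == host and e["root"] == collection
                  and e["stat"] == "waiting"]
    if not candidates:
        return 0, None
    prefix = data or ""
    # Hypothetical policy: URLs under the preferred prefix first, then the
    # shallowest path, then the oldest timestamp.
    chosen = min(candidates,
                 key=lambda e: (not e["url"].startswith(prefix),
                                e["url"].count("/"), e["ts"]))
    chosen["stat"] = "pending"
    return 1, chosen["url"]


entries = [
    {"host": "www.foo.bar", "root": "/DAV/foo/", "url": "/a/b/c.html",
     "ts": 1, "stat": "waiting"},
    {"host": "www.foo.bar", "root": "/DAV/foo/", "url": "/index.html",
     "ts": 2, "stat": "waiting"},
]
print(vfs_hook(entries, "www.foo.bar", "/DAV/foo/"))  # (1, '/index.html')
```

With the default oldest-first rule the deeper page would have been taken first; the hook changes only the choice of entry, while the 'pending' marking and the 1/0 return value follow the prototype.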
Web Index Example
- The example does a breadth-first traversal of all reachable sites.
- To actually retrieve some links, change all occurrences of www.foo.bar to a site of your choice.
- The sites to process are maintained separately in an 'interesting sites' list.
- The example runs on multiple threads.
- The interface can start n threads running the robot.
- A web page shows the running status, e.g. the number of pages fetched in the last minute.
- A stop button kills the HTTP threads running the web robot.
- Important: the example requires a license to run because it uses multiple web threads.
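The threading behaviour listed above (n workers, a status counter, a stop control) can be pictured with a generic Python sketch. It says nothing about how the .vsp pages are actually written: `run_robots`, `fetch` and the `stop` event are hypothetical names, and the stop control here asks workers to exit between pages rather than killing them.

```python
import queue
import threading


def run_robots(urls, fetch, n_threads, stop):
    """Run n_threads workers over urls; return the number of pages fetched."""
    work = queue.Queue()
    for u in urls:
        work.put(u)
    fetched = []                 # what a status page could report on
    lock = threading.Lock()

    def worker():
        # Each worker checks the stop flag before taking the next entry.
        while not stop.is_set():
            try:
                url = work.get_nowait()
            except queue.Empty:
                return           # nothing left to do
            fetch(url)
            with lock:
                fetched.append(url)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(fetched)


stop = threading.Event()
print(run_robots(["/a", "/b", "/c"], lambda u: None, 2, stop))  # 3
```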
| View the source | Action |
|---|---|
| 1. vs_r_1.sql | Set the initial state |
| 2. vs_r_1_run.vsp | Run |
| 3. vs_r_1_stat.vsp | Run |