de.kosi.utiltest
Class Spider

java.lang.Object
  |
  +--de.kosi.utiltest.Spider
All Implemented Interfaces:
java.util.Enumeration, Acme.HtmlObserver

public class Spider
extends java.lang.Object
implements Acme.HtmlObserver, java.util.Enumeration

A web-robot class.

This is an Enumeration class that traverses the web starting at a given URL. It fetches HTML files and parses them for new URLs to look at. All files it encounters, HTML or otherwise, are returned by the nextElement() method as a URLConnection.

The traversal is breadth-first, and by default it is limited to files at or below the starting point - same protocol, hostname, and initial directory.
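
As a rough illustration of the traversal order described above, here is a minimal sketch that walks an in-memory link graph breadth-first and only follows links at or below the starting URL. The class BreadthFirstSketch, its traverse method, and the links map are hypothetical stand-ins; the real Spider fetches pages over the network, and its internals may differ.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Hypothetical illustration only; not part of Spider. */
final class BreadthFirstSketch {
    /**
     * Returns the URLs reachable from startUrl in breadth-first order.
     * The links map (page URL to the URLs found on that page) stands in
     * for fetching and parsing HTML.
     */
    static List<String> traverse(String startUrl, Map<String, List<String>> links) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Queue<String> todo = new ArrayDeque<>();
        todo.add(startUrl);
        seen.add(startUrl);
        while (!todo.isEmpty()) {
            String url = todo.remove();          // take from the front of the queue
            visited.add(url);
            for (String next : links.getOrDefault(url, List.of())) {
                // Assumption: a plain prefix test approximates "same protocol,
                // hostname, and initial directory".
                if (next.startsWith(startUrl) && seen.add(next)) {
                    todo.add(next);              // append to the back of the queue
                }
            }
        }
        return visited;
    }
}
```

Because new URLs go to the back of the queue, every page linked directly from the starting page is visited before any page found one level further down.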

Because of the security restrictions on applets, this is currently only useful from applications.

Sample code:

 Enumeration spider = new Spider( "http://some.site.com/whatever/" );
 while ( spider.hasMoreElements() )
     {
     URLConnection conn = (URLConnection) spider.nextElement();
     // Then do whatever you like with conn:
     URL thisUrl = conn.getURL();
     String thisUrlStr = thisUrl.toExternalForm();
     String mimeType = conn.getContentType();
     long changed = conn.getLastModified();
     InputStream s = conn.getInputStream();
     // Etc. etc. etc., your code here.
     }
 
There are also a couple of methods you can override via a subclass, to control things like the search limits and what gets done with broken links.

Author:
Jef Poskanzer

Constructor Summary
Spider()
          Constructor with no size limits, and the default error stream.
Spider(int todoLimit, int doneLimit)
          Constructor with size limits.
Spider(int todoLimit, int doneLimit, java.io.PrintStream err)
          Constructor with size limits.
Spider(java.io.PrintStream err)
          Constructor with no size limits.
Spider(java.lang.String urlStr)
          Constructor with a single URL and no size limits, and the default error stream.
Spider(java.lang.String urlStr, java.io.PrintStream err)
          Constructor with a single URL and no size limits.
 
Method Summary
 void addObserver(Acme.HtmlObserver observer)
          Add an extra observer to the scanners we make.
 void addUrl(java.lang.String urlStr)
          Add a URL to the to-do list.
protected  void brokenLink(java.lang.String fromUrlStr, java.lang.String toUrlStr, java.lang.String errmsg)
          This method can be overridden by a subclass if you want to change the broken link policy.
protected  boolean doThisUrl(java.lang.String thisUrlStr, int depth, java.lang.String baseUrlStr)
          This method can be overridden by a subclass if you want to change the search policy.
 void gotAHREF(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
          Acme.HtmlObserver callback.
 void gotAREAHREF(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
          Acme.HtmlObserver callback.
 void gotBASEHREF(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
          Acme.HtmlObserver callback.
 void gotBODYBACKGROUND(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
          Acme.HtmlObserver callback.
 void gotFRAMESRC(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
          Acme.HtmlObserver callback.
 void gotIMGSRC(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
          Acme.HtmlObserver callback.
 void gotLINKHREF(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
          Acme.HtmlObserver callback.
 boolean hasMoreElements()
          Standard Enumeration method.
static void main(java.lang.String[] args)
          Test program.
 java.lang.Object nextElement()
          Standard Enumeration method.
protected  void reportError(java.lang.String fromUrlStr, java.lang.String toUrlStr, java.lang.String errmsg)
          This method can be overridden by a subclass if you want to change the error reporting policy.
 void setAuth(java.lang.String authCookie)
          Set the authorization cookie.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Spider

public Spider(java.io.PrintStream err)
Constructor with no size limits.
Parameters:
err - the error stream

Spider

public Spider()
Constructor with no size limits, and the default error stream.

Spider

public Spider(java.lang.String urlStr,
              java.io.PrintStream err)
       throws java.net.MalformedURLException
Constructor with a single URL and no size limits.
Parameters:
urlStr - the URL to start off the enumeration
err - the error stream
Throws:
java.net.MalformedURLException - if the urlStr does not contain a valid URL.

Spider

public Spider(java.lang.String urlStr)
       throws java.net.MalformedURLException
Constructor with a single URL and no size limits, and the default error stream.
Parameters:
urlStr - the URL to start off the enumeration
Throws:
java.net.MalformedURLException - if the urlStr does not contain a valid URL.

Spider

public Spider(int todoLimit,
              int doneLimit,
              java.io.PrintStream err)
Constructor with size limits. This version lets you specify limits on the todo queue and the done hash-table. If you are using Spider for a large, multi-site traversal, you may need to set these limits to avoid running out of memory. Note that setting a todoLimit means the traversal may be incomplete, since some URLs may be skipped, and setting a doneLimit means some pages may be revisited.

Guesses at good values for an unlimited traversal: a todoLimit of 200000 and a doneLimit of 20000. Keep the doneLimit fairly small, because the hash-table is checked for every URL and so stays mostly in memory; the todo queue, on the other hand, is only accessed at the front and back, and so will be mostly paged out.

Parameters:
todoLimit - maximum number of URLs to queue for examination
doneLimit - maximum number of URLs to remember having done already
err - the error stream
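
To make the trade-off between the two limits concrete, here is a minimal sketch of a bounded to-do queue and a bounded "done" set. BoundedCrawlState and its methods are hypothetical, and evicting the oldest done entry is an assumption; the description above does not say how Spider trims either structure.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of the two limits; not Spider's actual data structures. */
final class BoundedCrawlState {
    private final int todoLimit;
    private final Deque<String> todo = new ArrayDeque<>();
    private final Map<String, Boolean> done;

    BoundedCrawlState(int todoLimit, final int doneLimit) {
        this.todoLimit = todoLimit;
        // Insertion-ordered map that forgets its oldest URL once doneLimit is exceeded.
        this.done = new LinkedHashMap<String, Boolean>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > doneLimit;
            }
        };
    }

    /** Queues a URL; returns false if it is skipped (already done, or the queue is full). */
    boolean offer(String url) {
        if (done.containsKey(url) || todo.size() >= todoLimit) {
            return false;
        }
        todo.addLast(url);
        return true;
    }

    /** Takes the next URL from the front of the queue and remembers it as done. */
    String next() {
        String url = todo.removeFirst();
        done.put(url, Boolean.TRUE);
        return url;
    }
}
```

With a todoLimit of 2, a third queued URL is dropped, which is how a traversal ends up incomplete. With a doneLimit of 1, the first URL is forgotten as soon as a second one is finished, so it can be queued and visited again.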

Spider

public Spider(int todoLimit,
              int doneLimit)
Constructor with size limits.
Parameters:
todoLimit - maximum number of URLs to queue for examination
doneLimit - maximum number of URLs to remember having done already

Method Detail

addUrl

public void addUrl(java.lang.String urlStr)
            throws java.net.MalformedURLException
Add a URL to the to-do list.
Parameters:
urlStr - URL to be added.
Throws:
java.net.MalformedURLException - if the urlStr does not contain a valid URL.
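
If you would rather check a string before handing it to addUrl, a pre-check along the following lines may help. This is a sketch under the assumption that Spider's validation matches that of java.net.URL; UrlCheckSketch and isWellFormed are hypothetical names.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical helper; not part of Spider. */
final class UrlCheckSketch {
    /** Returns true if java.net.URL accepts the string, false otherwise. */
    static boolean isWellFormed(String urlStr) {
        try {
            // Note: this constructor is deprecated in recent JDKs but still available.
            new URL(urlStr);
            return true;
        } catch (MalformedURLException e) {
            // Typical causes: missing or unknown protocol.
            return false;
        }
    }
}
```

A string such as "some.site.com/whatever/" fails this check because it has no protocol, while "http://some.site.com/whatever/" passes.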

setAuth

public void setAuth(java.lang.String authCookie)
Set the authorization cookie.
Parameters:
authCookie - Syntax is userid:password.
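
For orientation, a userid:password string of this form is what HTTP Basic authentication encodes into its Authorization header. The sketch below shows that encoding with the standard library. Whether Spider sends the cookie in exactly this way is an assumption, and BasicAuthSketch is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical illustration of HTTP Basic encoding; not part of Spider. */
final class BasicAuthSketch {
    /** Builds the value of an "Authorization" header from a userid:password string. */
    static String basicAuthHeaderValue(String authCookie) {
        byte[] raw = authCookie.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(raw);
    }
}
```

Base64 is an encoding, not encryption, so credentials passed this way are only as private as the connection that carries them.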

addObserver

public void addObserver(Acme.HtmlObserver observer)
Add an extra observer to the scanners we make. Multiple observers get called in the order they were added.

Alternatively, if you want to add a different observer to each scanner, you can cast the input stream to a scanner and call its add routine, like so:

 InputStream s = conn.getInputStream();
 Acme.HtmlScanner scanner = (Acme.HtmlScanner) s;
 scanner.addObserver( this );
 
Parameters:
observer - Observer to be added.

doThisUrl

protected boolean doThisUrl(java.lang.String thisUrlStr,
                            int depth,
                            java.lang.String baseUrlStr)
This method can be overridden by a subclass if you want to change the search policy. The default version only does URLs that start with the same string as the base URL. An alternate version might instead go by the search depth.
Parameters:
thisUrlStr - URL to be added.
depth - Distance of this URL from the base URL.
baseUrlStr - Base URL of this URL.
Returns:
true if the URL should be processed.
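
The two policies mentioned above can be sketched as plain predicates with the same parameters as doThisUrl. SearchPolicySketch, its method names, and the maxDepth parameter are hypothetical; in a real subclass, the body of an overriding doThisUrl would return one of these results.

```java
/** Hypothetical policy helpers; not part of Spider. */
final class SearchPolicySketch {
    /** Mirrors the documented default: accept URLs that start with the base URL. */
    static boolean prefixPolicy(String thisUrlStr, int depth, String baseUrlStr) {
        return thisUrlStr.startsWith(baseUrlStr);
    }

    /** Alternate policy: accept any URL within maxDepth of the base URL. */
    static boolean depthPolicy(String thisUrlStr, int depth, String baseUrlStr, int maxDepth) {
        return depth <= maxDepth;
    }
}
```

The depth policy ignores where a URL points, so it can leave the starting site; combining both tests with a logical AND would keep the traversal on the site and shallow.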

brokenLink

protected void brokenLink(java.lang.String fromUrlStr,
                          java.lang.String toUrlStr,
                          java.lang.String errmsg)
This method can be overridden by a subclass if you want to change the broken link policy. The default version reports the broken link on the error stream. An alternate version might attempt to send mail to the owner of the page with the broken link.
Parameters:
fromUrlStr - the URL containing the broken link.
toUrlStr - the broken URL.
errmsg - Message containing further information.
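
One possible alternate policy is to collect broken links per page, so that each page owner can be notified once at the end of the run. The sketch below shows only the bookkeeping. BrokenLinkReport and its methods are hypothetical; an overriding brokenLink would simply forward its three arguments to record.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical collector for broken links; not part of Spider. */
final class BrokenLinkReport {
    // Page URL mapped to the broken links found on that page, in discovery order.
    private final Map<String, List<String>> byPage = new LinkedHashMap<>();

    /** Records one broken link using the same parameters as brokenLink. */
    void record(String fromUrlStr, String toUrlStr, String errmsg) {
        byPage.computeIfAbsent(fromUrlStr, k -> new ArrayList<>())
              .add(toUrlStr + " (" + errmsg + ")");
    }

    /** Returns the broken links recorded for a page, or an empty list if there are none. */
    List<String> brokenLinksOn(String fromUrlStr) {
        return byPage.getOrDefault(fromUrlStr, List.of());
    }
}
```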

reportError

protected void reportError(java.lang.String fromUrlStr,
                           java.lang.String toUrlStr,
                           java.lang.String errmsg)
This method can be overridden by a subclass if you want to change the error reporting policy. The default version reports the error on the error stream. An alternate version might ignore the error.
Parameters:
fromUrlStr - the URL containing the broken URL.
toUrlStr - the broken URL.
errmsg - Message containing further information.

hasMoreElements

public boolean hasMoreElements()
Standard Enumeration method.
Specified by:
hasMoreElements in interface java.util.Enumeration
Returns:
true if there are more elements in the queue.

nextElement

public java.lang.Object nextElement()
Standard Enumeration method.
Specified by:
nextElement in interface java.util.Enumeration
Returns:
the next element of the enumeration, as a URLConnection.

gotAHREF

public void gotAHREF(java.lang.String urlStr,
                     java.net.URL contextUrl,
                     java.lang.Object clientData)
Acme.HtmlObserver callback.
Specified by:
gotAHREF in interface Acme.HtmlObserver
Parameters:
urlStr - URL of the href.
contextUrl - URL the link was found in.
clientData - Information about the URL.

gotIMGSRC

public void gotIMGSRC(java.lang.String urlStr,
                      java.net.URL contextUrl,
                      java.lang.Object clientData)
Acme.HtmlObserver callback.
Specified by:
gotIMGSRC in interface Acme.HtmlObserver
Parameters:
urlStr - URL of the src attribute.
contextUrl - URL the link was found in.
clientData - Information about the URL.

gotFRAMESRC

public void gotFRAMESRC(java.lang.String urlStr,
                        java.net.URL contextUrl,
                        java.lang.Object clientData)
Acme.HtmlObserver callback.
Specified by:
gotFRAMESRC in interface Acme.HtmlObserver
Parameters:
urlStr - URL of the src attribute.
contextUrl - URL the link was found in.
clientData - Information about the URL.

gotBASEHREF

public void gotBASEHREF(java.lang.String urlStr,
                        java.net.URL contextUrl,
                        java.lang.Object clientData)
Acme.HtmlObserver callback.
Specified by:
gotBASEHREF in interface Acme.HtmlObserver
Parameters:
urlStr - URL of the href.
contextUrl - URL the link was found in.
clientData - Information about the URL.

gotAREAHREF

public void gotAREAHREF(java.lang.String urlStr,
                        java.net.URL contextUrl,
                        java.lang.Object clientData)
Acme.HtmlObserver callback.
Specified by:
gotAREAHREF in interface Acme.HtmlObserver
Parameters:
urlStr - URL of the href.
contextUrl - URL the link was found in.
clientData - Information about the URL.

gotLINKHREF

public void gotLINKHREF(java.lang.String urlStr,
                        java.net.URL contextUrl,
                        java.lang.Object clientData)
Acme.HtmlObserver callback.
Specified by:
gotLINKHREF in interface Acme.HtmlObserver
Parameters:
urlStr - URL of the href.
contextUrl - URL the link was found in.
clientData - Information about the URL.

gotBODYBACKGROUND

public void gotBODYBACKGROUND(java.lang.String urlStr,
                              java.net.URL contextUrl,
                              java.lang.Object clientData)
Acme.HtmlObserver callback.
Specified by:
gotBODYBACKGROUND in interface Acme.HtmlObserver
Parameters:
urlStr - URL of the background attribute.
contextUrl - URL the link was found in.
clientData - Information about the URL.

main

public static void main(java.lang.String[] args)
Test program. Shows URLs, file sizes, etc. at the ACME Java site.
Parameters:
args - URL to start Spider on.