java.lang.Object
  |
  +--de.kosi.utiltest.Spider
A web-robot class.
This is an Enumeration class that traverses the web starting at a given URL. It fetches HTML files and parses them for new URLs to look at. All files it encounters, HTML or otherwise, are returned by the nextElement() method as a URLConnection.
The traversal is breadth-first, and by default it is limited to files at or below the starting point - same protocol, hostname, and initial directory.
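The default limit described above (same protocol, hostname, and initial directory) can be sketched as a plain string comparison on URL parts. This is only an illustration: ScopeCheck, directoryOf, and atOrBelow are hypothetical names that do not appear in Spider, and the real check may differ in detail (for example in how it treats ports or a missing trailing slash).

```java
import java.net.URI;

/** Hypothetical helper illustrating the default limit; not part of Spider. */
final class ScopeCheck {

    /** Directory part of a URL path: everything up to and including the last '/'. */
    static String directoryOf(String path) {
        if (path == null || path.isEmpty()) {
            return "/";
        }
        return path.substring(0, path.lastIndexOf('/') + 1);
    }

    /** True if candidate has the same protocol and host as start and lies at or below start's directory. */
    static boolean atOrBelow(String startUrlStr, String candidateUrlStr) {
        URI start = URI.create(startUrlStr);
        URI candidate = URI.create(candidateUrlStr);
        if (start.getScheme() == null || start.getHost() == null) {
            return false;
        }
        String candidatePath = candidate.getPath();
        if (candidatePath == null || candidatePath.isEmpty()) {
            candidatePath = "/";
        }
        return start.getScheme().equalsIgnoreCase(candidate.getScheme())
                && start.getHost().equalsIgnoreCase(candidate.getHost())
                && candidatePath.startsWith(directoryOf(start.getPath()));
    }

    public static void main(String[] args) {
        String start = "http://some.site.com/whatever/";
        System.out.println(atOrBelow(start, "http://some.site.com/whatever/sub/page.html"));
        System.out.println(atOrBelow(start, "http://some.site.com/elsewhere/"));
    }
}
```

A starting point that names a file, such as http://some.site.com/whatever/index.html, is treated here as starting in /whatever/.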
Because of the security restrictions on applets, this is currently only useful from applications.
Sample code:
Enumeration spider = new Acme.Spider( "http://some.site.com/whatever/" );
while ( spider.hasMoreElements() ) {
    URLConnection conn = (URLConnection) spider.nextElement();
    // Then do whatever you like with conn:
    URL thisUrl = conn.getURL();
    String thisUrlStr = thisUrl.toExternalForm();
    String mimeType = conn.getContentType();
    long changed = conn.getLastModified();
    InputStream s = conn.getInputStream();
    // Etc. etc. etc., your code here.
}
There are also a couple of methods you can override via a subclass, to control things like the search limits and what gets done with broken links.
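The overridable methods referred to here are, per the Method Summary, doThisUrl and brokenLink (and reportError). The sketch below shows the override pattern only. HookedCrawler is a hypothetical stand-in that models just these two hooks with the signatures listed on this page; its default policies and the visit driver are assumptions made for illustration, not the behaviour of Spider itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for Spider: only the two overridable hooks are modelled. */
class HookedCrawler {

    /** Search policy. Assumed default for this sketch: accept URLs that start with the base URL. */
    protected boolean doThisUrl(String thisUrlStr, int depth, String baseUrlStr) {
        return thisUrlStr.startsWith(baseUrlStr);
    }

    /** Broken link policy. Assumed default for this sketch: print a message to stderr. */
    protected void brokenLink(String fromUrlStr, String toUrlStr, String errmsg) {
        System.err.println("Broken link in " + fromUrlStr + " to " + toUrlStr + ": " + errmsg);
    }

    /** Simplified driver that consults the hooks; reachability is passed in rather than fetched. */
    final boolean visit(String fromUrlStr, String toUrlStr, int depth, String baseUrlStr, boolean reachable) {
        if (!doThisUrl(toUrlStr, depth, baseUrlStr)) {
            return false;
        }
        if (!reachable) {
            brokenLink(fromUrlStr, toUrlStr, "not reachable (simulated)");
            return false;
        }
        return true;
    }
}

/** Subclass that tightens the search policy and collects broken links instead of printing them. */
final class ShallowCrawler extends HookedCrawler {
    final List<String> broken = new ArrayList<>();
    private final int maxDepth;

    ShallowCrawler(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    @Override
    protected boolean doThisUrl(String thisUrlStr, int depth, String baseUrlStr) {
        // Keep the inherited policy, but also refuse anything deeper than maxDepth.
        return depth <= maxDepth && super.doThisUrl(thisUrlStr, depth, baseUrlStr);
    }

    @Override
    protected void brokenLink(String fromUrlStr, String toUrlStr, String errmsg) {
        broken.add(fromUrlStr + " -> " + toUrlStr);
    }
}
```

With the real class, the same two overrides would go in a subclass of Spider, and the enumeration loop shown in the sample code would stay unchanged.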
Constructor Summary
Spider()
    Constructor with no size limits, and the default error stream.
Spider(int todoLimit, int doneLimit)
    Constructor with size limits.
Spider(int todoLimit, int doneLimit, java.io.PrintStream err)
    Constructor with size limits.
Spider(java.io.PrintStream err)
    Constructor with no size limits.
Spider(java.lang.String urlStr)
    Constructor with a single URL and no size limits, and the default error stream.
Spider(java.lang.String urlStr, java.io.PrintStream err)
    Constructor with a single URL and no size limits.
Method Summary
void addObserver(Acme.HtmlObserver observer)
    Add an extra observer to the scanners we make.
void addUrl(java.lang.String urlStr)
    Add a URL to the to-do list.
protected void brokenLink(java.lang.String fromUrlStr, java.lang.String toUrlStr, java.lang.String errmsg)
    This method can be overridden by a subclass if you want to change the broken link policy.
protected boolean doThisUrl(java.lang.String thisUrlStr, int depth, java.lang.String baseUrlStr)
    This method can be overridden by a subclass if you want to change the search policy.
void gotAHREF(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
    Acme.HtmlObserver callback.
void gotAREAHREF(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
    Acme.HtmlObserver callback.
void gotBASEHREF(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
    Acme.HtmlObserver callback.
void gotBODYBACKGROUND(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
    Acme.HtmlObserver callback.
void gotFRAMESRC(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
    Acme.HtmlObserver callback.
void gotIMGSRC(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
    Acme.HtmlObserver callback.
void gotLINKHREF(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
    Acme.HtmlObserver callback.
boolean hasMoreElements()
    Standard Enumeration method.
static void main(java.lang.String[] args)
    Test program.
java.lang.Object nextElement()
    Standard Enumeration method.
protected void reportError(java.lang.String fromUrlStr, java.lang.String toUrlStr, java.lang.String errmsg)
    This method can be overridden by a subclass if you want to change the error reporting policy.
void setAuth(java.lang.String authCookie)
    Set the authorization cookie.
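The authCookie passed to setAuth uses the syntax userid:password (see Method Detail). That syntax suggests HTTP Basic authentication, although this page does not say how the cookie is actually sent. Under that assumption, the header value could be built as in the sketch below; BasicAuthSketch and basicAuthHeaderValue are hypothetical names, not part of Spider.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper; assumes the cookie is used for HTTP Basic authentication. */
final class BasicAuthSketch {

    /** Builds the value of an HTTP "Authorization" header from a userid:password cookie. */
    static String basicAuthHeaderValue(String authCookie) {
        if (authCookie == null || authCookie.indexOf(':') < 0) {
            throw new IllegalArgumentException("expected userid:password");
        }
        String encoded = Base64.getEncoder()
                .encodeToString(authCookie.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        System.out.println(basicAuthHeaderValue("Aladdin:open sesame"));
    }
}
```

On a java.net.URLConnection, such a value would be applied with conn.setRequestProperty( "Authorization", value ) before the connection is opened.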
Methods inherited from class java.lang.Object
    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Detail

public Spider(java.io.PrintStream err)
Parameters:
    err - the error stream

public Spider()

public Spider(java.lang.String urlStr, java.io.PrintStream err) throws java.net.MalformedURLException
Parameters:
    urlStr - the URL to start off the enumeration
    err - the error stream
Throws:
    java.net.MalformedURLException - if the urlStr does not contain a valid URL.

public Spider(java.lang.String urlStr) throws java.net.MalformedURLException
Parameters:
    urlStr - the URL to start off the enumeration
Throws:
    java.net.MalformedURLException - if the urlStr does not contain a valid URL.

public Spider(int todoLimit, int doneLimit, java.io.PrintStream err)
Suggested values for an effectively unlimited traversal are a todoLimit of 200000 and a doneLimit of 20000. Keep the doneLimit fairly small: the hash table of URLs already done is checked for every URL, so it stays mostly in memory. The to-do queue, on the other hand, is only accessed at the front and back, so most of it can be paged out.
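The two structures described here, a to-do queue touched only at its ends and a done table consulted for every URL, can be sketched as a breadth-first traversal over an in-memory link map. BoundedBfs and traverse are hypothetical names. What happens when a limit is reached is also an assumption of this sketch (new URLs are simply dropped); this page does not state what Spider does in that case.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a breadth-first traversal with a bounded queue and a bounded done set. */
final class BoundedBfs {

    /** Returns URLs in the order visited; links maps each URL to the URLs it refers to. */
    static List<String> traverse(String start, Map<String, List<String>> links,
                                 int todoLimit, int doneLimit) {
        Deque<String> todo = new ArrayDeque<>();  // accessed only at front and back
        Set<String> done = new HashSet<>();       // checked for every URL encountered
        List<String> order = new ArrayList<>();

        todo.addLast(start);
        done.add(start);
        while (!todo.isEmpty()) {
            String url = todo.removeFirst();
            order.add(url);
            for (String next : links.getOrDefault(url, List.of())) {
                // Assumed policy: drop a URL that is already known or that would exceed a limit.
                // Queueing only URLs that fit in the done set guarantees termination on cycles.
                if (done.contains(next) || todo.size() >= todoLimit || done.size() >= doneLimit) {
                    continue;
                }
                done.add(next);
                todo.addLast(next);
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("d", "a"),
                "c", List.of("d"));
        System.out.println(traverse("a", links, 200000, 20000));
    }
}
```

Every URL considered costs one lookup in done, while todo is only appended to and popped from, which matches the memory access pattern described above.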
Parameters:
    todoLimit - maximum number of URLs to queue for examination
    doneLimit - maximum number of URLs to remember having done already
    err - the error stream

public Spider(int todoLimit, int doneLimit)
Parameters:
    todoLimit - maximum number of URLs to queue for examination
    doneLimit - maximum number of URLs to remember having done already

Method Detail

public void addUrl(java.lang.String urlStr) throws java.net.MalformedURLException
Parameters:
    urlStr - URL to be added.
Throws:
    java.net.MalformedURLException - if the urlStr does not contain a valid URL.

public void setAuth(java.lang.String authCookie)
Parameters:
    authCookie - Syntax is userid:password.

public void addObserver(Acme.HtmlObserver observer)
Alternatively, if you want to add a different observer to each scanner, you can cast the input stream to a scanner and call its add routine, like so:
InputStream s = conn.getInputStream();
Acme.HtmlScanner scanner = (Acme.HtmlScanner) s;
scanner.addObserver( this );
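The cast above works only if the InputStream returned for an HTML file is itself a scanner object. A minimal sketch of that idea follows: a stream subclass that reports links to its observers as the data is read. LinkObserver, LinkScanner, and gotHref are hypothetical stand-ins; the actual Acme.HtmlScanner and Acme.HtmlObserver are not shown on this page and may work quite differently.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical observer interface, loosely modelled on the got...HREF callbacks. */
interface LinkObserver {
    void gotHref(String urlStr);
}

/** Hypothetical scanner: an InputStream that reports href="..." values as data passes through. */
final class LinkScanner extends FilterInputStream {
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    private final List<LinkObserver> observers = new ArrayList<>();
    private final StringBuilder seen = new StringBuilder();  // kept whole for simplicity
    private int scannedUpTo = 0;                              // end of the last reported match

    LinkScanner(InputStream in) {
        super(in);
    }

    void addObserver(LinkObserver observer) {
        observers.add(observer);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            seen.append((char) b);
            scan();
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            seen.append(new String(buf, off, n, StandardCharsets.ISO_8859_1));
            scan();
        }
        return n;
    }

    /** Reports every complete href found after the last reported one. */
    private void scan() {
        Matcher m = HREF.matcher(seen);
        while (m.find(scannedUpTo)) {
            scannedUpTo = m.end();
            for (LinkObserver o : observers) {
                o.gotHref(m.group(1));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<a href=\"a.html\">A</a>".getBytes(StandardCharsets.ISO_8859_1);
        InputStream s = new LinkScanner(new ByteArrayInputStream(html));
        LinkScanner scanner = (LinkScanner) s;  // same cast as in the text above
        scanner.addObserver(url -> System.out.println("found " + url));
        s.readAllBytes();
    }
}
```

Because observers are notified during read calls, an observer added this way only sees links in data read after it was added.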
Parameters:
    observer - Observer to be added.

protected boolean doThisUrl(java.lang.String thisUrlStr, int depth, java.lang.String baseUrlStr)
Parameters:
    thisUrlStr - URL to be added.
    depth - Distance to the base URL.
    baseUrlStr - Base URL of this URL.

protected void brokenLink(java.lang.String fromUrlStr, java.lang.String toUrlStr, java.lang.String errmsg)
Parameters:
    fromUrlStr - the URL containing the broken link.
    toUrlStr - the broken URL.
    errmsg - Message containing further information.

protected void reportError(java.lang.String fromUrlStr, java.lang.String toUrlStr, java.lang.String errmsg)
Parameters:
    fromUrlStr - the URL containing the broken URL.
    toUrlStr - the broken URL.
    errmsg - Message containing further information.

public boolean hasMoreElements()
Specified by:
    hasMoreElements in interface java.util.Enumeration

public java.lang.Object nextElement()
Specified by:
    nextElement in interface java.util.Enumeration

public void gotAHREF(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
Specified by:
    gotAHREF in interface Acme.HtmlObserver
Parameters:
    urlStr - URL of the href.
    contextUrl - URL the link was found in.
    clientData - Information about the URL.

public void gotIMGSRC(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
Specified by:
    gotIMGSRC in interface Acme.HtmlObserver
Parameters:
    urlStr - URL of the href.
    contextUrl - URL the link was found in.
    clientData - Information about the URL.

public void gotFRAMESRC(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
Specified by:
    gotFRAMESRC in interface Acme.HtmlObserver
Parameters:
    urlStr - URL of the href.
    contextUrl - URL the link was found in.
    clientData - Information about the URL.

public void gotBASEHREF(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
Specified by:
    gotBASEHREF in interface Acme.HtmlObserver
Parameters:
    urlStr - URL of the href.
    contextUrl - URL the link was found in.
    clientData - Information about the URL.

public void gotAREAHREF(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
Specified by:
    gotAREAHREF in interface Acme.HtmlObserver
Parameters:
    urlStr - URL of the href.
    contextUrl - URL the link was found in.
    clientData - Information about the URL.

public void gotLINKHREF(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
Specified by:
    gotLINKHREF in interface Acme.HtmlObserver
Parameters:
    urlStr - URL of the href.
    contextUrl - URL the link was found in.
    clientData - Information about the URL.

public void gotBODYBACKGROUND(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
Specified by:
    gotBODYBACKGROUND in interface Acme.HtmlObserver
Parameters:
    urlStr - URL of the href.
    contextUrl - URL the link was found in.
    clientData - Information about the URL.

public static void main(java.lang.String[] args)
Parameters:
    args - URL to start Spider on.