java.lang.Object
  |
  +--de.kosi.utiltest.Spider
A web-robot class.
This is an Enumeration class that traverses the web starting at a given URL. It fetches HTML files and parses them for new URLs to look at. All files it encounters, HTML or otherwise, are returned by the nextElement() method as a URLConnection.
The traversal is breadth-first, and by default it is limited to files at or below the starting point - same protocol, hostname, and initial directory.
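The default limit described above (same protocol, hostname, and initial directory) can be illustrated with a small stand-alone check. This is only a sketch of the rule as stated, not the Spider's actual code: the class name ScopeSketch and both method names are hypothetical, and java.net.URI is used for parsing purely for convenience.

```java
import java.net.URI;

final class ScopeSketch {
    /** Returns the directory part of a URI path: everything up to and including the last '/'. */
    static String directoryOf(URI uri) {
        String path = uri.getPath();
        int slash = path.lastIndexOf('/');
        return slash < 0 ? "/" : path.substring(0, slash + 1);
    }

    /** True if candidate shares start's protocol and host and lies at or below start's directory. */
    static boolean isWithinScope(String startUrlStr, String candidateUrlStr) {
        URI start = URI.create(startUrlStr);
        URI candidate = URI.create(candidateUrlStr);
        return start.getScheme().equalsIgnoreCase(candidate.getScheme())
            && start.getHost().equalsIgnoreCase(candidate.getHost())
            && candidate.getPath().startsWith(directoryOf(start));
    }

    public static void main(String[] args) {
        String start = "http://some.site.com/whatever/";
        // Below the starting directory: in scope.
        System.out.println(isWithinScope(start, "http://some.site.com/whatever/sub/page.html"));
        // Same host, different directory: out of scope.
        System.out.println(isWithinScope(start, "http://some.site.com/elsewhere/"));
        // Different protocol: out of scope.
        System.out.println(isWithinScope(start, "https://some.site.com/whatever/"));
    }
}
```

A subclass that wants a wider or narrower limit would presumably express it through doThisUrl instead of a helper like this one.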
Because of the security restrictions on applets, this is currently only useful from applications.
Sample code:
    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.Enumeration;

    Enumeration spider = new Acme.Spider( "http://some.site.com/whatever/" );
    while ( spider.hasMoreElements() )
        {
        URLConnection conn = (URLConnection) spider.nextElement();
        // Then do whatever you like with conn:
        URL thisUrl = conn.getURL();
        String thisUrlStr = thisUrl.toExternalForm();
        String mimeType = conn.getContentType();
        long changed = conn.getLastModified();
        InputStream s = conn.getInputStream();
        // Etc. etc. etc., your code here.
        }
There are also a couple of methods you can override via a subclass, to
control things like the search limits and what gets done with broken links.
Constructor Summary

Spider()
    Constructor with no size limits, and the default error stream.
Spider(int todoLimit, int doneLimit)
    Constructor with size limits.
Spider(int todoLimit, int doneLimit, java.io.PrintStream err)
    Constructor with size limits.
Spider(java.io.PrintStream err)
    Constructor with no size limits.
Spider(java.lang.String urlStr)
    Constructor with a single URL and no size limits, and the default error stream.
Spider(java.lang.String urlStr, java.io.PrintStream err)
    Constructor with a single URL and no size limits.

Method Summary

void addObserver(Acme.HtmlObserver observer)
    Add an extra observer to the scanners we make.
void addUrl(java.lang.String urlStr)
    Add a URL to the to-do list.
protected void brokenLink(java.lang.String fromUrlStr, java.lang.String toUrlStr, java.lang.String errmsg)
    This method can be overridden by a subclass if you want to change the broken link policy.
protected boolean doThisUrl(java.lang.String thisUrlStr, int depth, java.lang.String baseUrlStr)
    This method can be overridden by a subclass if you want to change the search policy.
void gotAHREF(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
    Acme.HtmlObserver callback.
void gotAREAHREF(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
    Acme.HtmlObserver callback.
void gotBASEHREF(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
    Acme.HtmlObserver callback.
void gotBODYBACKGROUND(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
    Acme.HtmlObserver callback.
void gotFRAMESRC(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
    Acme.HtmlObserver callback.
void gotIMGSRC(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
    Acme.HtmlObserver callback.
void gotLINKHREF(java.lang.String urlStr, java.net.URL contextUrl, java.lang.Object clientData)
    Acme.HtmlObserver callback.
boolean hasMoreElements()
    Standard Enumeration method.
static void main(java.lang.String[] args)
    Test program.
java.lang.Object nextElement()
    Standard Enumeration method.
protected void reportError(java.lang.String fromUrlStr, java.lang.String toUrlStr, java.lang.String errmsg)
    This method can be overridden by a subclass if you want to change the error reporting policy.
void setAuth(java.lang.String authCookie)
    Set the authorization cookie.

Methods inherited from class java.lang.Object
    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail

public Spider(java.io.PrintStream err)
    Constructor with no size limits.
    Parameters:
        err - the error stream
public Spider()
    Constructor with no size limits, and the default error stream.
public Spider(java.lang.String urlStr,
              java.io.PrintStream err)
       throws java.net.MalformedURLException
    Constructor with a single URL and no size limits.
    Parameters:
        urlStr - the URL to start off the enumeration
        err - the error stream
    Throws:
        java.net.MalformedURLException - if urlStr does not contain a valid URL.
public Spider(java.lang.String urlStr)
       throws java.net.MalformedURLException
    Constructor with a single URL and no size limits, and the default error stream.
    Parameters:
        urlStr - the URL to start off the enumeration
    Throws:
        java.net.MalformedURLException - if urlStr does not contain a valid URL.
public Spider(int todoLimit,
              int doneLimit,
              java.io.PrintStream err)
    Constructor with size limits. Guesses at good values for an unlimited traversal: 200000 for todoLimit and 20000 for doneLimit. You want the doneLimit fairly small because the hash table of finished URLs is checked for every URL, so it stays mostly in memory. The to-do queue, on the other hand, is accessed only at the front and back, so it will be mostly paged out.
    Parameters:
        todoLimit - maximum number of URLs to queue for examination
        doneLimit - maximum number of URLs to remember having done already
        err - the error stream
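To make the roles of the two limits concrete, here is a minimal breadth-first sketch with a bounded to-do queue and a bounded set of remembered URLs. It is an illustration under assumptions, not the Spider's implementation: the class BoundedFrontierSketch, its methods, and the in-memory link map that stands in for fetched pages are all hypothetical, and how the real class behaves once a limit is hit is not documented here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

final class BoundedFrontierSketch {
    private final Queue<String> todo = new ArrayDeque<>(); // touched only at front and back
    private final Set<String> done = new HashSet<>();      // consulted for every URL
    private final int todoLimit;
    private final int doneLimit;

    BoundedFrontierSketch(int todoLimit, int doneLimit) {
        this.todoLimit = todoLimit;
        this.doneLimit = doneLimit;
    }

    /** Queues a URL unless it is already remembered or a limit is reached. */
    boolean offer(String url) {
        if (done.contains(url)) return false;
        if (todo.size() >= todoLimit || done.size() >= doneLimit) return false;
        done.add(url);
        todo.add(url);
        return true;
    }

    /** Returns the next URL in breadth-first order, or null when the queue is empty. */
    String poll() {
        return todo.poll();
    }

    /** Walks a link graph breadth-first; links maps each URL to the URLs it points at. */
    static List<String> crawl(Map<String, List<String>> links, String start,
                              int todoLimit, int doneLimit) {
        BoundedFrontierSketch frontier = new BoundedFrontierSketch(todoLimit, doneLimit);
        List<String> visited = new ArrayList<>();
        frontier.offer(start);
        for (String url = frontier.poll(); url != null; url = frontier.poll()) {
            visited.add(url);
            for (String next : links.getOrDefault(url, List.of())) {
                frontier.offer(next);
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "/a", List.of("/b", "/c"),
            "/b", List.of("/d", "/a"),
            "/c", List.of("/d"));
        System.out.println(crawl(links, "/a", 200000, 20000)); // [/a, /b, /c, /d]
        System.out.println(crawl(links, "/a", 200000, 2));     // [/a, /b]
    }
}
```

The second call shows the effect of a small doneLimit: once two URLs are remembered, further links are dropped instead of queued.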
public Spider(int todoLimit,
              int doneLimit)
    Constructor with size limits.
    Parameters:
        todoLimit - maximum number of URLs to queue for examination
        doneLimit - maximum number of URLs to remember having done already

Method Detail

public void addUrl(java.lang.String urlStr)
       throws java.net.MalformedURLException
    Add a URL to the to-do list.
    Parameters:
        urlStr - URL to be added.
    Throws:
        java.net.MalformedURLException - if urlStr does not contain a valid URL.
public void setAuth(java.lang.String authCookie)
    Set the authorization cookie.
    Parameters:
        authCookie - Syntax is userid:password.
public void addObserver(Acme.HtmlObserver observer)
    Add an extra observer to the scanners we make. Alternatively, if you want to add a different observer to each scanner, you can cast the input stream to a scanner and call its add routine, like so:
        InputStream s = conn.getInputStream();
        Acme.HtmlScanner scanner = (Acme.HtmlScanner) s;
        scanner.addObserver( this );
    Parameters:
        observer - Observer to be added.
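Regarding setAuth: the userid:password syntax of authCookie has the same shape as the credentials used by HTTP Basic authentication. Assuming, and it is only an assumption, that the Spider sends the cookie that way, the header value would be built roughly as below. The class BasicAuthSketch and its method are hypothetical; Base64 and StandardCharsets are standard JDK classes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class BasicAuthSketch {
    /** Builds an HTTP Basic Authorization header value from a userid:password string. */
    static String basicAuthHeaderValue(String authCookie) {
        byte[] raw = authCookie.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        // The well-known example credentials from the HTTP Basic authentication RFCs.
        System.out.println(basicAuthHeaderValue("Aladdin:open sesame"));
        // prints "Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="
    }
}
```

With a plain URLConnection, such a value would be attached via conn.setRequestProperty( "Authorization", value ) before the connection is opened.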
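protected boolean doThisUrl(java.lang.String thisUrlStr,
                            int depth,
                            java.lang.String baseUrlStr)
    This method can be overridden by a subclass if you want to change the search policy.
    Parameters:
        thisUrlStr - URL to be added.
        depth - Distance to the base URL.
        baseUrlStr - Base URL of this URL.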
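As an illustration of a search policy with this parameter list, the stand-alone method below accepts a URL only if it is within a fixed depth and starts with the base URL. It is a sketch, not an override of the real class: DepthPolicySketch and MAX_DEPTH are hypothetical, and the reading of depth as a link count from the base URL is an assumption drawn from the parameter description.

```java
final class DepthPolicySketch {
    /** Hypothetical cap on how far from the base URL the traversal may go. */
    static final int MAX_DEPTH = 3;

    /** Mirrors the doThisUrl parameter list: true means the URL should be visited. */
    static boolean doThisUrl(String thisUrlStr, int depth, String baseUrlStr) {
        return depth <= MAX_DEPTH && thisUrlStr.startsWith(baseUrlStr);
    }

    public static void main(String[] args) {
        String base = "http://some.site.com/whatever/";
        System.out.println(doThisUrl(base + "a.html", 1, base));              // true
        System.out.println(doThisUrl(base + "deep/page.html", 4, base));      // false
        System.out.println(doThisUrl("http://other.site.com/x", 1, base));    // false
    }
}
```

In a real subclass the same body would go into the protected doThisUrl override.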
protected void brokenLink(java.lang.String fromUrlStr,
                          java.lang.String toUrlStr,
                          java.lang.String errmsg)
    This method can be overridden by a subclass if you want to change the broken link policy.
    Parameters:
        fromUrlStr - the URL containing the broken link.
        toUrlStr - the broken URL.
        errmsg - Message containing further information.
protected void reportError(java.lang.String fromUrlStr,
                           java.lang.String toUrlStr,
                           java.lang.String errmsg)
    This method can be overridden by a subclass if you want to change the error reporting policy.
    Parameters:
        fromUrlStr - the URL containing the broken URL.
        toUrlStr - the broken URL.
        errmsg - Message containing further information.
public boolean hasMoreElements()
    Standard Enumeration method.
    Specified by:
        hasMoreElements in interface java.util.Enumeration
public java.lang.Object nextElement()
    Standard Enumeration method.
    Specified by:
        nextElement in interface java.util.Enumeration
public void gotAHREF(java.lang.String urlStr,
                     java.net.URL contextUrl,
                     java.lang.Object clientData)
    Acme.HtmlObserver callback.
    Specified by:
        gotAHREF in interface Acme.HtmlObserver
    Parameters:
        urlStr - URL of the href.
        contextUrl - URL the link was found in.
        clientData - Information about the URL.
public void gotIMGSRC(java.lang.String urlStr,
                      java.net.URL contextUrl,
                      java.lang.Object clientData)
    Acme.HtmlObserver callback.
    Specified by:
        gotIMGSRC in interface Acme.HtmlObserver
    Parameters:
        urlStr - URL of the href.
        contextUrl - URL the link was found in.
        clientData - Information about the URL.
public void gotFRAMESRC(java.lang.String urlStr,
                        java.net.URL contextUrl,
                        java.lang.Object clientData)
    Acme.HtmlObserver callback.
    Specified by:
        gotFRAMESRC in interface Acme.HtmlObserver
    Parameters:
        urlStr - URL of the href.
        contextUrl - URL the link was found in.
        clientData - Information about the URL.
public void gotBASEHREF(java.lang.String urlStr,
                        java.net.URL contextUrl,
                        java.lang.Object clientData)
    Acme.HtmlObserver callback.
    Specified by:
        gotBASEHREF in interface Acme.HtmlObserver
    Parameters:
        urlStr - URL of the href.
        contextUrl - URL the link was found in.
        clientData - Information about the URL.
public void gotAREAHREF(java.lang.String urlStr,
                        java.net.URL contextUrl,
                        java.lang.Object clientData)
    Acme.HtmlObserver callback.
    Specified by:
        gotAREAHREF in interface Acme.HtmlObserver
    Parameters:
        urlStr - URL of the href.
        contextUrl - URL the link was found in.
        clientData - Information about the URL.
public void gotLINKHREF(java.lang.String urlStr,
                        java.net.URL contextUrl,
                        java.lang.Object clientData)
    Acme.HtmlObserver callback.
    Specified by:
        gotLINKHREF in interface Acme.HtmlObserver
    Parameters:
        urlStr - URL of the href.
        contextUrl - URL the link was found in.
        clientData - Information about the URL.
public void gotBODYBACKGROUND(java.lang.String urlStr,
                              java.net.URL contextUrl,
                              java.lang.Object clientData)
    Acme.HtmlObserver callback.
    Specified by:
        gotBODYBACKGROUND in interface Acme.HtmlObserver
    Parameters:
        urlStr - URL of the href.
        contextUrl - URL the link was found in.
        clientData - Information about the URL.
public static void main(java.lang.String[] args)
    Test program.
    Parameters:
        args - URL to start Spider on.