org.htmlparser.parserapplications
Class SiteCapturer
public class SiteCapturer
Save a web site locally.
Illustrative program to save a web site's contents locally.
It was created to demonstrate URL rewriting in its simplest form.
It uses customized tags in the NodeFactory to alter the URLs.
This program has a number of limitations:
- it doesn't capture forms, since this would involve too many assumptions
- it doesn't capture script references, so funky onMouseOver and other
non-static content will not be faithfully reproduced
- it doesn't handle style sheets
- it doesn't dig into attributes that might reference resources, so
for example, background images won't necessarily be captured
- worst of all, it gets confused when a URL both has content and is
the prefix for other content,
e.g. when http://whatever.com/top and http://whatever.com/top/sub.html both
yield content, since this cannot be faithfully replicated in a static
directory structure (this happens a lot with servlet-based sites)
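The last limitation can be sketched in a few lines. The code below is not part of SiteCapturer; the class and method names are hypothetical. It assumes each URL would be saved as a file whose path mirrors the URL, and reports every URL that would have to be both a file and a directory.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the HTML Parser library.
class PrefixConflictSketch {
    /**
     * Returns the URLs that yield content themselves and are also the
     * directory prefix of another URL in the list.
     */
    static List<String> conflicts(List<String> urls) {
        List<String> result = new ArrayList<>();
        for (String candidate : urls) {
            if (candidate.endsWith("/")) {
                continue; // already a directory-style URL
            }
            String asDirectory = candidate + "/";
            for (String other : urls) {
                if (other.startsWith(asDirectory)) {
                    result.add(candidate); // needed as both a file and a directory
                    break;
                }
            }
        }
        return result;
    }
}
```

With the two URLs from the example above, only http://whatever.com/top is reported.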
protected int | TRANSFER_SIZE - Copy buffer size.
protected boolean | mCaptureResources - If true, save resources locally too; otherwise, leave resource links pointing to the original page.
protected HashSet | mCopied - The set of resources already copied.
protected NodeFilter | mFilter - The filter to apply to the nodes retrieved.
protected HashSet | mFinished - The set of pages already captured.
protected ArrayList | mImages - The list of resources to copy.
protected ArrayList | mPages - The list of pages to capture.
protected Parser | mParser - The parser to use for processing.
protected String | mSource - The web site to capture.
protected String | mTarget - The local directory to capture to.
void | capture() - Perform the capture.
protected void | copy() - Copy a resource (image) locally.
protected String | decode(String raw) - Unescape a URL to form a file name.
boolean | getCaptureResources() - Getter for property captureResources.
NodeFilter | getFilter() - Getter for property filter.
String | getSource() - Getter for property source.
String | getTarget() - Getter for property target.
protected boolean | isHtml(String link) - Returns true if the link contains text/html content.
protected boolean | isToBeCaptured(String link) - Returns true if the link is one we are interested in.
static void | main(String[] args) - Mainline to capture a web site locally.
protected String | makeLocalLink(String link, String current) - Converts a link to local.
protected void | process(NodeFilter filter) - Process a single page.
void | setCaptureResources(boolean capture) - Setter for property captureResources.
void | setFilter(NodeFilter filter) - Setter for property filter.
void | setSource(String source) - Setter for property source.
void | setTarget(String target) - Setter for property target.
TRANSFER_SIZE
protected final int TRANSFER_SIZE
Copy buffer size.
Resources are moved to disk in chunks this size or less.
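A chunked copy of this kind might look like the following sketch. It is an illustration, not the class's actual code: the buffer size of 4096 is an assumption (this page does not state the value), and the class name is hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper, not part of the HTML Parser library.
class ChunkedCopySketch {
    /** Assumed value; the real constant's value is not given on this page. */
    static final int TRANSFER_SIZE = 4096;

    /** Copies the stream in chunks of TRANSFER_SIZE bytes or less. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[TRANSFER_SIZE];
        long total = 0;
        int read;
        // read() returns the number of bytes placed in the buffer, or -1 at end of stream
        while ((read = in.read(buffer, 0, TRANSFER_SIZE)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }
}
```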
mCaptureResources
protected boolean mCaptureResources
If true, save resources locally too;
otherwise, leave resource links pointing to the original page.
mCopied
protected HashSet mCopied
The set of resources already copied.
Used to avoid repeated acquisition of the same images and other resources.
mFilter
protected NodeFilter mFilter
The filter to apply to the nodes retrieved.
mFinished
protected HashSet mFinished
The set of pages already captured.
Used to avoid repeated acquisition of the same page.
mImages
protected ArrayList mImages
The list of resources to copy.
Images and other resources are added to this list as they are discovered.
mPages
protected ArrayList mPages
The list of pages to capture.
Links are added to this list as they are discovered, and removed in
sequential order (FIFO queue) leading to a breadth
first traversal of the web site space.
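The interplay of the page queue and the finished set can be sketched as follows. This is a simplified illustration, not the class's code: the links map is a hypothetical stand-in for fetching and parsing a page, and the class name is invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the HTML Parser library.
class BreadthFirstSketch {
    /**
     * Visits pages breadth first. The links map stands in for parsing:
     * it maps a page URL to the links found on that page.
     */
    static List<String> crawl(String start, Map<String, List<String>> links) {
        List<String> pages = new ArrayList<>();  // plays the role of mPages
        Set<String> finished = new HashSet<>();  // plays the role of mFinished
        List<String> order = new ArrayList<>();  // visit order, for illustration only
        pages.add(start);
        while (!pages.isEmpty()) {
            String page = pages.remove(0);       // take from the front: FIFO
            if (!finished.add(page)) {
                continue;                        // already captured
            }
            order.add(page);
            List<String> found = links.getOrDefault(page, Collections.<String>emptyList());
            for (String link : found) {
                if (!finished.contains(link) && !pages.contains(link)) {
                    pages.add(link);             // append newly discovered links
                }
            }
        }
        return order;
    }
}
```

Because links are appended at the end and removed from the front, every page at one link depth is visited before any page at the next depth.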
mParser
protected Parser mParser
The parser to use for processing.
mSource
protected String mSource
The web site to capture.
This is used as the base URL in deciding whether to adjust a link
and whether to capture a page or not.
mTarget
protected String mTarget
The local directory to capture to.
This is used as a base prefix for files saved locally.
SiteCapturer
public SiteCapturer()
Create a web site capturer.
capture
public void capture()
Perform the capture.
copy
protected void copy()
Copy a resource (image) locally.
Removes one element from the 'to be copied' list and saves the
resource it points to locally as a file.
decode
protected String decode(String raw)
Unescape a URL to form a file name.
Very crude.
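One possible way to unescape a URL is shown below. It is an assumption for illustration only: this page does not say how decode works, and the sketch swaps in the standard java.net.URLDecoder, which also turns '+' into a space. The class name is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper, not part of the HTML Parser library.
class DecodeSketch {
    /** Unescapes %XX sequences; returns the input unchanged if it is malformed. */
    static String decode(String raw) {
        try {
            return URLDecoder.decode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required of every Java platform
            throw new IllegalStateException(e);
        } catch (IllegalArgumentException e) {
            return raw; // malformed escape sequence: keep the raw text
        }
    }
}
```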
getCaptureResources
public boolean getCaptureResources()
Getter for property captureResources.
If true, the images and other resources referenced by
the site and within the base URL tree are also copied locally to the
target directory. If false, the image links are left 'as
is', still referring to the original site.
- Value of property captureResources.
getFilter
public NodeFilter getFilter()
Getter for property filter.
- Value of property filter.
getSource
public String getSource()
Getter for property source.
- Value of property source.
getTarget
public String getTarget()
Getter for property target.
- Value of property target.
isHtml
protected boolean isHtml(String link)
throws ParserException
Returns true if the link contains text/html content.
link - The URL to check for content type.
- true if the HTTP header indicates the type is "text/html".
isToBeCaptured
protected boolean isToBeCaptured(String link)
Returns true if the link is one we are interested in.
link - The link to be checked.
- true if the link has the source URL as a prefix
and doesn't contain '?' or '#'; the former because we won't be able to
handle server-side queries in the static target directory structure, and
the latter because presumably the full page with that reference has
already been captured. This performs a case-insensitive
comparison, which is cheating really, but it's cheap.
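The test described above can be sketched as a standalone function. This is an illustration of the stated rule, not the class's code; the class name is hypothetical and the source URL is passed as a parameter instead of being read from mSource.

```java
import java.util.Locale;

// Hypothetical helper, not part of the HTML Parser library.
class CaptureFilterSketch {
    /**
     * Returns true if link starts with source (ignoring case) and
     * contains neither a query ('?') nor a fragment ('#').
     */
    static boolean isToBeCaptured(String source, String link) {
        String lowerSource = source.toLowerCase(Locale.ROOT);
        String lowerLink = link.toLowerCase(Locale.ROOT);
        return lowerLink.startsWith(lowerSource)
            && lowerLink.indexOf('?') == -1
            && lowerLink.indexOf('#') == -1;
    }
}
```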
main
public static void main(String[] args)
throws MalformedURLException,
IOException
Mainline to capture a web site locally.
args - The command line arguments.
There are three arguments: the web site to capture, the local directory
to save it to, and a flag (true or false) to indicate whether resources
such as images and video are to be captured as well.
These are requested via dialog boxes if not supplied.
makeLocalLink
protected String makeLocalLink(String link,
String current)
Converts a link to local.
A relative link can be used to construct both a URL and a file name.
Basically, the operation is to strip off the base url, if any,
and then prepend as many dot-dots as necessary to make
it relative to the current page.
A bit of a kludge handles the root page specially by calling it
index.html, even though that probably isn't its real file name.
This isn't pretty, but it works for me.
link - The link to make relative.
current - The current page URL, or empty if it's an absolute URL
that needs to be converted.
- The URL relative to the current page.
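The operation described above can be sketched as follows. This is a simplified illustration, not the class's implementation: the class name is hypothetical, the source URL is passed as a parameter, and links outside the source tree are returned unchanged.

```java
// Hypothetical helper, not part of the HTML Parser library.
class LocalLinkSketch {
    /**
     * Strips the base URL from link and prepends one "../" for each
     * directory level of the current page below the base URL.
     */
    static String makeLocalLink(String source, String link, String current) {
        String base = source.endsWith("/") ? source : source + "/";
        String local;
        if (link.equals(source) || link.equals(base)) {
            local = "index.html";                  // the root page is handled specially
        } else if (link.startsWith(base)) {
            local = link.substring(base.length()); // strip off the base URL
        } else {
            return link;                           // outside the captured site: leave as is
        }
        if (current != null && current.startsWith(base)) {
            String path = current.substring(base.length());
            StringBuilder prefix = new StringBuilder();
            for (int i = 0; i < path.length(); i++) {
                if (path.charAt(i) == '/') {
                    prefix.append("../");          // one level up per directory
                }
            }
            local = prefix + local;
        }
        return local;
    }
}
```

For example, with the source http://whatever.com/top, a link to http://whatever.com/top/img/logo.gif found on the page http://whatever.com/top/docs/api/index.html becomes ../../img/logo.gif.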
process
protected void process(NodeFilter filter)
throws ParserException
Process a single page.
filter - The filter to apply to the collected nodes.
setCaptureResources
public void setCaptureResources(boolean capture)
Setter for property captureResources.
capture - New value of property captureResources.
setFilter
public void setFilter(NodeFilter filter)
Setter for property filter.
filter - New value of property filter.
setSource
public void setSource(String source)
Setter for property source.
This is the base URL to capture. URLs that don't start with this prefix
are ignored (left as is), while the ones with this URL as a base are
re-homed to the local target.
source - New value of property source.
setTarget
public void setTarget(String target)
Setter for property target.
This is the local directory under which to save the site's pages.
target - New value of property target.
© 2005 Derrick Oswald. May 08, 2008.
HTML Parser is an open source library released under the LGPL.