org.htmlparser.parserapplications

Class StringExtractor


public class StringExtractor
extends Object

Extracts plain-text strings from a web page. This is an illustrative program that gathers the textual contents of a web page, using a StringBean to accumulate the user-visible text (what a browser would display) into a single string.

Constructor Summary

StringExtractor(String resource)
Construct a StringExtractor to read from the given resource.

Method Summary

String extractStrings(boolean links)
Extract the text from a page.
static void main(String[] args)
Mainline: the command-line entry point.

Constructor Details

StringExtractor

public StringExtractor(String resource)
Construct a StringExtractor to read from the given resource.
Parameters:
resource - Either a URL or a file name.

Method Details

extractStrings

public String extractStrings(boolean links)
            throws ParserException
Extract the text from a page.
Parameters:
links - If true, include hyperlinks in the output.
Returns:
The textual contents of the page.
Throws:
ParserException - If a parse error occurs.
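A minimal usage sketch, assuming the HTML Parser jar is on the classpath. The class name StringExtractorDemo, the helper extractFromHtml, and the sample markup are hypothetical and exist only for illustration. The sketch relies on the constructor accepting a file name as well as a URL, so it writes a small page to a temporary .html file instead of touching the network. The exact whitespace and link formatting of the returned string are not specified on this page, so no expected output is shown.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.htmlparser.parserapplications.StringExtractor;
import org.htmlparser.util.ParserException;

class StringExtractorDemo {

    // Hypothetical helper: saves the markup to a temporary .html file and
    // runs StringExtractor on that file name.
    static String extractFromHtml(String html, boolean links)
            throws IOException, ParserException {
        Path page = Files.createTempFile("string-extractor-demo", ".html");
        page.toFile().deleteOnExit();
        Files.write(page, html.getBytes(StandardCharsets.US_ASCII));

        // The resource may be either a URL or a file name.
        StringExtractor extractor = new StringExtractor(page.toString());
        return extractor.extractStrings(links);
    }

    public static void main(String[] args) throws IOException, ParserException {
        String html = "<html><body><h1>Title</h1>"
                + "<p>See <a href=\"http://example.com/\">the site</a>.</p>"
                + "</body></html>";

        // Text only, as a browser would display it.
        System.out.println(extractFromHtml(html, false));

        // Same page, with hyperlinks included in the output.
        System.out.println(extractFromHtml(html, true));
    }
}
```

ParserException is a checked exception, so callers must either catch it or declare it, as main does here.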

main

public static void main(String[] args)
Mainline: the command-line entry point.
Parameters:
args - The command line arguments.
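This page does not document the format of the command line arguments. The sketch below shows one way a mainline like this could interpret them, under the assumption of an optional -links flag plus a single URL or file name. The class ExtractorArgs and its fields are hypothetical and are not part of the library.

```java
// Hypothetical holder for parsed command line options; not part of HTML Parser.
class ExtractorArgs {
    final boolean links;    // true if hyperlinks should be included
    final String resource;  // URL or file name, or null if none was given

    private ExtractorArgs(boolean links, String resource) {
        this.links = links;
        this.resource = resource;
    }

    // Assumed convention: "-links" (case-insensitive) turns link output on;
    // any other argument is taken as the resource, and the last one wins.
    static ExtractorArgs parse(String[] args) {
        boolean links = false;
        String resource = null;
        for (String arg : args) {
            if (arg.equalsIgnoreCase("-links")) {
                links = true;
            } else {
                resource = arg;
            }
        }
        return new ExtractorArgs(links, resource);
    }

    public static void main(String[] args) {
        ExtractorArgs parsed = parse(args);
        if (parsed.resource == null) {
            System.err.println("usage: [-links] <url-or-file>");
            return;
        }
        // With the library on the classpath, the hand-off would be:
        //   new StringExtractor(parsed.resource).extractStrings(parsed.links)
        System.out.println("resource=" + parsed.resource + ", links=" + parsed.links);
    }
}
```

Keeping the parsing in a separate static method makes the argument handling easy to check without fetching a page.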

HTML Parser is an open source library released under LGPL.