org.htmlparser.beans
Class StringBean
- Serializable
public class StringBean
implements Serializable
Extract strings from a URL.
Text within <SCRIPT></SCRIPT> tags is removed.
The text within <PRE></PRE> tags is not altered.
The property Strings, which is the output property, is null
until a URL is set. So a typical usage is:
StringBean sb = new StringBean ();
sb.setLinks (false);
sb.setReplaceNonBreakingSpaces (true);
sb.setCollapse (true);
sb.setURL ("http://www.netbeans.org"); // the HTTP is performed here
String s = sb.getStrings ();
You can also use the StringBean as a NodeVisitor on your own parser,
in which case you have to refetch your page if you change one of the
properties because it resets the Strings property:
StringBean sb = new StringBean ();
Parser parser = new Parser ("http://cbc.ca");
parser.visitAllNodesWith (sb);
String s = sb.getStrings ();
sb.setLinks (true);
parser.reset ();
parser.visitAllNodesWith (sb);
String sl = sb.getStrings ();
According to Nick Burch, who contributed the patch, this is handy if you
don't want StringBean to wander off and get the content itself, for example
because you already have it or because it's not on a website.
static String | PROP_COLLAPSE_PROPERTY - Property name in event where the 'collapse whitespace' state changes.
|
static String | PROP_CONNECTION_PROPERTY - Property name in event where the connection changes.
|
static String | PROP_LINKS_PROPERTY - Property name in event where the 'embed links' state changes.
|
static String | PROP_REPLACE_SPACE_PROPERTY - Property name in event where the 'replace non-breaking spaces' state changes.
|
static String | PROP_STRINGS_PROPERTY - Property name in event where the URL contents change.
|
static String | PROP_URL_PROPERTY - Property name in event where the URL changes.
|
protected StringBuffer | mBuffer - The buffer text is stored in while traversing the HTML.
|
protected boolean | mCollapse - If true, sequences of whitespace characters are replaced with a single space character.
|
protected int | mCollapseState - The state of the collapse processing state machine.
|
protected boolean | mIsPre - Set true when traversing a PRE tag.
|
protected boolean | mIsScript - Set true when traversing a SCRIPT tag.
|
protected boolean | mIsStyle - Set true when traversing a STYLE tag.
|
protected boolean | mLinks - If true, the link URLs are embedded in the text output.
|
protected Parser | mParser - The parser used to extract strings.
|
protected PropertyChangeSupport | mPropertySupport - Bound property support.
|
protected boolean | mReplaceSpace - If true, regular space characters are substituted for non-breaking spaces in the text output.
|
protected String | mStrings - The strings extracted from the URL.
|
void | addPropertyChangeListener(PropertyChangeListener listener) - Add a PropertyChangeListener to the listener list.
|
protected void | carriageReturn() - Appends a newline to the buffer if there isn't one there already.
|
protected void | collapse(StringBuffer buffer, String string) - Add the given text collapsing whitespace.
|
protected String | extractStrings() - Extract the text from a page.
|
boolean | getCollapse() - Get the current 'collapse whitespace' state.
|
URLConnection | getConnection() - Get the current connection.
|
boolean | getLinks() - Get the current 'include links' state.
|
boolean | getReplaceNonBreakingSpaces() - Get the current 'replace non breaking spaces' state.
|
String | getStrings() - Return the textual contents of the URL.
|
String | getURL() - Get the current URL.
|
static void | main(String[] args) - Unit test.
|
void | removePropertyChangeListener(PropertyChangeListener listener) - Remove a PropertyChangeListener from the listener list.
|
void | setCollapse(boolean collapse) - Set the current 'collapse whitespace' state.
|
void | setConnection(URLConnection connection) - Set the parser's connection.
|
void | setLinks(boolean links) - Set the 'include links' state.
|
void | setReplaceNonBreakingSpaces(boolean replace) - Set the 'replace non breaking spaces' state.
|
protected void | setStrings() - Fetch the URL contents.
|
void | setURL(String url) - Set the URL to extract strings from.
|
protected void | updateStrings(String strings) - Assign the Strings property, firing the property change.
|
void | visitEndTag(Tag tag) - Resets the state of the PRE and SCRIPT flags.
|
void | visitStringNode(Text string) - Appends the text to the output.
|
void | visitTag(Tag tag) - Appends a NEWLINE to the output if the tag breaks flow, and possibly sets the state of the PRE and SCRIPT flags.
|
PROP_COLLAPSE_PROPERTY
public static final String PROP_COLLAPSE_PROPERTY
Property name in event where the 'collapse whitespace' state changes.
PROP_CONNECTION_PROPERTY
public static final String PROP_CONNECTION_PROPERTY
Property name in event where the connection changes.
PROP_LINKS_PROPERTY
public static final String PROP_LINKS_PROPERTY
Property name in event where the 'embed links' state changes.
PROP_REPLACE_SPACE_PROPERTY
public static final String PROP_REPLACE_SPACE_PROPERTY
Property name in event where the 'replace non-breaking spaces'
state changes.
PROP_STRINGS_PROPERTY
public static final String PROP_STRINGS_PROPERTY
Property name in event where the URL contents change.
PROP_URL_PROPERTY
public static final String PROP_URL_PROPERTY
Property name in event where the URL changes.
mBuffer
protected StringBuffer mBuffer
The buffer text is stored in while traversing the HTML.
mCollapse
protected boolean mCollapse
If true, sequences of whitespace characters are replaced
with a single space character.
mCollapseState
protected int mCollapseState
The state of the collapse processing state machine.
mIsPre
protected boolean mIsPre
Set true when traversing a PRE tag.
mIsScript
protected boolean mIsScript
Set true when traversing a SCRIPT tag.
mIsStyle
protected boolean mIsStyle
Set true when traversing a STYLE tag.
mLinks
protected boolean mLinks
If true, the link URLs are embedded in the text output.
mParser
protected Parser mParser
The parser used to extract strings.
mPropertySupport
protected PropertyChangeSupport mPropertySupport
Bound property support.
mReplaceSpace
protected boolean mReplaceSpace
If true, regular space characters are substituted for
non-breaking spaces in the text output.
mStrings
protected String mStrings
The strings extracted from the URL.
StringBean
public StringBean()
Create a StringBean object.
Default property values are set to 'do the right thing':
- Links is set false so text appears like a browser would display it,
albeit without the colour or underline clues normally associated with a link.
- ReplaceNonBreakingSpaces is set true, so that printing the text works,
but the extra information regarding these formatting marks is available
if you set it false.
- Collapse is set true, so text appears compact like a browser would
display it.
addPropertyChangeListener
public void addPropertyChangeListener(PropertyChangeListener listener)
Add a PropertyChangeListener to the listener list.
The listener is registered for all properties.
listener
- The PropertyChangeListener to be added.
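The bound-property mechanism can be illustrated with the standard java.beans classes. The following is a minimal sketch, not the StringBean source: the class LinksBeanSketch and the property name "links" are hypothetical stand-ins, and only PropertyChangeSupport and PropertyChangeListener are real JDK APIs.

```java
import java.beans.PropertyChangeListener;
import java.beans.PropertyChangeSupport;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a bean with one bound boolean property.
class LinksBeanSketch {
    // Assumed property name; the actual constant values are not given here.
    static final String PROP_LINKS = "links";

    private final PropertyChangeSupport support = new PropertyChangeSupport(this);
    private boolean links = false;

    void addPropertyChangeListener(PropertyChangeListener listener) {
        support.addPropertyChangeListener(listener); // registered for all properties
    }

    void removePropertyChangeListener(PropertyChangeListener listener) {
        support.removePropertyChangeListener(listener);
    }

    void setLinks(boolean value) {
        boolean old = links;
        links = value;
        // PropertyChangeSupport fires no event when old and new values are equal.
        support.firePropertyChange(PROP_LINKS, old, value);
    }

    // Applies three changes and returns a record of the events the listener saw.
    static List<String> demo() {
        List<String> seen = new ArrayList<>();
        LinksBeanSketch bean = new LinksBeanSketch();
        PropertyChangeListener listener =
            event -> seen.add(event.getPropertyName() + "=" + event.getNewValue());
        bean.addPropertyChangeListener(listener);
        bean.setLinks(true);  // value changes, event is delivered
        bean.setLinks(true);  // no change, no event
        bean.removePropertyChangeListener(listener);
        bean.setLinks(false); // listener already removed, nothing recorded
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With a real StringBean, the listener would presumably compare the event's property name against the PROP_* constants listed above to decide which change it is looking at.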
carriageReturn
protected void carriageReturn()
Appends a newline to the buffer if it does not already end with one.
Does nothing if the buffer is empty.
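The rule can be sketched as follows. This is an illustrative reimplementation under assumptions, not the library code: the class name CarriageReturnSketch is hypothetical, and a plain "\n" stands in for whatever newline string the library actually uses.

```java
class CarriageReturnSketch {
    // Assumption: "\n" stands in for the library's newline string.
    static final String NEWLINE = "\n";

    // Append NEWLINE unless the buffer is empty or already ends with it.
    static void carriageReturn(StringBuffer buffer) {
        int length = buffer.length();
        boolean endsWithNewline = length >= NEWLINE.length()
            && buffer.substring(length - NEWLINE.length()).equals(NEWLINE);
        if (length != 0 && !endsWithNewline) {
            buffer.append(NEWLINE);
        }
    }

    // Calls carriageReturn twice to show that the second call changes nothing.
    static String demo(String initial) {
        StringBuffer buffer = new StringBuffer(initial);
        carriageReturn(buffer);
        carriageReturn(buffer);
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo("line").equals("line" + NEWLINE));
        System.out.println(demo("").isEmpty());
    }
}
```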
collapse
protected void collapse(StringBuffer buffer, String string)
Add the given text collapsing whitespace.
Use a little finite state machine:
state 0: whitespace was last emitted character
state 1: in whitespace
state 2: in word
A whitespace character moves us to state 1 and any other character
moves us to state 2, except that state 0 stays in state 0 until
a non-whitespace character is seen. Going from whitespace (state 1)
to a word, we emit a space before the character:
state   on whitespace   on other character
0       0               2
1       1               space, then 2
2       1               2
buffer
- The buffer to append to.
string
- The string to append.
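The transition table can be read as code. The following is a minimal sketch written from the table above and the whitespace set listed under getCollapse, not a copy of the library source; the class name CollapseSketch and the helpers isCollapsible and collapseAll are hypothetical.

```java
class CollapseSketch {
    // 0: whitespace was last emitted, 1: in whitespace, 2: in word
    private int state = 0;

    // Characters treated as whitespace: space, tab, form feed,
    // zero-width space, carriage return and newline.
    static boolean isCollapsible(char c) {
        switch (c) {
            case ' ':
            case '\t':
            case '\f':
            case '\u200B':
            case '\r':
            case '\n':
                return true;
            default:
                return false;
        }
    }

    // Append string to buffer, emitting at most one space per whitespace run.
    void collapse(StringBuffer buffer, String string) {
        for (int i = 0; i < string.length(); i++) {
            char c = string.charAt(i);
            if (isCollapsible(c)) {
                if (state != 0) {
                    state = 1;          // states 1 and 2 move to 1; state 0 stays put
                }
            } else {
                if (state == 1) {
                    buffer.append(' '); // leaving a whitespace run: emit one space
                }
                state = 2;
                buffer.append(c);
            }
        }
    }

    // Feeds several pieces through one state machine, as successive text nodes would be.
    static String collapseAll(String... pieces) {
        CollapseSketch machine = new CollapseSketch();
        StringBuffer buffer = new StringBuffer();
        for (String piece : pieces) {
            machine.collapse(buffer, piece);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + collapseAll("  hello \t\n", " world  ") + "]");
    }
}
```

Because the state is kept between calls, whitespace that ends one piece and begins the next still produces a single space, and leading or trailing whitespace produces none.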
extractStrings
protected String extractStrings()
throws ParserException
Extract the text from a page.
- The textual contents of the page.
getCollapse
public boolean getCollapse()
Get the current 'collapse whitespace' state.
If set to true this emulates the operation of browsers in interpreting
text where "user agents should collapse input white space sequences when
producing output inter-word space".
See HTML specification section 9.1 White space
http://www.w3.org/TR/html4/struct/text.html#h-9.1.
- true if sequences of whitespace (space '\u0020',
tab '\u0009', form feed '\u000C', zero-width space '\u200B',
carriage-return '\r' and NEWLINE '\n') are to be replaced with a single
space.
getConnection
public URLConnection getConnection()
Get the current connection.
- The connection that the parser has or null if it
hasn't been set or the parser hasn't been constructed yet.
getLinks
public boolean getLinks()
Get the current 'include links' state.
- true if link text is included in the text extracted
from the URL, false otherwise.
getReplaceNonBreakingSpaces
public boolean getReplaceNonBreakingSpaces()
Get the current 'replace non breaking spaces' state.
- true if non-breaking spaces (character '\u00a0',
numeric character reference &#160; or character entity
reference &nbsp;) are to be replaced with normal
spaces (character '\u0020').
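The effect of this setting can be illustrated with plain string operations. This is a simplified sketch and an assumption about the observable result only: the class NbspSketch is hypothetical, and the library may well decode entity references elsewhere before the replacement takes place.

```java
class NbspSketch {
    // Replace the three forms of a non-breaking space with a normal space:
    // the entity reference, the numeric reference, and the character itself.
    static String replaceNonBreakingSpaces(String text) {
        return text
            .replace("&nbsp;", " ")
            .replace("&#160;", " ")
            .replace('\u00a0', ' ');
    }

    public static void main(String[] args) {
        System.out.println(replaceNonBreakingSpaces("10&nbsp;km, 5&#160;kg"));
    }
}
```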
getStrings
public String getStrings()
Return the textual contents of the URL.
This is the primary output of the bean.
- The user visible (what would be seen in a browser) text.
getURL
public String getURL()
Get the current URL.
- The URL from which text has been extracted, or null
if this property has not been set yet.
main
public static void main(String[] args)
Unit test.
args
- Pass args[0] as the URL to process.
removePropertyChangeListener
public void removePropertyChangeListener(PropertyChangeListener listener)
Remove a PropertyChangeListener from the listener list.
This removes a registered PropertyChangeListener.
listener
- The PropertyChangeListener to be removed.
setCollapse
public void setCollapse(boolean collapse)
Set the current 'collapse whitespace' state.
If the setting is changed after the URL has been set, the text from the
URL will be reacquired, which is possibly expensive.
The internal state of the collapse state machine can be reset with
code like this:
setCollapse (getCollapse ());
collapse
- If true, sequences of whitespace
will be reduced to a single space.
setConnection
public void setConnection(URLConnection connection)
Set the parser's connection.
The text from the URL will be fetched, which may be expensive, so this
property should be set last.
connection
- New value of property Connection.
setLinks
public void setLinks(boolean links)
Set the 'include links' state.
If the setting is changed after the URL has been set, the text from the
URL will be reacquired, which is possibly expensive.
links
- Use true if link text is to be included in the
text extracted from the URL, false otherwise.
setReplaceNonBreakingSpaces
public void setReplaceNonBreakingSpaces(boolean replace)
Set the 'replace non breaking spaces' state.
If the setting is changed after the URL has been set, the text from the
URL will be reacquired, which is possibly expensive.
replace
- true if non-breaking spaces
(character '\u00a0', numeric character reference &#160;
or character entity reference &nbsp;) are to be replaced with normal
spaces (character '\u0020').
setStrings
protected void setStrings()
Fetch the URL contents.
Only do work if there is a valid parser with its URL set.
setURL
public void setURL(String url)
Set the URL to extract strings from.
The text from the URL will be fetched, which may be expensive, so this
property should be set last.
url
- The URL that text should be fetched from.
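Why the URL should be set last can be shown with a small model that counts fetches. This is a sketch under assumptions, not the library code: FetchOrderSketch is a hypothetical stand-in that only mimics the documented rule that changing a setting after the URL is set reacquires the text, and http://example.com is a placeholder URL that is never contacted.

```java
class FetchOrderSketch {
    private String url;              // null until setURL is called
    private boolean links = false;   // defaults follow the constructor description
    private boolean collapse = true;
    int fetches = 0;                 // counts simulated page fetches

    // Only do work if a URL has been set.
    private void setStrings() {
        if (url != null) {
            fetches++;
        }
    }

    void setLinks(boolean value) {
        if (links != value) {
            links = value;
            setStrings();
        }
    }

    void setCollapse(boolean value) {
        if (collapse != value) {
            collapse = value;
            setStrings();
        }
    }

    void setURL(String value) {
        if (value != null && !value.equals(url)) {
            url = value;
            setStrings();
        }
    }

    // Settings first, URL last: one fetch.
    static int settingsFirst() {
        FetchOrderSketch bean = new FetchOrderSketch();
        bean.setLinks(true);
        bean.setCollapse(false);
        bean.setURL("http://example.com");
        return bean.fetches;
    }

    // URL first: every later change triggers another fetch.
    static int urlFirst() {
        FetchOrderSketch bean = new FetchOrderSketch();
        bean.setURL("http://example.com");
        bean.setLinks(true);
        bean.setCollapse(false);
        return bean.fetches;
    }

    public static void main(String[] args) {
        System.out.println(settingsFirst() + " fetch vs " + urlFirst() + " fetches");
    }
}
```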
updateStrings
protected void updateStrings(String strings)
Assign the Strings property, firing the property change.
strings
- The new value of the Strings property.
visitEndTag
public void visitEndTag(Tag tag)
Resets the state of the PRE and SCRIPT flags.
- visitEndTag in interface NodeVisitor
tag
- The end tag to process.
visitTag
public void visitTag(Tag tag)
Appends a NEWLINE to the output if the tag breaks flow, and
possibly sets the state of the PRE and SCRIPT flags.
- visitTag in interface NodeVisitor
tag
- The tag to examine.
© 2005 Derrick Oswald, May 08, 2008
HTML Parser is an open source library released under LGPL.