This class parses the HTML stream into nodes.
There are three major types of nodes (lexemes): Remark, Text and Tag.
Each time nextNode() is called, another node is returned until the stream is exhausted, at which point null is returned.
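The call pattern described above, repeated calls until null signals the end of the stream, can be sketched as follows. This is a minimal illustration only: NodeSource is a hypothetical stand-in for the lexer and hands out plain strings rather than Remark, Text or Tag nodes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class NextNodeLoopSketch {
    /** Hypothetical stand-in for the lexer: one lexeme per call, null when done. */
    interface NodeSource {
        String nextNode();
    }

    /** Builds a source that replays a fixed list of lexemes. */
    static NodeSource over(List<String> lexemes) {
        Iterator<String> it = lexemes.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    /** Calls nextNode() until it returns null and collects what was seen. */
    static List<String> drain(NodeSource source) {
        List<String> seen = new ArrayList<>();
        String node;
        while ((node = source.nextNode()) != null) {
            seen.add(node);
        }
        return seen;
    }
}
```

With the real lexer the loop body would inspect each returned Node instead of collecting strings.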
STRICT_REMARKS
public static boolean STRICT_REMARKS
Flag controlling strict remark processing.
If true, remarks are not terminated by ---> or --!>, i.e. by sequences with more than two dashes or an interposed exclamation mark. If false, a more lax (and closer to typical browser handling) remark parsing is used.
Default true.
VERSION_DATE
public static final String VERSION_DATE
The date of the version ().
VERSION_NUMBER
public static final double VERSION_NUMBER
The floating point version number ().
VERSION_STRING
public static final String VERSION_STRING
The display version ().
VERSION_TYPE
public static final String VERSION_TYPE
The type of version ().
mCursor
protected Cursor mCursor
The current position on the page.
mDebugLineTrigger
protected static int mDebugLineTrigger
Line number to trigger on.
This is tested on each nextNode() call, as a debugging aid.
Alter this value and set a breakpoint on the guarded statement.
Remember, these line numbers are zero based, while most editors are one based.
mFactory
protected NodeFactory mFactory
The factory for new nodes.
mPage
protected Page mPage
The page lexemes are retrieved from.
createRemarkNode
public Remark createRemarkNode(Page page,
int start,
int end)
Create a new remark node.
- createRemarkNode in interface NodeFactory
page
- The page the node is on.
start
- The beginning position of the remark.
end
- The ending position of the remark.
createStringNode
public Text createStringNode(Page page,
int start,
int end)
Create a new string node.
- createStringNode in interface NodeFactory
page
- The page the node is on.
start
- The beginning position of the string.
end
- The ending position of the string.
createTagNode
public Tag createTagNode(Page page,
int start,
int end,
Vector attributes)
Create a new tag node.
Note that the attributes vector contains at least one element,
which is the tag name (standalone attribute) at position zero.
This can be used to decide which type of node to create, or
gate other processing that may be appropriate.
- createTagNode in interface NodeFactory
page
- The page the node is on.
start
- The beginning position of the tag.
end
- The ending position of the tag.
attributes
- The attributes contained in this tag.
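As a rough illustration of using the element at position zero to decide which type of node to create, the sketch below dispatches on the tag name. It is an assumption-laden simplification: the vector holds plain strings here rather than the library's attribute objects, and the returned labels and the kindFor helper are hypothetical.

```java
import java.util.Locale;
import java.util.Vector;

final class TagNameDispatchSketch {
    /**
     * Picks a node kind from the tag name stored in slot zero.
     * The labels returned are illustrative, not library types.
     */
    static String kindFor(Vector<String> attributes) {
        // Slot zero always exists and holds the tag name.
        String name = attributes.elementAt(0).toUpperCase(Locale.ROOT);
        switch (name) {
            case "SCRIPT":
            case "STYLE":
                return "cdata-container";
            case "A":
                return "link";
            default:
                return "generic";
        }
    }
}
```

A custom factory could use the same test to gate other processing, for example switching to CDATA parsing after a SCRIPT tag.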
getCurrentLine
public String getCurrentLine()
Get the current line.
- The string the lexer's working on.
getCurrentLineNumber
public int getCurrentLineNumber()
Get the current line number.
- The line number the lexer's working on.
getCursor
public Cursor getCursor()
Get the current scanning position.
- The lexer's cursor position.
getNodeFactory
public NodeFactory getNodeFactory()
Get the current node factory.
- The lexer's node factory.
getPage
public Page getPage()
Get the page this lexer is working on.
- The page that nodes are being read from.
getPosition
public int getPosition()
Get the current cursor position.
- The current character offset into the source.
getVersion
public static String getVersion()
Return the version string of this parser.
- A string of the form:
"[floating point number] ([build-type] [build-date])"
main
public static void main(String[] args)
throws MalformedURLException,
ParserException
Mainline for command line operation.
args
- [0] The URL to parse.
makeRemark
protected Node makeRemark(int start,
int end)
throws ParserException
Create a remark node based on the current cursor and the one provided.
start
- The starting point of the node.
end
- The ending point of the node.
makeString
protected Node makeString(int start,
int end)
throws ParserException
Create a string node based on the current cursor and the one provided.
start
- The starting point of the node.
end
- The ending point of the node.
makeTag
protected Node makeTag(int start,
int end,
Vector attributes)
throws ParserException
Create a tag node based on the current cursor and the one provided.
start
- The starting point of the node.
end
- The ending point of the node.
attributes
- The attributes parsed from the tag.
nextNode
public Node nextNode()
throws ParserException
Get the next node from the source.
- A Remark, Text or Tag, or null if no more lexemes are present.
nextNode
public Node nextNode(boolean quotesmart)
throws ParserException
Get the next node from the source.
quotesmart
- If true, strings ignore quoted contents.
- A Remark, Text or Tag, or null if no more lexemes are present.
parseCDATA
public Node parseCDATA()
throws ParserException
Return CDATA as a text node.
According to appendix
B.3.2 Specifying non-HTML data of the
HTML 4.01 Specification:
Element content
When script or style data is the content of an element (SCRIPT and STYLE),
the data begins immediately after the element start tag and ends at the
first ETAGO ("</") delimiter followed by a name start character ([a-zA-Z]);
note that this may not be the element's end tag.
Authors should therefore escape "</" within the content. Escape mechanisms
are specific to each scripting or style sheet language.
- The TextNode of the CDATA, or null if none.
parseCDATA
public Node parseCDATA(boolean quotesmart)
throws ParserException
Return CDATA as a text node.
Slightly less rigid than parseCDATA(), this method provides for
parsing CDATA that may contain quoted strings with embedded
ETAGO ("</") delimiters, and skips single-line and multiline comments.
quotesmart
- If true, the strict definition of CDATA is extended to allow for single or double quoted ETAGO ("</") sequences.
- The TextNode of the CDATA, or null if none.
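The difference between the strict rule (stop at the first "</" followed by a letter) and the quote-aware variant can be sketched on a plain string. This is a simplified illustration, not the lexer's code: findEnd is a hypothetical helper, it returns an offset instead of a node, and it leaves out the comment skipping mentioned above.

```java
final class CdataEndSketch {
    /**
     * Returns the offset of the ETAGO that ends the CDATA starting at
     * 'start', or text.length() if none is found. With quoteSmart set,
     * an ETAGO inside single or double quotes is ignored.
     */
    static int findEnd(String text, int start, boolean quoteSmart) {
        int n = text.length();
        char quote = 0; // 0 means "not inside a quoted string"
        int i = start;
        while (i < n) {
            char ch = text.charAt(i);
            if (quote != 0) {
                if (ch == '\\') {
                    i += 2; // skip the escaped character
                    continue;
                }
                if (ch == quote) {
                    quote = 0; // closing quote
                }
                i++;
                continue;
            }
            if (quoteSmart && (ch == '"' || ch == '\'')) {
                quote = ch; // opening quote
                i++;
                continue;
            }
            if (ch == '<' && i + 2 < n && text.charAt(i + 1) == '/'
                    && isNameStart(text.charAt(i + 2))) {
                return i;
            }
            i++;
        }
        return n;
    }

    /** Name start characters per the quoted specification text: [a-zA-Z]. */
    private static boolean isNameStart(char ch) {
        return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
    }
}
```

For the input document.write('</b>');</script> the strict rule stops inside the quoted string, which is why the specification advises authors to escape "</", while the quote-aware variant continues to the real end tag.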
parseJsp
protected Node parseJsp(int start)
throws ParserException
Parse a java server page node.
Scan characters until "%>" is encountered, or the input stream is
exhausted, in which case null is returned.
start
- The position at which to start scanning.
parsePI
protected Node parsePI(int start)
throws ParserException
Parse an XML processing instruction.
Scan characters until "?>" is encountered, or the input stream is
exhausted, in which case null is returned.
start
- The position at which to start scanning.
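Both parseJsp and parsePI are described as scanning forward to a two-character terminator and returning null if the input runs out first. A minimal sketch of that contract on a plain string, using a hypothetical scanThrough helper that returns the matched text rather than a node:

```java
final class DelimitedScanSketch {
    /**
     * Returns the text from 'start' through the closing delimiter,
     * or null if the delimiter never appears.
     */
    static String scanThrough(String text, int start, String close) {
        int end = text.indexOf(close, start);
        if (end < 0) {
            return null; // input exhausted before the terminator
        }
        return text.substring(start, end + close.length());
    }
}
```

Passing "%>" models the JSP case and "?>" the processing instruction case.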
parseRemark
protected Node parseRemark(int start,
boolean quotesmart)
throws ParserException
Parse a comment.
Parse a remark markup.
From the
HTML 4.01 Specification, W3C Recommendation 24 December 1999
http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.4
3.2.4 Comments
HTML comments have the following syntax:
<!-- this is a comment -->
<!-- and so is this one,
which occupies more than one line -->
White space is not permitted between the markup declaration
open delimiter("<!") and the comment open delimiter ("--"),
but is permitted between the comment close delimiter ("--") and
the markup declaration close delimiter (">").
A common error is to include a string of hyphens ("---") within a comment.
Authors should avoid putting two or more adjacent hyphens inside comments.
Information that appears between comments has no special meaning
(e.g., character references are not interpreted as such).
Note that comments are markup.
This method uses a state machine with the following states:
- state 0 - prior to the first open delimiter (first dash)
- state 1 - prior to the second open delimiter (second dash)
- state 2 - prior to the first closing delimiter (first dash)
- state 3 - prior to the second closing delimiter (second dash)
- state 4 - prior to the terminating >
All comment text (everything excluding the < and >) is included
in the remark text.
We allow terminators like --!> even though this isn't part of the spec.
start
- The position at which to start scanning.
quotesmart
- If true, strings ignore quoted contents.
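The five states listed above can be sketched as a small scanner over a plain string. This is an illustrative reading of the description, not the lexer's actual code: scanRemark is a hypothetical helper, the strict parameter stands in for STRICT_REMARKS, and the exact transitions out of state 4 are an assumption.

```java
final class RemarkScanSketch {
    /**
     * Scans a remark that begins with "<!" at 'start'. Returns the offset
     * just past the terminating '>', or -1 if the text is not a remark or
     * the terminator is never found.
     */
    static int scanRemark(String text, int start, boolean strict) {
        int state = 0;
        for (int i = start + 2; i < text.length(); i++) {
            char ch = text.charAt(i);
            switch (state) {
                case 0: // prior to the first open delimiter (first dash)
                    if (ch != '-') return -1;
                    state = 1;
                    break;
                case 1: // prior to the second open delimiter (second dash)
                    if (ch != '-') return -1;
                    state = 2;
                    break;
                case 2: // prior to the first closing delimiter (first dash)
                    if (ch == '-') state = 3;
                    break;
                case 3: // prior to the second closing delimiter (second dash)
                    state = (ch == '-') ? 4 : 2;
                    break;
                default: // state 4: prior to the terminating >
                    if (ch == '>') return i + 1;
                    if (Character.isWhitespace(ch)) break; // "-- >" is legal
                    if (!strict && (ch == '-' || ch == '!')) break; // lax: "--->" and "--!>"
                    state = 2; // anything else: still inside the comment text
                    break;
            }
        }
        return -1;
    }
}
```

Under these assumptions "<!-- a -->" ends at offset 10 in both modes, while "<!-- a --!>" is only terminated when strict is false.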
parseString
protected Node parseString(int start,
boolean quotesmart)
throws ParserException
Parse a string node.
Scan characters until "</", "<%", "<!" or < followed by a
letter is encountered, or the input stream is exhausted, in which
case null is returned.
start
- The position at which to start scanning.
quotesmart
- If true, strings ignore quoted contents.
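The stopping rule for string nodes can be illustrated on a plain string as follows. This is a sketch under simplifying assumptions: findTextEnd is a hypothetical helper, it returns an offset rather than a node, and it ignores the quotesmart option.

```java
final class TextEndSketch {
    /**
     * Returns the offset at which a text run starting at 'start' ends:
     * the first '<' followed by '/', '%', '!' or a letter. If no such
     * sequence exists, the run extends to the end of the input.
     */
    static int findTextEnd(String text, int start) {
        for (int i = start; i + 1 < text.length(); i++) {
            if (text.charAt(i) != '<') {
                continue;
            }
            char next = text.charAt(i + 1);
            if (next == '/' || next == '%' || next == '!' || Character.isLetter(next)) {
                return i;
            }
        }
        return text.length();
    }
}
```

A bare < that is not followed by one of those characters, as in "1 < 2", stays part of the text.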
parseTag
protected Node parseTag(int start)
throws ParserException
Parse a tag.
Parse the name and attributes from a start tag.
From the
HTML 4.01 Specification, W3C Recommendation 24 December 1999
http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.2
3.2.2 Attributes
Elements may have associated properties, called attributes, which may
have values (by default, or set by authors or scripts). Attribute/value
pairs appear before the final ">" of an element's start tag. Any number
of (legal) attribute value pairs, separated by spaces, may appear in an
element's start tag. They may appear in any order.
In this example, the id attribute is set for an H1 element:
<H1 id="section1">
This is an identified heading thanks to the id attribute
</H1>
By default, SGML requires that all attribute values be delimited using
either double quotation marks (ASCII decimal 34) or single quotation
marks (ASCII decimal 39). Single quote marks can be included within the
attribute value when the value is delimited by double quote marks, and
vice versa. Authors may also use numeric character references to
represent double quotes (&#34;) and single quotes (&#39;).
For double quotes authors can also use the character entity reference
&quot;.
In certain cases, authors may specify the value of an attribute without
any quotation marks. The attribute value may only contain letters
(a-z and A-Z), digits (0-9), hyphens (ASCII decimal 45),
periods (ASCII decimal 46), underscores (ASCII decimal 95),
and colons (ASCII decimal 58). We recommend using quotation marks even
when it is possible to eliminate them.
Attribute names are always case-insensitive.
Attribute values are generally case-insensitive. The definition of each
attribute in the reference manual indicates whether its value is
case-insensitive.
All the attributes defined by this specification are listed in the
attribute index.
This method uses a state machine with the following states:
- state 0 - outside of any attribute
- state 1 - within attribute name
- state 2 - equals hit
- state 3 - within naked attribute value
- state 4 - within single quoted attribute value
- state 5 - within double quoted attribute value
- state 6 - whitespace after attribute name, which could lead to state 2 (=) or state 0
The starting point for the various components is stored in an array
of integers that match the initiation point for the states one-for-one,
i.e. bookmarks[0] is where state 0 began, bookmarks[1] is where state 1
began, etc.
Attributes are stored in a Vector having one slot for each whitespace or attribute/value pair.
The first slot is for the tag name (kind of like a standalone attribute).
start
- The position at which to start scanning.
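The seven states listed above can be sketched as a small tokenizer over the text between < and >. This is a simplified reading of the description, not the lexer's code: the parse helper is hypothetical, it uses named offsets instead of the bookmarks array, it emits strings rather than attribute objects, and it does not create slots for whitespace.

```java
import java.util.ArrayList;
import java.util.List;

final class TagAttributeSketch {
    /** Splits tag contents into the tag name followed by name=value slots. */
    static List<String> parse(String tag) {
        List<String> slots = new ArrayList<>();
        int state = 0;
        int nameStart = 0;
        int nameEnd = 0;
        int valueStart = 0;
        int n = tag.length();
        for (int i = 0; i < n; i++) {
            char ch = tag.charAt(i);
            boolean ws = Character.isWhitespace(ch);
            switch (state) {
                case 0: // outside of any attribute
                    if (!ws) { nameStart = i; state = 1; }
                    break;
                case 1: // within attribute name
                    if (ws) { nameEnd = i; state = 6; }
                    else if (ch == '=') { nameEnd = i; state = 2; }
                    break;
                case 2: // equals hit
                    if (ch == '\'') { valueStart = i + 1; state = 4; }
                    else if (ch == '"') { valueStart = i + 1; state = 5; }
                    else if (!ws) { valueStart = i; state = 3; }
                    break;
                case 3: // within naked attribute value
                    if (ws) { slots.add(pair(tag, nameStart, nameEnd, valueStart, i)); state = 0; }
                    break;
                case 4: // within single quoted attribute value
                    if (ch == '\'') { slots.add(pair(tag, nameStart, nameEnd, valueStart, i)); state = 0; }
                    break;
                case 5: // within double quoted attribute value
                    if (ch == '"') { slots.add(pair(tag, nameStart, nameEnd, valueStart, i)); state = 0; }
                    break;
                default: // state 6: whitespace after attribute name
                    if (ch == '=') { state = 2; }
                    else if (!ws) {
                        // The previous name was standalone; a new name starts here.
                        slots.add(tag.substring(nameStart, nameEnd));
                        nameStart = i;
                        state = 1;
                    }
                    break;
            }
        }
        // Flush whatever was in progress when the input ended.
        switch (state) {
            case 1: slots.add(tag.substring(nameStart)); break;
            case 6: slots.add(tag.substring(nameStart, nameEnd)); break;
            case 2: slots.add(tag.substring(nameStart, nameEnd) + "="); break;
            case 3: case 4: case 5:
                slots.add(pair(tag, nameStart, nameEnd, valueStart, n));
                break;
            default: break;
        }
        return slots;
    }

    private static String pair(String tag, int ns, int ne, int vs, int ve) {
        return tag.substring(ns, ne) + "=" + tag.substring(vs, ve);
    }
}
```

In this sketch the tag name falls out of the same machinery as any standalone attribute, which matches the note that the first slot behaves like one.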
reset
public void reset()
Reset the lexer to start parsing from the beginning again.
The underlying components are reset such that the next call to
nextNode() will return the first lexeme on the page.
scanJIS
protected void scanJIS(Cursor cursor)
throws ParserException
Advance the cursor through a JIS escape sequence.
cursor
- A cursor positioned within the escape sequence.
setCursor
public void setCursor(Cursor cursor)
Set the current scanning position.
cursor
- The lexer's new cursor position.
setNodeFactory
public void setNodeFactory(NodeFactory factory)
Set the current node factory.
factory
- The node factory to be used by the lexer.
setPage
public void setPage(Page page)
Set the page this lexer is working on.
page
- The page that nodes will be read from.
setPosition
public void setPosition(int position)
Set the current cursor position.
position
- The new character offset into the source.