The lexer package is the base level I/O subsystem.
The lexer package is responsible for reading characters from the HTML source
and identifying the node lexemes. For example, the HTML code below would return
the list of nodes shown:
<html><head><title>Humoresque</title></head>
<body bgcolor='silver'>
Passengers will please refrain
from flushing toilets while the train
is standing in the station. I love you!
<p>
We encourage constipation
while the train is in the station
If the train can't go
then why should you.
</body>
</html>
- line 0, offset 0, to line 0, offset 6, html tag
- line 0, offset 6, to line 0, offset 12, head tag
- line 0, offset 12, to line 0, offset 19, title tag
- line 0, offset 19, to line 0, offset 29, string node "Humoresque"
- line 0, offset 29, to line 0, offset 37, end title tag
- line 0, offset 37, to line 0, offset 44, end head tag
- line 0, offset 44, to line 0, offset 45, string node "\n"
- line 1, offset 0, to line 1, offset 23, body tag
- line 1, offset 23, to line 4, offset 40, string node "\nPassengers...you!\n"
- line 5, offset 0, to line 5, offset 3, paragraph tag
- line 5, offset 3, to line 9, offset 21, string node "\nWe...you.\n"
- line 10, offset 0, to line 10, offset 7, end body tag
- line 10, offset 7, to line 10, offset 8, string node "\n"
- line 11, offset 0, to line 11, offset 7, end html tag
- line 11, offset 7, to line 11, offset 8, string node "\n"
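The extents in the list above can be reproduced with a small stand-alone sketch.
This is not the package's Lexer: ToyLexer, Lexeme and lex() are hypothetical
names invented for illustration, and the sketch only distinguishes tags from
strings (no comments, no JSP, no '>' inside attribute values). It assumes the
convention used in the list, where lines and offsets are zero-based and the end
offset is one past the last character, on that character's own line.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy lexer (not the package's Lexer): splits HTML into tag and string nodes. */
final class ToyLexer {

    /** One lexeme with zero-based line/offset extents; the end offset is exclusive. */
    static final class Lexeme {
        final String kind;
        final String text;
        final int startLine;
        final int startOffset;
        final int endLine;
        final int endOffset;

        Lexeme(String kind, String text, int startLine, int startOffset,
               int endLine, int endOffset) {
            this.kind = kind;
            this.text = text;
            this.startLine = startLine;
            this.startOffset = startOffset;
            this.endLine = endLine;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return "line " + startLine + ", offset " + startOffset
                + ", to line " + endLine + ", offset " + endOffset + ", " + kind;
        }
    }

    /** Returns contiguous nodes covering every character of the input. */
    static List<Lexeme> lex(String html) {
        List<Lexeme> nodes = new ArrayList<>();
        int line = 0;
        int offset = 0;
        int pos = 0;
        while (pos < html.length()) {
            String kind;
            int end;
            if (html.charAt(pos) == '<') {
                // Simplification: a tag runs to the next '>' (or to end of input).
                int close = html.indexOf('>', pos);
                end = (close < 0) ? html.length() : close + 1;
                kind = "tag";
            } else {
                // A string runs up to, but not including, the next '<'.
                int open = html.indexOf('<', pos);
                end = (open < 0) ? html.length() : open;
                kind = "string";
            }
            int startLine = line;
            int startOffset = offset;
            int endLine = line;
            int endOffset = offset;
            for (int i = pos; i < end; i++) {
                // The end is reported just past character i, on i's own line.
                endLine = line;
                endOffset = offset + 1;
                if (html.charAt(i) == '\n') {
                    line++;
                    offset = 0;
                } else {
                    offset++;
                }
            }
            nodes.add(new Lexeme(kind, html.substring(pos, end),
                                 startLine, startOffset, endLine, endOffset));
            pos = end;
        }
        return nodes;
    }

    public static void main(String[] args) {
        String sample = "<html><head><title>Humoresque</title></head>\n"
                      + "<body bgcolor='silver'>\n";
        for (Lexeme node : lex(sample)) {
            System.out.println("- " + node);
        }
    }
}
```

Because each node starts exactly where the previous one ended, concatenating
the text of all nodes gives back the original input.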
Stream, Source, Page and Lexer
The package is arranged in four levels, Stream, Source, Page and Lexer, in
order from lowest to highest.
A Stream is raw bytes from the URLConnection or file. It has no intelligence.
A Source is raw characters, hence it knows about the encoding scheme used and
can be reset if a different encoding is detected after partially reading in the
text. A Page provides characters from the source while maintaining the index of
line numbers, and hence can be thought of as an array of strings corresponding
to source file lines, but it doesn't actually store any text, relying on the
buffering within the Source instead. The Lexer contains the actual lexeme
parsing code. It reads characters from the page, keeping track of where it is
with a Cursor, and creates the array of nodes using various state machines.
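The layering described above can be imitated with standard java.io classes.
The following is a minimal sketch under stated assumptions, not the package's
implementation: LayerSketch, LineIndex, addLineStart() and readAll() are
hypothetical names, an InputStream stands in for the Stream level, an
InputStreamReader stands in for the Source level, and the line index stands in
for the Page level. Resetting the Source after an encoding change is omitted.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the byte, character and line-index levels. */
final class LayerSketch {

    /** Page-like level: stores only where each line starts, never the text. */
    static final class LineIndex {
        private final List<Integer> lineStarts = new ArrayList<>();

        LineIndex() {
            lineStarts.add(0); // line 0 starts at character position 0
        }

        void addLineStart(int position) {
            lineStarts.add(position);
        }

        /** Zero-based line containing the given character position. */
        int row(int position) {
            int row = lineStarts.size() - 1;
            while (row > 0 && lineStarts.get(row) > position) {
                row--;
            }
            return row;
        }

        /** Zero-based offset of the position within its line. */
        int column(int position) {
            return position - lineStarts.get(row(position));
        }
    }

    /** Decodes bytes to characters and records each line start as it goes. */
    static String readAll(InputStream stream, Charset charset, LineIndex index) {
        StringBuilder buffer = new StringBuilder(); // character buffer, kept outside the index
        try (Reader source = new InputStreamReader(stream, charset)) {
            int c;
            while ((c = source.read()) != -1) {
                buffer.append((char) c);
                if (c == '\n') {
                    index.addLineStart(buffer.length());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        byte[] raw = "<html>\n<body>\n".getBytes(StandardCharsets.UTF_8);
        LineIndex index = new LineIndex();
        String text = readAll(new ByteArrayInputStream(raw), StandardCharsets.UTF_8, index);
        int position = text.indexOf("<body>");
        System.out.println("line " + index.row(position) + ", offset " + index.column(position));
    }
}
```

Keeping only line-start positions in the index mirrors the point made above:
a position can be converted to a line and offset without the index holding a
second copy of the text.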
The following are some design goals and 'invariants' within the package, which
may help if you are attempting to understand or modify it.
- in text - parseString()
- in comment - parseRemark()
- in tag - parseTag()
- in JSP tag - parseJsp()
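One way to picture how a lexer might choose between the four states listed
above is to look at the characters at the current position. The sketch below is
an assumption made for illustration, not the package's actual dispatch logic:
StateSketch, State and stateAt() are hypothetical names, and the lookahead
rules are a simplification.

```java
/** Hypothetical state selection by lookahead; not the package's real logic. */
final class StateSketch {

    /** The four states from the list, with the parse method each one names. */
    enum State {
        TEXT,     // in text     - parseString()
        COMMENT,  // in comment  - parseRemark()
        TAG,      // in tag      - parseTag()
        JSP       // in JSP tag  - parseJsp()
    }

    /** Guesses which state applies to the lexeme starting at pos. */
    static State stateAt(String html, int pos) {
        if (html.startsWith("<!--", pos)) {
            return State.COMMENT;
        }
        if (html.startsWith("<%", pos)) {
            return State.JSP;
        }
        if (html.startsWith("<", pos) && pos + 1 < html.length()) {
            char next = html.charAt(pos + 1);
            // Assumed rule: '<' opens a tag only when followed by '/', '!' or a letter.
            if (next == '/' || next == '!' || Character.isLetter(next)) {
                return State.TAG;
            }
        }
        // Anything else, including a stray '<', is treated as text.
        return State.TEXT;
    }

    public static void main(String[] args) {
        String[] samples = { "plain text", "<!-- remark -->", "<body>", "<% jsp %>", "< 3" };
        for (String sample : samples) {
            System.out.println(stateAt(sample, 0) + " : " + sample);
        }
    }
}
```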
htmlparser.jar