Represents the contents of an HTML page.
Contains the source of characters and an index of positions of line
separators (actually the first character position on the next line).
DEFAULT_CHARSET
public static final String DEFAULT_CHARSET
The default charset.
This should be ISO-8859-1;
see RFC 2616 (http://www.ietf.org/rfc/rfc2616.txt?number=2616),
section 3.7.1.
Another alias is "8859_1".
DEFAULT_CONTENT_TYPE
public static final String DEFAULT_CONTENT_TYPE
The default content type.
In the absence of alternate information, assume HTML content (text/html).
EOF
public static final char EOF
Character value when the page is exhausted.
Has a value of .
mBaseUrl
protected String mBaseUrl
The base URL for this page.
mConnection
protected URLConnection mConnection
The connection this page is coming from, or null.
mConnectionManager
protected static ConnectionManager mConnectionManager
Connection control (proxy, cookies, authorization).
mIndex
protected PageIndex mIndex
Character positions of the first character in each line.
mSource
protected Source mSource
The source of characters.
mUrl
protected String mUrl
The URL this page is coming from.
Cached value of getConnection().toExternalForm() or setUrl().
Page
public Page()
Construct an empty page.
Page
public Page(InputStream stream,
String charset)
throws UnsupportedEncodingException
Construct a page from a stream encoded with the given charset.
stream
- The source of bytes.
charset
- The encoding used.
If null, defaults to the DEFAULT_CHARSET.
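The null-charset fallback of this constructor can be illustrated with a standalone sketch built on java.io.InputStreamReader. It does not use the Page class itself: the class StreamDecodeSketch and its decode method are hypothetical names, and the value "ISO-8859-1" is an assumption about what DEFAULT_CHARSET holds.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

final class StreamDecodeSketch {
    /** Assumed stand-in for Page.DEFAULT_CHARSET. */
    static final String FALLBACK_CHARSET = "ISO-8859-1";

    /** Decode the whole stream, using the fallback when charset is null. */
    static String decode(InputStream stream, String charset) throws IOException {
        String effective = (charset == null) ? FALLBACK_CHARSET : charset;
        // InputStreamReader(InputStream, String) throws
        // UnsupportedEncodingException when the name is not supported.
        try (Reader reader = new InputStreamReader(stream, effective)) {
            StringBuilder text = new StringBuilder();
            char[] chunk = new char[1024];
            int count;
            while ((count = reader.read(chunk)) != -1) {
                text.append(chunk, 0, count);
            }
            return text.toString();
        }
    }
}
```

UnsupportedEncodingException is a subclass of IOException, so a caller of this sketch can handle both failure modes with a single catch clause.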
Page
public Page(String text)
Construct a page from the given string.
The page will report that it is using an encoding of DEFAULT_CHARSET.
Page
public Page(String text,
String charset)
Construct a page from the given string.
text
- The HTML text.
charset
- Optional. The character set encoding that will
be reported by getEncoding(). If charset is null,
the default character set is used.
Page
public Page(URLConnection connection)
throws ParserException
Construct a page reading from a URL connection.
connection
- A fully conditioned connection. The connect()
method will be called so it need not be connected yet.
ParserException
- An exception object wrapping a number of
possible error conditions, some of which are outlined below.
- IOException if an I/O exception occurs creating the
source.
- UnsupportedEncodingException if the character set specified in the
HTTP header is not supported.
Page
public Page(Source source)
Construct a page from a source.
source
- The source of characters.
close
public void close()
throws IOException
Close the page by destroying the source of characters.
column
public int column(int position)
Get the column number for a cursor.
position
- The character offset into the page.
- The character offset into the line this cursor is on.
column
public int column(Cursor cursor)
Get the column number for a cursor.
cursor
- The character offset into the page.
- The character offset into the line this cursor is on.
constructUrl
public URL constructUrl(String link,
String base)
throws MalformedURLException
Build a URL from the link and base provided using non-strict rules.
link
- The (relative) URI.
base
- The base URL of the page, either from the <BASE> tag
or, if none, the URL the page is being fetched from.
constructUrl
public URL constructUrl(String link,
String base,
boolean strict)
throws MalformedURLException
Build a URL from the link and base provided.
link
- The (relative) URI.
base
- The base URL of the page, either from the <BASE> tag
or, if none, the URL the page is being fetched from.
strict
- If true, a link starting with '?' is handled
according to RFC 2396;
otherwise the common interpretation of a query appended to the base
is used instead.
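The difference between the two interpretations of a link starting with '?' can be sketched outside the library. The class QueryLinkSketch and its resolve method are hypothetical, and the sketch returns a String and relies on java.net.URI (which resolves references per RFC 2396) rather than on whatever constructUrl does internally.

```java
import java.net.URI;

final class QueryLinkSketch {
    /**
     * Resolve link against base. Only the leading-'?' case differs
     * between the strict and non-strict branches of this sketch.
     */
    static String resolve(String link, String base, boolean strict) {
        if (!strict && link.startsWith("?")) {
            // Common interpretation: drop any query already on the base,
            // then append the new query to the full base path.
            int query = base.lastIndexOf('?');
            String stem = (query >= 0) ? base.substring(0, query) : base;
            return stem + link;
        }
        // RFC 2396 resolution: a query-only reference replaces the
        // last path segment of the base as well as its query.
        return URI.create(base).resolve(link).toString();
    }
}
```

With the base http://example.com/dir/list.html?page=1 and the link ?page=2, the non-strict branch keeps list.html in the result, while the strict branch resolves to http://example.com/dir/?page=2.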
finalize
protected void finalize()
throws Throwable
Clean up this page, releasing resources.
Calls close().
findCharset
public static String findCharset(String name,
String fallback)
Look up a character set name.
Vacuous for JVMs without java.nio.charset.
This uses reflection so the code will still run under prior JDKs, but
in that case the fallback is always returned.
name
- The name to look up. One of the aliases for a character set.
fallback
- The name to return if the lookup fails.
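On a JVM where java.nio.charset is available, the lookup can be approximated by a direct call instead of reflection. This is a minimal sketch under that assumption; the class name CharsetLookupSketch is hypothetical and the method only mirrors the documented signature.

```java
import java.nio.charset.Charset;

final class CharsetLookupSketch {
    /** Return the canonical name for an alias, or fallback if the lookup fails. */
    static String findCharset(String name, String fallback) {
        try {
            return Charset.forName(name).name();
        } catch (IllegalArgumentException e) {
            // Covers illegal names, unsupported charsets, and a null name.
            return fallback;
        }
    }
}
```

For example, the aliases "latin1" and "8859_1" both map to the canonical name "ISO-8859-1", while an unknown name yields the fallback unchanged.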
getAbsoluteURL
public String getAbsoluteURL(String link)
Create an absolute URL from a relative link.
link
- The relative portion of a URL.
- The fully qualified URL, or the original link if it was already
absolute or a failure occurred.
getAbsoluteURL
public String getAbsoluteURL(String link,
boolean strict)
Create an absolute URL from a relative link.
link
- The relative portion of a URL.
strict
- If true, a link starting with '?' is handled
according to RFC 2396;
otherwise the common interpretation of a query appended to the base
is used instead.
- The fully qualified URL, or the original link if it was already
absolute or a failure occurred.
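The "return the original link on failure" contract can be sketched as follows. The class AbsoluteLinkSketch is hypothetical, the base URL is passed explicitly instead of being read from the page, and java.net.URI stands in for the library's own resolution logic, so what counts as a failure here is an assumption.

```java
import java.net.URI;

final class AbsoluteLinkSketch {
    /** Resolve link against baseUrl, falling back to link itself on any failure. */
    static String absolute(String link, String baseUrl) {
        if (link == null || link.isEmpty() || baseUrl == null) {
            return link;
        }
        try {
            // An already absolute link is returned unchanged by resolve().
            return URI.create(baseUrl).resolve(link).toString();
        } catch (IllegalArgumentException e) {
            // Thrown when either string is not a syntactically valid URI.
            return link;
        }
    }
}
```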
getBaseUrl
public String getBaseUrl()
Gets the baseUrl.
- The base URL for this page, or null if not set.
getCharacter
public char getCharacter(Cursor cursor)
throws ParserException
Read the character at the given cursor position.
The cursor position must be at or behind the
current source position.
Returns end of lines (EOL) as \n, by converting \r and \r\n to \n,
and updates the end-of-line index accordingly.
Advances the cursor position by one (or two in the \r\n case).
cursor
- The position to read at.
- The character at that position, and modifies the cursor to
prepare for the next read. If the source is exhausted a zero is returned.
ParserException
- If an IOException on the underlying source
occurs, or an attempt is made to read characters in the future (the
cursor position is ahead of the underlying stream).
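The end-of-line handling described for getCharacter can be sketched over a plain String. Everything here is hypothetical: the class EolReaderSketch, the position field that plays the role of the cursor, and the END sentinel, which is not necessarily the value the real page returns when the source is exhausted.

```java
import java.util.ArrayList;
import java.util.List;

final class EolReaderSketch {
    /** Hypothetical end-of-input sentinel for this sketch only. */
    static final char END = '\uffff';

    private final String text;
    /** Offsets of the first character of each line seen so far. */
    final List<Integer> lineStarts = new ArrayList<>();
    /** Plays the role of the cursor: the next offset to read. */
    int position = 0;

    EolReaderSketch(String text) {
        this.text = text;
        lineStarts.add(0);
    }

    /** Read one character, reporting "\r" and "\r\n" as a single '\n'. */
    char read() {
        if (position >= text.length()) {
            return END;
        }
        char ch = text.charAt(position++);
        if (ch == '\r') {
            // Advance by two for "\r\n" so the pair counts as one line break.
            if (position < text.length() && text.charAt(position) == '\n') {
                position++;
            }
            ch = '\n';
        }
        if (ch == '\n' && lineStarts.get(lineStarts.size() - 1) < position) {
            // Index the first character position of the next line.
            lineStarts.add(position);
        }
        return ch;
    }
}
```

The guard on lineStarts keeps the index free of duplicates if the same line break is read again after the position has been moved back.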
getCharset
public String getCharset(String content)
Get a CharacterSet name corresponding to a charset parameter.
content
- A text line of the form:
text/html; charset=Shift_JIS
which is applicable both to the HTTP header field Content-Type and
the meta tag http-equiv="Content-Type".
Note this method also handles non-compliant quoted charset directives
such as:
text/html; charset="UTF-8"
and
text/html; charset='UTF-8'
- The character set name to use when reading the input stream.
For JDKs that have the Charset class this is qualified by passing
the name to findCharset() to render it into canonical form.
If the charset parameter is not found in the given string, the default
character set is returned.
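A rough sketch of extracting the charset parameter, including the quoted variants mentioned above, is shown here. The class CharsetParameterSketch, the method charsetOf and the regular expression are all assumptions; unlike getCharset, the sketch returns the name exactly as written and does not canonicalize it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetParameterSketch {
    // "charset", optional spaces, '=', optional spaces,
    // an optional opening quote, then the name itself.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\"';\\s]+)", Pattern.CASE_INSENSITIVE);

    /** Return the charset named in content, or fallback if none is present. */
    static String charsetOf(String content, String fallback) {
        if (content == null) {
            return fallback;
        }
        Matcher matcher = CHARSET.matcher(content);
        return matcher.find() ? matcher.group(1) : fallback;
    }
}
```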
getConnection
public URLConnection getConnection()
Get the connection, if any.
- The connection object for this page, or null if this page
is built from a stream or a string.
getConnectionManager
public static ConnectionManager getConnectionManager()
Get the connection manager all Parsers use.
getContentType
public String getContentType()
Try to extract the content type from the HTTP header.
getEncoding
public String getEncoding()
Get the current encoding being used.
- The encoding used to convert characters.
getLine
public String getLine(int position)
Get the text line the position of the cursor lies on.
position
- The position to calculate for.
- The contents of the URL or file corresponding to the line number
containing the cursor position.
getLine
public String getLine(Cursor cursor)
Get the text line the position of the cursor lies on.
cursor
- The position to calculate for.
- The contents of the URL or file corresponding to the line number
containing the cursor position.
getSource
public Source getSource()
Get the source this page is reading from.
getText
public String getText()
Get all text read so far from the source.
- The text from the source.
getText
public void getText(StringBuffer buffer)
Put all text read so far from the source into the given buffer.
buffer
- The accumulator for the characters.
getText
public void getText(StringBuffer buffer,
int start,
int end)
throws IllegalArgumentException
Put the text identified by the given limits into the given buffer.
buffer
- The accumulator for the characters.
start
- The starting position, zero based.
end
- The ending position
(exclusive, i.e. the character at the ending position is not included),
zero based.
getText
public void getText(char[] array,
int offset,
int start,
int end)
throws IllegalArgumentException
Put the text identified by the given limits into the given array at the specified offset.
array
- The array of characters.
offset
- The starting position in the array where characters are to be placed.
start
- The starting position, zero based.
end
- The ending position
(exclusive, i.e. the character at the ending position is not included),
zero based.
getText
public String getText(int start,
int end)
throws IllegalArgumentException
Get the text identified by the given limits.
start
- The starting position, zero based.
end
- The ending position
(exclusive, i.e. the character at the ending position is not included),
zero based.
- The text from start to end.
getUrl
public String getUrl()
Get the URL for this page.
This is only available if the page has a connection
(getConnection() returns non-null), or the document base has
been set via a call to setUrl().
- The URL for the connection, or null if there is
no connection or the document base has not been set.
reset
public void reset()
Reset the page by resetting the source of characters.
row
public int row(int position)
Get the line number for a cursor.
position
- The character offset into the page.
- The line number the character is in.
row
public int row(Cursor cursor)
Get the line number for a cursor.
cursor
- The character offset into the page.
- The line number the character is in.
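Row and column lookups of this kind can be derived from a sorted list of line-start offsets with a binary search. The sketch below assumes zero-based rows and columns and a non-negative position; the class LineIndexSketch is hypothetical and is not the library's PageIndex.

```java
import java.util.Arrays;

final class LineIndexSketch {
    /** Sorted offsets of the first character of each line; element 0 is 0. */
    private final int[] lineStarts;

    LineIndexSketch(int... lineStarts) {
        this.lineStarts = lineStarts.clone();
    }

    /** Zero-based line number containing the given character offset. */
    int row(int position) {
        int found = Arrays.binarySearch(lineStarts, position);
        // A miss returns -(insertionPoint) - 1; the containing line is
        // the one just before the insertion point.
        return (found >= 0) ? found : -found - 2;
    }

    /** Zero-based offset of the position within its line. */
    int column(int position) {
        return position - lineStarts[row(position)];
    }
}
```

For line starts 0, 3 and 5, offset 4 falls on row 1 at column 1, and offset 3 is the first character of row 1.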
setBaseUrl
public void setBaseUrl(String url)
Sets the baseUrl.
url
- The base url for this page.
setConnection
public void setConnection(URLConnection connection)
throws ParserException
Set the URLConnection to be used by this page.
Starts reading from the given connection.
This also resets the current url.
connection
- The connection to use.
It will be connected by this method.
ParserException
- If the connect() method fails,
an I/O error occurs opening the input stream, or the character set
designated in the HTTP header is unsupported.
setConnectionManager
public static void setConnectionManager(ConnectionManager manager)
Set the connection manager to use.
manager
- The new connection manager.
setEncoding
public void setEncoding(String character_set)
throws ParserException
Begins reading from the source with the given character set.
If the current encoding is the same as the requested encoding,
this method is a no-op. Otherwise any subsequent characters read from
this page will have been decoded using the given character set.
Some magic happens here to obtain this result if characters have already
been consumed from this page.
Since a Reader cannot be dynamically altered to use a different character
set, the underlying stream is reset, a new Source is constructed
and a comparison made of the characters read so far with the newly
read characters up to the current position.
If a difference is encountered, or some other problem occurs,
an exception is thrown.
character_set
- The character set to use to convert bytes into
characters.
ParserException
- If a character mismatch occurs between
characters already provided and those that would have been returned
had the new character set been in effect from the beginning. An
exception is also thrown if the underlying stream cannot be reset
and read again.
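The re-read-and-compare step can be sketched over an in-memory byte array. This is only an illustration of the idea: the class ReencodeSketch and the method switchCharset are hypothetical, the whole input is decoded at once rather than streamed, and IllegalStateException stands in for ParserException.

```java
import java.nio.charset.Charset;

final class ReencodeSketch {
    /**
     * Decode bytes again with newCharset and check that the characters
     * already handed out (consumed) are unchanged under the new decoding.
     * Returns the characters that remain to be read.
     */
    static String switchCharset(byte[] bytes, String consumed, Charset newCharset) {
        String redecoded = new String(bytes, newCharset);
        if (!redecoded.startsWith(consumed)) {
            throw new IllegalStateException(
                    "characters already read differ under " + newCharset.name());
        }
        return redecoded.substring(consumed.length());
    }
}
```

The switch is harmless while only ASCII markup has been consumed, since such bytes decode identically in ISO-8859-1 and UTF-8. It fails once a multi-byte sequence has already been handed out under the old charset.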
setUrl
public void setUrl(String url)
Set the URL for this page.
This doesn't affect the contents of the page, just the interpretation
of relative links from this point forward.
toString
public String toString()
Display some of this page as a string.
- The last few characters the source read in.
ungetCharacter
public void ungetCharacter(Cursor cursor)
throws ParserException
Return (unread) a character to the page.
Handles end of lines (EOL) specially, retreating the cursor twice for
the '\r\n' case.
The cursor position is moved back by one (or two in the \r\n case).
cursor
- The position to 'unread' at.