java.lang.Object
org.htmlparser.nodes.AbstractNode
An abstract node stores the source Page, the starting and ending position in the page, the parent and the list of children.
Field Summary | |
protected NodeList | children
protected Page | mPage
protected int | nodeBegin
protected int | nodeEnd
protected Node | parent
Constructor Summary | |
AbstractNode(Page page, int start, int end)
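Method Summary | |
abstract void | accept(NodeVisitor visitor)
Object | clone()
void | collectInto(NodeList list, NodeFilter filter)
void | doSemanticAction()
NodeList | getChildren()
int | getEndPosition()
Node | getFirstChild()
Node | getLastChild()
Node | getNextSibling()
Page | getPage()
Node | getParent()
Node | getPreviousSibling()
int | getStartPosition()
String | getText()
void | setChildren(NodeList children)
void | setEndPosition(int position)
void | setPage(Page page)
void | setParent(Node node)
void | setStartPosition(int position)
void | setText(String text)
String | toHtml()
abstract String | toHtml(boolean verbatim)
abstract String | toPlainTextString()
abstract String | toString()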
protected int nodeBegin
The beginning position of the tag in the line
protected int nodeEnd
The ending position of the tag in the line
public AbstractNode(Page page, int start, int end)
Create an abstract node with the page positions given. Remember the page and start & end cursor positions.
- Parameters:
- page - The page this tag was read from.
- start - The starting offset of this node within the page.
- end - The ending offset of this node within the page.
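The relationship between a node and its page offsets can be sketched as follows. This is a minimal illustration, not the library's implementation: PageSketch and SpanNodeSketch are hypothetical stand-ins, and treating the end offset as exclusive is an assumption.

```java
// Hypothetical stand-ins for illustration; these are NOT the org.htmlparser classes.
class PageSketch {
    private final String source;

    PageSketch(String source) {
        this.source = source;
    }

    // Returns the characters in [start, end); end is treated as exclusive (an assumption).
    String getText(int start, int end) {
        return source.substring(start, end);
    }
}

class SpanNodeSketch {
    private final PageSketch page;
    private final int nodeBegin;
    private final int nodeEnd;

    // Remember the page and the start and end offsets.
    SpanNodeSketch(PageSketch page, int start, int end) {
        this.page = page;
        this.nodeBegin = start;
        this.nodeEnd = end;
    }

    int getStartPosition() {
        return nodeBegin;
    }

    int getEndPosition() {
        return nodeEnd;
    }

    // Recover the node's source text from the page using the stored offsets.
    String toHtml() {
        return page.getText(nodeBegin, nodeEnd);
    }
}

class PositionSketchDemo {
    public static void main(String[] args) {
        PageSketch page = new PageSketch("<p>Hello</p>");
        SpanNodeSketch text = new SpanNodeSketch(page, 3, 8);
        System.out.println(text.toHtml()); // prints "Hello"
    }
}
```

Keeping only offsets in the node, rather than a copy of the text, means the page remains the single source of the characters.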
public abstract void accept(NodeVisitor visitor)
Visit this node.
- Parameters:
- visitor - The visitor that is visiting this node.
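Because accept is abstract, each concrete node type chooses which visitor callback to invoke. The sketch below shows that double-dispatch idea with hypothetical stand-in types (SketchVisitor, TextSketch, TagSketch); the callback names are assumptions and do not reflect the real NodeVisitor class.

```java
// Hypothetical stand-ins; the real NodeVisitor and node classes are not reproduced here.
interface SketchVisitor {
    void visitText(String text);

    void visitTag(String tagName);
}

abstract class VisitableNode {
    // Each concrete node decides which visitor callback applies to it.
    abstract void accept(SketchVisitor visitor);
}

class TextSketch extends VisitableNode {
    final String text;

    TextSketch(String text) {
        this.text = text;
    }

    @Override
    void accept(SketchVisitor visitor) {
        visitor.visitText(text);
    }
}

class TagSketch extends VisitableNode {
    final String tagName;

    TagSketch(String tagName) {
        this.tagName = tagName;
    }

    @Override
    void accept(SketchVisitor visitor) {
        visitor.visitTag(tagName);
    }
}

class AcceptSketch {
    // Visits the nodes in order and records which callback each one triggered.
    static String describe(VisitableNode... nodes) {
        final StringBuilder log = new StringBuilder();
        SketchVisitor visitor = new SketchVisitor() {
            @Override
            public void visitText(String text) {
                log.append("text:").append(text).append(';');
            }

            @Override
            public void visitTag(String tagName) {
                log.append("tag:").append(tagName).append(';');
            }
        };
        for (VisitableNode node : nodes) {
            node.accept(visitor);
        }
        return log.toString();
    }

    public static void main(String[] args) {
        // prints "tag:P;text:hi;"
        System.out.println(describe(new TagSketch("P"), new TextSketch("hi")));
    }
}
```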
public Object clone() throws CloneNotSupportedException
Clone this object. Exposes java.lang.Object clone as a public method.
- Returns:
- A clone of this object.
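Exposing Object.clone() as a public method is a standard Java idiom, sketched here. PositionedThing is a hypothetical class used only for illustration, and the sketch assumes the class implements Cloneable. The copy is shallow.

```java
// Hypothetical class, used only to illustrate the idiom.
class PositionedThing implements Cloneable {
    int start;
    int end;

    PositionedThing(int start, int end) {
        this.start = start;
        this.end = end;
    }

    // Widen Object.clone() from protected to public; fields are copied shallowly.
    @Override
    public Object clone() throws CloneNotSupportedException {
        return super.clone();
    }
}

class CloneSketch {
    // Returns true if changing the copy leaves the original untouched.
    static boolean independentCopy() {
        try {
            PositionedThing original = new PositionedThing(3, 9);
            PositionedThing copy = (PositionedThing) original.clone();
            copy.end = 20;
            return copy != original && original.end == 9 && copy.start == 3;
        } catch (CloneNotSupportedException e) {
            // Cannot happen here because PositionedThing implements Cloneable.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(independentCopy()); // prints "true"
    }
}
```

Since the copy is shallow, any object-typed fields (such as a child list) would be shared between the original and the clone unless a subclass copies them explicitly.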
public void collectInto(NodeList list, NodeFilter filter)
Collect this node and its child nodes (if applicable) into the list parameter, provided the node satisfies the filtering criteria. This mechanism allows powerful filtering code to be written very easily, without having to collect embedded tags separately. For example, when we try to get all the links on a page, it is not possible to get them at the top level, as many tags (like form tags) can contain links embedded in them. We could get the links out by checking whether the current node is a CompositeTag and going through its children; this method provides a convenient way to do that. Using collectInto(), programs get a lot shorter. The code to extract all links from a page would look like:

    NodeList collectionList = new NodeList();
    NodeFilter filter = new TagNameFilter("A");
    for (NodeIterator e = parser.elements(); e.hasMoreNodes();)
        e.nextNode().collectInto(collectionList, filter);

Thus, collectionList will hold all the link nodes, irrespective of how deeply the links are embedded. Another way to accomplish the same objective is:

    NodeList collectionList = new NodeList();
    NodeFilter filter = new TagClassFilter(LinkTag.class);
    for (NodeIterator e = parser.elements(); e.hasMoreNodes();)
        e.nextNode().collectInto(collectionList, filter);

This is slightly less specific because the LinkTag class may be registered for more than one node name, e.g. <LINK> tags too.
- Specified by:
- collectInto in interface Node
- Parameters:
- list - The node list to collect acceptable nodes into.
- filter - The filter to determine which nodes are retained.
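The recursive collection described for collectInto can be sketched without the library. SketchNode is a hypothetical stand-in and a java.util.function.Predicate plays the role of the filter; the real NodeList and NodeFilter types are not used, and how the library splits this work between node classes is not shown in this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for a parsed node; NOT the org.htmlparser classes.
class SketchNode {
    final String name;
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String name, SketchNode... kids) {
        this.name = name;
        children.addAll(Arrays.asList(kids));
    }

    // Add this node if it passes the filter, then recurse into the children.
    void collectInto(List<SketchNode> list, Predicate<SketchNode> filter) {
        if (filter.test(this)) {
            list.add(this);
        }
        for (SketchNode child : children) {
            child.collectInto(list, filter);
        }
    }
}

class CollectIntoSketch {
    // Builds FORM(A, DIV(A)) next to a top-level P and collects every "A" node.
    static List<SketchNode> collectLinks() {
        SketchNode form = new SketchNode("FORM",
                new SketchNode("A"),
                new SketchNode("DIV", new SketchNode("A")));
        SketchNode paragraph = new SketchNode("P");
        List<SketchNode> found = new ArrayList<>();
        for (SketchNode top : Arrays.asList(form, paragraph)) {
            top.collectInto(found, n -> n.name.equals("A"));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(collectLinks().size()); // prints "2"
    }
}
```

Both "A" nodes are found even though neither is at the top level, which is the case the description above is concerned with.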
public void doSemanticAction() throws ParserException
Perform the meaning of this tag. The default action is to do nothing.
- Specified by:
- doSemanticAction in interface Node
- Throws:
- ParserException - Not used. Provides for subclasses that may want to indicate an exceptional condition.
public NodeList getChildren()
Get the children of this node.
- Specified by:
- getChildren in interface Node
- Returns:
- The list of children contained by this node, if it's been set, null otherwise.
public int getEndPosition()
Gets the ending position of the node.
- Specified by:
- getEndPosition in interface Node
- Returns:
- The end position.
public Node getFirstChild()
Get the first child of this node.
- Specified by:
- getFirstChild in interface Node
- Returns:
- The first child in the list of children contained by this node, null otherwise.
public Node getLastChild()
Get the last child of this node.
- Specified by:
- getLastChild in interface Node
- Returns:
- The last child in the list of children contained by this node, null otherwise.
public Node getNextSibling()
Get the next sibling to this node.
- Specified by:
- getNextSibling in interface Node
- Returns:
- The next sibling to this node if one exists, null otherwise.
public Page getPage()
Get the page this node came from.
- Returns:
- The page that supplied this node.
public Node getParent()
Get the parent of this node. This will always return null when parsing without scanners, i.e. if semantic parsing was not performed. The object returned from this method can be safely cast to a CompositeTag.
- Returns:
- The parent of this node, if it's been set, null otherwise.
public Node getPreviousSibling()
Get the previous sibling to this node.
- Specified by:
- getPreviousSibling in interface Node
- Returns:
- The previous sibling to this node if one exists, null otherwise.
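One plausible way to derive sibling lookups is from the parent's child list, as sketched below. TreeNodeSketch is a hypothetical stand-in; this page does not show how the library actually implements getNextSibling() and getPreviousSibling(), so the approach is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a node with a parent and children; NOT the library class.
class TreeNodeSketch {
    final String label;
    TreeNodeSketch parent;
    final List<TreeNodeSketch> children = new ArrayList<>();

    TreeNodeSketch(String label) {
        this.label = label;
    }

    // Append a child and record this node as its parent.
    TreeNodeSketch add(TreeNodeSketch child) {
        child.parent = this;
        children.add(child);
        return this;
    }

    TreeNodeSketch getNextSibling() {
        return siblingAt(1);
    }

    TreeNodeSketch getPreviousSibling() {
        return siblingAt(-1);
    }

    // Look up the node at the given offset from this one in the parent's child list.
    private TreeNodeSketch siblingAt(int offset) {
        if (parent == null) {
            return null; // no parent, so no siblings
        }
        int index = parent.children.indexOf(this) + offset;
        if (index < 0 || index >= parent.children.size()) {
            return null; // ran off either end of the list
        }
        return parent.children.get(index);
    }
}

class SiblingSketchDemo {
    public static void main(String[] args) {
        TreeNodeSketch root = new TreeNodeSketch("root");
        TreeNodeSketch first = new TreeNodeSketch("first");
        TreeNodeSketch second = new TreeNodeSketch("second");
        root.add(first).add(second);
        System.out.println(first.getNextSibling().label); // prints "second"
    }
}
```

A node without a parent has no siblings in this sketch, which matches the note under getParent() that the parent is null when semantic parsing was not performed.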
public int getStartPosition()
Gets the starting position of the node.
- Specified by:
- getStartPosition in interface Node
- Returns:
- The start position.
public String getText()
Returns the text of the node.
- Returns:
- The text of this node. The default is null.
public void setChildren(NodeList children)
Set the children of this node.
- Specified by:
- setChildren in interface Node
- Parameters:
- children - The new list of children this node contains.
public void setEndPosition(int position)
Sets the ending position of the node.
- Specified by:
- setEndPosition in interface Node
- Parameters:
- position - The new end position.
public void setPage(Page page)
Set the page this node came from.
- Parameters:
- page - The page that supplied this node.
public void setParent(Node node)
Sets the parent of this node.
- Parameters:
- node - The node that contains this node. Must be a CompositeTag.
public void setStartPosition(int position)
Sets the starting position of the node.
- Specified by:
- setStartPosition in interface Node
- Parameters:
- position - The new start position.
public void setText(String text)
Sets the string contents of the node.
- Parameters:
- text - The new text for the node.
public String toHtml()
Return the HTML for this node. This should be the sequence of characters that were encountered by the parser that caused this node to be created. Where this breaks down is where broken nodes (tags and remarks) have been encountered and fixed. Applications reproducing html can use this method on nodes which are to be used or transferred as they were received or created.
- Returns:
- The sequence of characters that would cause this node to be returned by the parser or lexer.
public abstract String toHtml(boolean verbatim)
Return the HTML for this node. This should be the exact sequence of characters that were encountered by the parser that caused this node to be created. Where this breaks down is where broken nodes (tags and remarks) have been encountered and fixed. Applications reproducing html can use this method on nodes which are to be used or transferred as they were received or created.
- Parameters:
- verbatim - If true, return as close to the original page text as possible.
- Returns:
- The (exact) sequence of characters that would cause this node to be returned by the parser or lexer.
public abstract String toPlainTextString()
Returns a string representation of the node. It allows a simple string transformation of a web page, regardless of node type.
Typical application code (for extracting only the text from a web page) would then be simplified to:

    Node node;
    for (NodeIterator e = parser.elements(); e.hasMoreNodes();) {
        node = e.nextNode();
        System.out.println(node.toPlainTextString());
        // or do whatever processing you wish with the plain text string
    }
- Specified by:
- toPlainTextString in interface Node
- Returns:
- The 'browser' content of this node.
public abstract String toString()
Return a string representation of the node. Subclasses must define this method, and this is typically to be used in the manner
System.out.println(node)
- Returns:
- A textual representation of the node suitable for debugging
© 2005 Derrick Oswald. May 08, 2008
HTML Parser is an open source library released under LGPL.