java.lang.Object
  au.id.jericho.lib.html.Segment
    au.id.jericho.lib.html.Source
Represents a source HTML document.
Note that many of the useful functions which can be performed on the source document are defined in its superclass, Segment.
The Source object is itself a Segment which spans the entire document.
Most of the methods defined in this class are useful for determining the elements and tags surrounding or neighbouring a particular character position in the document.
IMPORTANT NOTE: Because HTML allows '<' characters within attribute values (see section 5.3.2 of the HTML spec), it is theoretically impossible to determine with certainty whether any given '<' character in a source document is the start of a tag without having parsed from the beginning of the document (which Jericho HTML Parser doesn't do).
For this reason, the parser may reject a start tag completely if its attributes are not properly formed, although it does try to provide some leniency.
In XHTML, such characters must be represented in attribute values as character entities (see section 3.1 of the XML spec).
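The ambiguity described in the note can be illustrated with a small standalone sketch. It does not use or reproduce Jericho's parsing logic; the class and method names are hypothetical. It compares a context-free scan for '<' with a scan that tracks quoted attribute values from the start of the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the Jericho HTML Parser API.
final class LessThanAmbiguity {
    /** Every index of '<' in the text, found without any knowledge of context. */
    static List<Integer> naiveCandidates(String text) {
        List<Integer> hits = new ArrayList<>();
        for (int i = text.indexOf('<'); i >= 0; i = text.indexOf('<', i + 1)) {
            hits.add(i);
        }
        return hits;
    }

    /** Indexes of '<' that open a tag when quotes are tracked from position 0. */
    static List<Integer> quoteAwareCandidates(String text) {
        List<Integer> hits = new ArrayList<>();
        boolean inTag = false;
        char quote = 0; // 0 means "not inside a quoted attribute value"
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!inTag) {
                if (c == '<') { hits.add(i); inTag = true; }
            } else if (quote != 0) {
                if (c == quote) quote = 0;   // closing quote
            } else if (c == '"' || c == '\'') {
                quote = c;                   // opening quote
            } else if (c == '>') {
                inTag = false;               // end of tag
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String html = "<a title=\"x<b>y\">link</a>";
        System.out.println(naiveCandidates(html));      // [0, 11, 21]
        System.out.println(quoteAwareCandidates(html)); // [0, 21]
    }
}
```

The '<' at index 11 sits inside the title attribute value, so only a scan that started at the beginning of the text can tell that it does not open a tag.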
See Also:
Segment
Constructor Summary
Source(java.lang.CharSequence text) | Constructs a new Source object with the specified text.
Method Summary
Segment | findEnclosingComment(int pos) | Returns a Segment spanning the HTML comment that encloses the specified position in the source document.
Element | findEnclosingElement(int pos) | Returns the most nested Element enclosing the specified position in the source document.
Element | findEnclosingElement(int pos, java.lang.String name) | Returns the most nested Element with the specified name enclosing the specified position in the source document.
StartTag | findEnclosingStartTag(int pos) | Returns the StartTag enclosing the specified position in the source document.
CharacterReference | findNextCharacterReference(int pos) | Returns the CharacterReference beginning at or immediately following the specified position in the source document.
StartTag | findNextComment(int pos) | Returns the Comment beginning at or immediately following the specified position in the source document.
EndTag | findNextEndTag(int pos) | Returns the EndTag beginning at or immediately following the specified position in the source document.
EndTag | findNextEndTag(int pos, java.lang.String name) | Returns the EndTag with the specified name beginning at or immediately following the specified position in the source document.
StartTag | findNextStartTag(int pos) | Returns the StartTag beginning at or immediately following the specified position in the source document.
StartTag | findNextStartTag(int pos, java.lang.String name) | Returns the StartTag with the specified name beginning at or immediately following the specified position in the source document.
StartTag | findNextStartTag(int pos, java.lang.String attributeName, java.lang.String value, boolean valueCaseSensitive) | Returns the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document.
Tag | findNextTag(int pos) | Returns the tag (either a StartTag or EndTag) beginning at or immediately following the specified position in the source document.
CharacterReference | findPreviousCharacterReference(int pos) | Returns the CharacterReference at or immediately preceding (or enclosing) the specified position in the source document.
EndTag | findPreviousEndTag(int pos, java.lang.String name) | Returns the EndTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.
StartTag | findPreviousStartTag(int pos) | Returns the StartTag at or immediately preceding (or enclosing) the specified position in the source document.
StartTag | findPreviousStartTag(int pos, java.lang.String name) | Returns the StartTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.
Element | getElementById(java.lang.String id) | Returns the Element with the specified id attribute value.
java.util.Iterator | getNextTagIterator(int pos) | Returns an iterator of Tag objects beginning at or immediately following the specified position in the source document.
void | ignoreWhenParsing(java.util.Collection segments) | Causes all of the segments in the specified collection to be ignored when parsing.
void | ignoreWhenParsing(int begin, int end) | Causes the specified range of the source text to be ignored when parsing.
Attributes | parseAttributes(int pos, int maxEnd) | Parses any Attributes starting at the specified position.
Attributes | parseAttributes(int pos, int maxEnd, int maxErrorCount) | Parses any Attributes starting at the specified position.
void | setLogWriter(java.io.Writer writer) | Sets the destination for log messages.
java.lang.String | toString() | Returns the source text as a String.
Methods inherited from class au.id.jericho.lib.html.Segment
charAt, compareTo, encloses, encloses, equals, findAllCharacterReferences, findAllComments, findAllElements, findAllElements, findAllStartTags, findAllStartTags, findAllStartTags, findFormControls, findFormFields, findWords, getBegin, getDebugInfo, getEnd, getSourceText, getSourceTextNoWhitespace, hashCode, ignoreWhenParsing, isComment, isWhiteSpace, length, parseAttributes, subSequence
Methods inherited from class java.lang.Object
getClass, notify, notifyAll, wait, wait, wait
Constructor Detail
public Source(java.lang.CharSequence text)
Constructs a new Source object with the specified text.
Parameters:
text - the source text.
Method Detail
public java.lang.String toString()
Returns the source text as a String.
If the original CharSequence supplied when this instance was constructed was not a String, the first conversion of the text to a String is cached for subsequent calls.
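The caching behaviour described above can be pictured with the following minimal sketch. It is a hypothetical stand-in rather than the Source implementation: the class name, fields and the conversion counter exist only for this illustration.

```java
// Hypothetical illustration only; not the Source class itself.
final class CachingTextHolder {
    private final CharSequence text;
    private String cachedString;  // null until the first conversion
    private int conversionCount;  // bookkeeping for the demonstration only

    CachingTextHolder(CharSequence text) {
        this.text = text;
        if (text instanceof String) {
            cachedString = (String) text; // already a String, nothing to convert
        }
    }

    @Override
    public String toString() {
        if (cachedString == null) {
            cachedString = text.toString(); // first call pays for the conversion
            conversionCount++;
        }
        return cachedString;
    }

    int conversionCount() {
        return conversionCount;
    }

    public static void main(String[] args) {
        CachingTextHolder holder = new CachingTextHolder(new StringBuilder("<p>hi</p>"));
        String first = holder.toString();
        String second = holder.toString();
        System.out.println(first == second);          // true
        System.out.println(holder.conversionCount()); // 1
    }
}
```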
Specified by:
toString in interface java.lang.CharSequence
Overrides:
toString in class Segment
Returns:
the source text as a String.
public Element getElementById(java.lang.String id)
Returns the Element with the specified id attribute value.
This simulates the script method getElementById defined in DOM HTML level 1.
This is equivalent to findNextStartTag(0,"id",id,true).getElement().
A well-formed HTML document should have no more than one element with any given id attribute value.
Calls to this method are not cached.
Parameters:
id - the id attribute value (case sensitive) to search for, must not be null.
Returns:
the Element with the specified id attribute value.
public StartTag findPreviousStartTag(int pos)
Returns the StartTag at or immediately preceding (or enclosing) the specified position in the source document.
If the specified position is within an HTML comment, the segment spanning the comment is returned.
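One way to read "at or immediately preceding (or enclosing)" is as the last tag whose begin position is less than or equal to pos, in contrast to the "at or immediately following" wording used by the findNext methods. The sketch below works on a plain sorted array of begin positions under that assumed reading; the class and method names are hypothetical and the code is not taken from the parser.

```java
// Hypothetical illustration only; positions stand in for tag begin offsets.
final class PositionSearch {
    /** Last begin position that is <= pos, or -1 if there is none. */
    static int previousAtOrBefore(int[] sortedBegins, int pos) {
        int result = -1;
        for (int begin : sortedBegins) {
            if (begin > pos) break;
            result = begin;
        }
        return result;
    }

    /** First begin position that is >= pos, or -1 if there is none. */
    static int nextAtOrAfter(int[] sortedBegins, int pos) {
        for (int begin : sortedBegins) {
            if (begin >= pos) return begin;
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] begins = {0, 19, 40};
        System.out.println(previousAtOrBefore(begins, 25)); // 19
        System.out.println(nextAtOrAfter(begins, 25));      // 40
    }
}
```

A position that falls inside a tag is covered by the "previous" search because that tag begins before the position.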
Parameters:
pos - the position in the source document from which to start the search.
Returns:
the StartTag immediately preceding the specified position in the source document, or null if none exists.
public StartTag findPreviousStartTag(int pos, java.lang.String name)
Returns the StartTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.
Start tags positioned within an HTML comment are ignored, but the comment segment itself is treated as a start tag.
Specifying a null name parameter is equivalent to findPreviousStartTag(pos).
Parameters:
pos - the position in the source document from which to start the search.
name - the name of the StartTag to search for.
Returns:
the StartTag with the specified name immediately preceding the specified position in the source document, or null if none exists.
public StartTag findNextStartTag(int pos)
Returns the StartTag beginning at or immediately following the specified position in the source document.
Start tags positioned within an HTML comment are ignored, but subsequent comment segments are treated as start tags.
Parameters:
pos - the position in the source document from which to start the search.
Returns:
the StartTag beginning at or immediately following the specified position in the source document, or null if none exists.
public StartTag findNextStartTag(int pos, java.lang.String name)
Returns the StartTag with the specified name beginning at or immediately following the specified position in the source document.
Start tags positioned within an HTML comment are ignored.
Specifying a null name parameter is equivalent to findNextStartTag(pos).
Specifying a name parameter ending in a colon (:) searches for all start tags in the specified XML namespace.
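The name rules above (null matches any start tag, a trailing colon matches a whole namespace prefix) can be sketched as a simple predicate. This is an illustration under stated assumptions, not the parser's code: the class and method are hypothetical, and the case-insensitive comparison is an assumption that this page does not confirm.

```java
import java.util.Locale;

// Hypothetical illustration only; not part of the Jericho HTML Parser API.
final class TagNameMatch {
    static boolean matches(String tagName, String searchName) {
        if (searchName == null) {
            return true; // a null name places no restriction on the tag name
        }
        // Assumption: names are compared without regard to case.
        String tag = tagName.toLowerCase(Locale.ROOT);
        String search = searchName.toLowerCase(Locale.ROOT);
        if (search.endsWith(":")) {
            return tag.startsWith(search); // namespace search, e.g. "o:" matches "o:p"
        }
        return tag.equals(search);
    }

    public static void main(String[] args) {
        System.out.println(matches("o:p", "o:"));    // true
        System.out.println(matches("option", "o:")); // false
        System.out.println(matches("div", null));    // true
    }
}
```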
Parameters:
pos - the position in the source document from which to start the search.
name - the name of the StartTag to search for.
Returns:
the StartTag with the specified name beginning at or immediately following the specified position in the source document, or null if none exists.
public StartTag findNextStartTag(int pos, java.lang.String attributeName, java.lang.String value, boolean valueCaseSensitive)
Returns the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document.
Calls to this method are not cached.
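The matching rules for this search (attribute name compared case insensitively, value compared according to valueCaseSensitive) can be sketched with regular expressions. This is a deliberately simplified, hypothetical stand-in: it handles only quoted values, does not allow '<' or '>' inside attribute values, and returns a character index instead of a StartTag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; far less lenient than a real HTML parser.
final class AttributeSearchSketch {
    // Simplified start tag: '<', a letter, then anything up to the next '>'.
    private static final Pattern START_TAG = Pattern.compile("<[A-Za-z][^<>]*>");
    // Simplified attribute: name="value" or name='value'.
    private static final Pattern ATTRIBUTE =
            Pattern.compile("([A-Za-z_:][-A-Za-z0-9_:.]*)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    /** Index of the first matching start tag beginning at or after pos, or -1. */
    static int indexOfStartTagWithAttribute(String text, int pos, String attributeName,
                                            String value, boolean valueCaseSensitive) {
        Matcher tag = START_TAG.matcher(text);
        boolean found = pos >= 0 && pos <= text.length() && tag.find(pos);
        while (found) {
            Matcher attribute = ATTRIBUTE.matcher(tag.group());
            while (attribute.find()) {
                String candidate = attribute.group(2) != null ? attribute.group(2) : attribute.group(3);
                boolean nameMatches = attribute.group(1).equalsIgnoreCase(attributeName);
                boolean valueMatches = valueCaseSensitive
                        ? candidate.equals(value)
                        : candidate.equalsIgnoreCase(value);
                if (nameMatches && valueMatches) {
                    return tag.start();
                }
            }
            found = tag.find(); // continue after the previous start tag
        }
        return -1;
    }

    public static void main(String[] args) {
        String html = "<p ID=\"intro\">a</p><div id=\"Main\">b</div>";
        System.out.println(indexOfStartTagWithAttribute(html, 0, "id", "main", false)); // 19
        System.out.println(indexOfStartTagWithAttribute(html, 0, "id", "main", true));  // -1
    }
}
```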
Parameters:
pos - the position in the source document from which to start the search.
attributeName - the attribute name (case insensitive) to search for, must not be null.
value - the value of the specified attribute to search for, must not be null.
valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
Returns:
the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document.
public StartTag findNextComment(int pos)
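Returns the Comment beginning at or immediately following the specified position in the source document.
If the specified position is within a comment, the comment following the enclosing comment is returned.
Parameters:
pos - the position in the source document from which to start the search.
Returns:
the Comment beginning at or immediately following the specified position in the source document, or null if none exists.
public EndTag findPreviousEndTag(int pos, java.lang.String name)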
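Returns the EndTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.
End tags positioned within an HTML comment are ignored.
Parameters:
pos - the position in the source document from which to start the search.
name - the name of the EndTag to search for, must not be null.
Returns:
the EndTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document, or null if none exists.
public EndTag findNextEndTag(int pos)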
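Returns the EndTag beginning at or immediately following the specified position in the source document.
End tags positioned within an HTML comment are ignored.
Parameters:
pos - the position in the source document from which to start the search.
Returns:
the EndTag beginning at or immediately following the specified position in the source document, or null if none exists.
public EndTag findNextEndTag(int pos, java.lang.String name)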
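Returns the EndTag with the specified name beginning at or immediately following the specified position in the source document.
End tags positioned within an HTML comment are ignored.
Parameters:
pos - the position in the source document from which to start the search.
name - the name of the EndTag to search for, must not be null.
Returns:
the EndTag with the specified name beginning at or immediately following the specified position in the source document, or null if none exists.
public java.util.Iterator getNextTagIterator(int pos)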
Returns an iterator of Tag objects beginning at or immediately following the specified position in the source document.
Tags positioned within an HTML comment are ignored, but the comment segments themselves are treated as start tags.
Parameters:
pos - the position in the source document from which to start the iteration.
Returns:
an iterator of Tag objects beginning at or immediately following the specified position in the source document.
public Tag findNextTag(int pos)
Returns the tag (either a StartTag or EndTag) beginning at or immediately following the specified position in the source document.
IMPLEMENTATION NOTE: Sequential tags in a document should be retrieved using the iterator from getNextTagIterator(int pos), as it is far more efficient than using multiple calls to this method.
Parameters:
pos - the position in the source document from which to start the search.
Returns:
the tag beginning at or immediately following the specified position in the source document, or null if none exists.
See Also:
getNextTagIterator(int pos)
public StartTag findEnclosingStartTag(int pos)
Returns the StartTag enclosing the specified position in the source document.
If the specified position is within an HTML comment, the segment spanning the comment is returned.
A segment is considered to enclose a character position x if segment.getBegin() <= x < segment.getEnd()
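The enclosure rule is a half-open interval test: the begin position is included and the end position is excluded. A minimal sketch, using a hypothetical HalfOpenSpan class in place of Segment:

```java
// Hypothetical illustration only; stands in for a Segment's begin and end.
final class HalfOpenSpan {
    final int begin; // inclusive
    final int end;   // exclusive

    HalfOpenSpan(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }

    /** True if begin <= x < end. */
    boolean encloses(int x) {
        return begin <= x && x < end;
    }

    public static void main(String[] args) {
        HalfOpenSpan span = new HalfOpenSpan(4, 9);
        System.out.println(span.encloses(4)); // true: begin is included
        System.out.println(span.encloses(9)); // false: end is excluded
    }
}
```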
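Parameters:
pos - the position in the source document.
Returns:
the StartTag enclosing the specified position in the source document, or null if the position is not within a StartTag.
public Segment findEnclosingComment(int pos)
Returns a Segment spanning the HTML comment that encloses the specified position in the source document.
A segment is considered to enclose a character position x if segment.getBegin() <= x < segment.getEnd()
Parameters:
pos - the position in the source document.
Returns:
a Segment spanning the HTML comment that encloses the specified position in the source document, or null if the position is not within a comment.
public Element findEnclosingElement(int pos)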
Returns the most nested Element enclosing the specified position in the source document.
If the specified position is within an HTML comment, the segment spanning the comment is returned.
A segment is considered to enclose a character position x if segment.getBegin() <= x < segment.getEnd()
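For properly nested elements, the "most nested" enclosing element is the shortest span that still encloses the position. The sketch below assumes that reading and models elements as hypothetical {begin, end} pairs; it is not how the parser locates elements.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; spans are {begin, end} pairs with end exclusive.
final class InnermostSpan {
    /** Shortest span enclosing pos, or null if no span encloses it. */
    static int[] innermostEnclosing(List<int[]> spans, int pos) {
        int[] best = null;
        for (int[] span : spans) {
            boolean encloses = span[0] <= pos && pos < span[1];
            if (encloses && (best == null || span[1] - span[0] < best[1] - best[0])) {
                best = span;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Spans for "<div><p>text</p></div>": div is [0,22), p is [5,16).
        List<int[]> spans = Arrays.asList(new int[] {0, 22}, new int[] {5, 16});
        System.out.println(Arrays.toString(innermostEnclosing(spans, 9)));  // [5, 16]
        System.out.println(Arrays.toString(innermostEnclosing(spans, 16))); // [0, 22]
    }
}
```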
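Parameters:
pos - the position in the source document.
Returns:
the most nested Element enclosing the specified position in the source document, or null if the position is not within an Element.
public Element findEnclosingElement(int pos, java.lang.String name)
Returns the most nested Element with the specified name enclosing the specified position in the source document.
Elements positioned within an HTML comment are ignored, but the comment segment itself is treated as an Element.
Parameters:
pos - the position in the source document.
name - the name of the Element to search for.
Returns:
the most nested Element with the specified name enclosing the specified position in the source document, or null if none exists.
public CharacterReference findPreviousCharacterReference(int pos)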
Returns the CharacterReference at or immediately preceding (or enclosing) the specified position in the source document.
Character references positioned within an HTML comment are NOT ignored.
Parameters:
pos - the position in the source document from which to start the search.
Returns:
the CharacterReference beginning at or immediately preceding the specified position in the source document, or null if none exists.
public CharacterReference findNextCharacterReference(int pos)
Returns the CharacterReference beginning at or immediately following the specified position in the source document.
Character references positioned within an HTML comment are NOT ignored.
Parameters:
pos - the position in the source document from which to start the search.
Returns:
the CharacterReference beginning at or immediately following the specified position in the source document, or null if none exists.
public Attributes parseAttributes(int pos, int maxEnd)
Parses any Attributes starting at the specified position.
This method is only used in the unusual situation where attributes exist outside of a start tag. The StartTag.getAttributes() method should be used in normal situations.
The returned Attributes segment will always begin at pos, and will end at the first occurrence of "/>" or ">" outside of a quoted attribute value, or at maxEnd, whichever comes first.
Only returns null if the segment contains a major syntactical error or more than the default maximum number of minor syntactical errors.
This is equivalent to parseAttributes(pos,maxEnd,Attributes.getDefaultMaxErrorCount()).
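The rule for where the attribute list ends can be sketched as a scan that skips quoted values. The helper below is hypothetical and covers only the end-position rule stated above (first "/>" or ">" outside quotes, capped by maxEnd, with -1 meaning no cap); it performs no attribute parsing or error counting.

```java
// Hypothetical illustration only; not the parser's attribute parsing logic.
final class AttributeListEnd {
    /** End position of an attribute list that starts at pos. */
    static int findEnd(CharSequence text, int pos, int maxEnd) {
        int limit = (maxEnd == -1 || maxEnd > text.length()) ? text.length() : maxEnd;
        char quote = 0; // 0 means "not inside a quoted attribute value"
        for (int i = pos; i < limit; i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0;  // closing quote
            } else if (c == '"' || c == '\'') {
                quote = c;                  // opening quote
            } else if (c == '>') {
                return i;                   // ">" outside quotes
            } else if (c == '/' && i + 1 < text.length() && text.charAt(i + 1) == '>') {
                return i;                   // "/>" outside quotes
            }
        }
        return limit; // no terminator before the limit
    }

    public static void main(String[] args) {
        String text = " id=\"a>b\" class='c' />rest";
        System.out.println(findEnd(text, 0, -1)); // 20
        System.out.println(findEnd(text, 0, 10)); // 10
    }
}
```

The '>' at index 6 is skipped because it lies inside the quoted id value.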
Parameters:
pos - the position in the source document at the beginning of the attribute list
maxEnd - the maximum end position of the attribute list, or -1 if no maximum
Returns:
the Attributes starting at the specified position, or null if too many errors occur while parsing.
See Also:
StartTag.getAttributes(), Segment.parseAttributes()
public Attributes parseAttributes(int pos, int maxEnd, int maxErrorCount)
Parses any Attributes starting at the specified position.
This method is only used in the unusual situation where attributes exist outside of a start tag. The StartTag.getAttributes() method should be used in normal situations.
Only returns null if the segment contains a major syntactical error or more than the specified number of minor syntactical errors.
The maxErrorCount argument overrides the default maximum number of minor errors allowed, which can be set using the Attributes.setDefaultMaxErrorCount(int) static method.
See parseAttributes(int pos, int maxEnd) for more information.
Parameters:
pos - the position in the source document at the beginning of the attribute list
maxEnd - the maximum end position of the attribute list, or -1 if no maximum
maxErrorCount - the maximum number of minor errors allowed while parsing
Returns:
the Attributes starting at the specified position, or null if too many errors occur while parsing.
See Also:
StartTag.getAttributes(), parseAttributes(int pos, int maxEnd)
public void ignoreWhenParsing(int begin, int end)
Causes the specified range of the source text to be ignored when parsing.
This method is usually used to exclude server tags or other non-HTML segments from the source text so that they do not interfere with the parsing of the surrounding HTML.
This is necessary because many server tags are used as attribute values and in other places within HTML tags, and very often contain characters that prevent the parser from recognising the surrounding tag.
For efficiency reasons, all segments to be ignored should be registered at once, without performing searches in between.
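Why a server tag can hide the surrounding HTML tag, and how ignoring its range helps, can be shown with a standalone sketch. It blanks out a range with spaces so that all other character positions stay the same. Whether the parser uses this masking technique internally is an assumption; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration only; masking is an assumed technique, not confirmed by this page.
final class IgnoredRangeMask {
    /** Copy of text with the characters in [begin, end) replaced by spaces. */
    static String mask(String text, int begin, int end) {
        char[] chars = text.toCharArray();
        Arrays.fill(chars, begin, end, ' ');
        return new String(chars);
    }

    public static void main(String[] args) {
        String html = "<a href=\"<%= url %>\">x</a>";
        int begin = html.indexOf("<%");
        int end = html.indexOf("%>", begin) + 2;
        String masked = mask(html, begin, end);
        System.out.println(html.indexOf('<', 1));   // 9: the server tag looks like a tag start
        System.out.println(masked.indexOf('<', 1)); // 22: the next real tag, </a>
    }
}
```

Because the masked copy has the same length as the original, positions found in it can be used directly against the original source text.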
Parameters:
begin - the beginning character position in the source text.
end - the end character position in the source text.
See Also:
Segment.ignoreWhenParsing()
public void ignoreWhenParsing(java.util.Collection segments)
Causes all of the segments in the specified collection to be ignored when parsing.
This is equivalent to calling Segment.ignoreWhenParsing() on each segment in the collection.
public void setLogWriter(java.io.Writer writer)
Sets the destination for log messages.
By default, the log writer is set to null, which suppresses log messages.
Parameters:
writer - the java.io.Writer where log messages will be sent