Jericho HTML Parser is a simple but powerful Java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It can also determine the data structures represented in an HTML form.
It is an open source library released under the GNU Lesser General Public License (LGPL). You are therefore free to use it in commercial applications subject to the terms detailed in the licence document.
For downloads, support, updates and release notes visit the SourceForge.net project page at http://sourceforge.net/projects/jerichohtml/
Please let me know if you are using the library in your own project or find it useful in any way. You can also rate it at http://freshmeat.net/projects/jerichohtml/
All classes and methods have been comprehensively documented in the javadocs.
The package description contains a brief overview of how to use the package.
At this time no files have been submitted into CVS. If others are interested in extending or porting the library, a CVS repository will be made available.
The library distinguishes itself from other HTML parsers by its four major features:
FormField objects can automatically be generated from the source document. These provide a very useful means for determining how to store and present data that is submitted from an arbitrary HTML form.
Segment.ignoreWhenParsing()
)
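The idea of manipulating parts of a document while reproducing everything else verbatim can be pictured with a small sketch in plain Java. It does not use the library's own classes: the SpanEditor class, its Edit type and its methods are hypothetical names chosen for illustration. It assumes that edits are recorded as character positions in the source text and that all text outside those positions is copied through unchanged.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not part of the Jericho HTML Parser API.
class SpanEditor {
    // A replacement for the half-open character range [begin, end) of the source.
    static final class Edit {
        final int begin;
        final int end;
        final String replacement;

        Edit(int begin, int end, String replacement) {
            this.begin = begin;
            this.end = end;
            this.replacement = replacement;
        }
    }

    private final String source;
    private final List<Edit> edits = new ArrayList<>();

    SpanEditor(String source) {
        this.source = source;
    }

    // Records a replacement by position; the source text itself is never modified.
    void replace(int begin, int end, String replacement) {
        if (begin < 0 || end < begin || end > source.length()) {
            throw new IllegalArgumentException("span out of range");
        }
        edits.add(new Edit(begin, end, replacement));
    }

    // Builds the output: untouched text is copied verbatim, edited spans are swapped in.
    String output() {
        List<Edit> sorted = new ArrayList<>(edits);
        sorted.sort(Comparator.comparingInt(e -> e.begin));
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Edit e : sorted) {
            if (e.begin < pos) {
                throw new IllegalStateException("overlapping spans");
            }
            out.append(source, pos, e.begin);
            out.append(e.replacement);
            pos = e.end;
        }
        out.append(source, pos, source.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Deliberately invalid HTML plus a server tag: both pass through unchanged.
        String html = "<p>Hello <b>world</i> <% tag %></p>";
        SpanEditor editor = new SpanEditor(html);
        int begin = html.indexOf("world");
        editor.replace(begin, begin + "world".length(), "there");
        System.out.println(editor.output());
    }
}
```

Because nothing is re-serialised from a tree, the mismatched tags and the server tag in the example string survive exactly as written; only the recorded span changes.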
The samples directory in the download package contains sample programs for performing common tasks. The .bat files can be run directly on an MS-Windows operating system, or the following syntax can be used on a UNIX-based operating system from the samples directory:
java -classpath bin:../lib/jericho-html-x.x.jar ProgramName
where x.x is the current release number and ProgramName is the name of the sample program to run.
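The classpath separator depends on the platform: the java launcher expects ":" on UNIX-based systems and ";" on MS-Windows. The following minimal sketch shows how the command line could be assembled for either platform. The SampleCommand class and its methods are hypothetical helpers, not part of the library; File.pathSeparator is the standard Java constant holding the separator for the platform the JVM is running on.

```java
import java.io.File;

// Hypothetical helper for illustration; not part of the download package.
class SampleCommand {
    // Joins classpath entries with the given separator (":" on UNIX, ";" on MS-Windows).
    static String classpath(String separator, String... entries) {
        return String.join(separator, entries);
    }

    // Builds the command line for running a sample program from the samples directory.
    static String command(String separator, String release, String programName) {
        String cp = classpath(separator, "bin", "../lib/jericho-html-" + release + ".jar");
        return "java -classpath " + cp + " " + programName;
    }

    public static void main(String[] args) {
        // "x.x" is the release-number placeholder used above; substitute the real number.
        System.out.println(command(File.pathSeparator, "x.x", "DisplayAllElements"));
    }
}
```

Passing the separator explicitly, rather than always reading File.pathSeparator, makes it possible to print the command for a platform other than the one the helper runs on.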
The following sample programs are available:
ConvertStyleSheets | Demonstrates how to detect all external style sheets and place them inline into the document. |
DisplayAllElements | Demonstrates the behaviour of the library when retrieving all elements from a document containing a mix of normal HTML, different types of server tags, and badly formatted HTML. |
DisplayFormFields | Demonstrates the use of the Segment.findFormFields() method. |
DisplaySpecialTags | Demonstrates how to search for special tags such as document type declarations, XML declarations, processing instructions, common server tags, PHP tags, Mason tags, and HTML comments. |
JSPTest | Demonstrates how to parse a document containing JSP tags without the server tags interfering with the syntax of the HTML. |
SplitLongLines | Demonstrates how to reformat a document so that lines exceeding a certain number of characters are split into multiple lines. |
Note that although the library does a good job of analysing documents containing invalid or badly formatted HTML in areas irrelevant to the analysis, any attempt to analyse the badly formatted HTML itself will yield unpredictable results, which may or may not correspond with the interpretation of the majority of user agents. Furthermore, the behaviour of the library in relation to badly formatted HTML is not guaranteed to remain consistent in future versions. An exception to this is where any of the sample files containing badly formatted HTML produce particular results in any of the sample applications.
The build and sample files are implemented as DOS .bat files only. This is because I wanted to avoid the need to install Ant for such a simple library. Sorry to all the UNIX users for the inconvenience, but the batch files really don't do anything complicated anyway.
The javadoc compiler in j2sdk 1.4.0 has a problem with the first line of documentation in the Element.isInline() and Element.isBlock() methods, which causes an exception to be thrown. This apparent bug in the javadoc processor has been fixed in j2sdk 1.4.2.
This package was originally written in the latter half of 2002. At that time I evaluated six other parsers, none of which were capable of achieving my aims. Most couldn't reproduce a typical HTML document without change, none could reproduce a source document containing badly formatted or non-HTML components without change, and none provided a means to track the positions of nodes in the source text. A list of these parsers and a brief description follows, but please note that I have not revised this analysis since before this package was written. Please let me know if there are any errors.