Jericho HTML Parser

Jericho HTML Parser is a simple but powerful Java library for analysing and manipulating parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It can also determine the data structures represented in an HTML form.
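To illustrate what changing one part of a document while reproducing the rest verbatim means, here is a minimal sketch in plain Java. It does not use the library or reflect how the library is implemented. The class name VerbatimEdit and the method replaceTitle are hypothetical, and the regular expression is only an assumption chosen to keep the example short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of position-based editing; it does not use the library.
class VerbatimEdit {
    // Replaces the text between the first <title> and </title> tags and copies
    // every other character of the source unchanged, including malformed markup.
    static String replaceTitle(String source, String newTitle) {
        Matcher m = Pattern.compile("<title>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(source);
        if (!m.find()) {
            return source; // nothing recognised: the output equals the input
        }
        int begin = m.start(1); // offset of the first character of the title text
        int end = m.end(1);     // offset just past the last character of the title text
        return source.substring(0, begin) + newTitle + source.substring(end);
    }

    public static void main(String[] args) {
        String source = "<html><title>Old</title><p>unclosed <b>bad</html>";
        System.out.println(replaceTitle(source, "New"));
    }
}
```

Because the edit is expressed as a pair of character offsets into the source, the unclosed tags after the title pass through untouched.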

It is an open source library released under the GNU Lesser General Public License (LGPL). You are therefore free to use it in commercial applications subject to the terms detailed in the licence document.

For downloads, support, updates and release notes visit the SourceForge.net project page at http://sourceforge.net/projects/jerichohtml/

Please let me know if you are using the library in your own project or find it useful in any way. You can also rate it at http://freshmeat.net/projects/jerichohtml/

All classes and methods have been comprehensively documented in the javadocs.

The package description contains a brief overview of how to use the package.

At this time no files have been submitted into CVS. If others are interested in extending or porting the library, a CVS repository will be made available.

Features

The library distinguishes itself from other HTML parsers by its four major features:

Sample Programs

The samples directory in the download package contains sample programs for performing common tasks. The .bat files can be run directly on an MS-Windows operating system. On a UNIX-based operating system, use the following syntax from the samples directory (note the colon as classpath separator, where Windows uses a semicolon):

java -classpath bin:../lib/jericho-html-x.x.jar ProgramName

where x.x is the current release number and ProgramName is the name of the sample program to run.
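The classpath separator is platform-specific: a semicolon on MS-Windows and a colon on UNIX-like systems. As a minimal sketch, the command line can be assembled portably with java.io.File.pathSeparator. The class SampleCommand and its build method are hypothetical helpers and are not part of the download package.

```java
import java.io.File;

// Hypothetical helper, not part of the Jericho distribution.
class SampleCommand {
    // Builds the command line for running a sample from the samples directory.
    // 'separator' is the classpath separator: ";" on MS-Windows, ":" on UNIX.
    static String build(String separator, String version, String programName) {
        String classpath = "bin" + separator + "../lib/jericho-html-" + version + ".jar";
        return "java -classpath " + classpath + " " + programName;
    }

    public static void main(String[] args) {
        // File.pathSeparator holds the correct separator for the current platform.
        // "x.x" is the release-number placeholder used in the text above.
        System.out.println(build(File.pathSeparator, "x.x", "DisplayAllElements"));
    }
}
```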

The following sample programs are available:

ConvertStyleSheets: Demonstrates how to detect all external style sheets and place them inline into the document.
DisplayAllElements: Demonstrates the behaviour of the library when retrieving all elements from a document containing a mix of normal HTML, different types of server tags, and badly formatted HTML.
DisplayFormFields: Demonstrates the use of the Segment.findFormFields() method.
DisplaySpecialTags: Demonstrates how to search for special tags such as document type declarations, XML declarations, processing instructions, common server tags, PHP tags, Mason tags, and HTML comments.
JSPTest: Demonstrates how to parse a document containing JSP tags without the server tags interfering with the syntax of the HTML.
SplitLongLines: Demonstrates how to reformat a document so that lines exceeding a certain number of characters are split into multiple lines.
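
As a rough idea of the line-splitting task, the following sketch breaks a single line into pieces no longer than a given limit. It is not the SplitLongLines sample from the download package: the class LineSplitter is hypothetical and ignores HTML structure entirely, so unlike a parser-based approach it cannot tell whether a break falls inside a tag, an attribute value, or preformatted text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the SplitLongLines sample shipped with the library.
class LineSplitter {
    // Splits 'line' into pieces of at most maxLength characters, breaking at the
    // last space within the limit when one exists, otherwise in the middle of a word.
    static List<String> split(String line, int maxLength) {
        if (maxLength < 1) {
            throw new IllegalArgumentException("maxLength must be positive");
        }
        List<String> pieces = new ArrayList<>();
        String rest = line;
        while (rest.length() > maxLength) {
            int breakAt = rest.lastIndexOf(' ', maxLength);
            if (breakAt <= 0) {
                // No usable space within the limit: break mid-word.
                pieces.add(rest.substring(0, maxLength));
                rest = rest.substring(maxLength);
            } else {
                // Break at the space and drop it from the output.
                pieces.add(rest.substring(0, breakAt));
                rest = rest.substring(breakAt + 1);
            }
        }
        pieces.add(rest);
        return pieces;
    }

    public static void main(String[] args) {
        for (String piece : split("<img src=\"logo.png\" alt=\"Logo\" width=\"88\" height=\"31\">", 30)) {
            System.out.println(piece);
        }
    }
}
```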

Handling of Invalid or Badly Formatted HTML

Note that although the library does a good job of analysing documents containing invalid or badly formatted HTML in areas irrelevant to the analysis, any attempt to analyse the badly formatted HTML itself will yield unpredictable results, which may or may not correspond with the interpretation of the majority of user agents. Furthermore, the behaviour of the library in relation to badly formatted HTML is not guaranteed to remain consistent in future versions. An exception to this is where any of the sample files containing badly formatted HTML produce particular results in any of the sample applications.

Building

The build and sample files are implemented as DOS .bat files only. This is because I wanted to avoid the need to install Ant for such a simple library. Sorry to all the UNIX users for the inconvenience, but the batch files really don't do anything complicated anyway.

The javadoc compiler in j2sdk 1.4.0 has a problem with the first line of documentation in the Element.isInline() and Element.isBlock() methods which causes an exception to be thrown. This apparent bug in the javadoc processor has been fixed in j2sdk 1.4.2.

Alternative HTML Parsers

This package was originally written in the latter half of 2002. At that time I evaluated six other parsers, none of which were capable of achieving my aims. Most couldn't reproduce a typical HTML document without change, none could reproduce a source document containing badly formatted or non-HTML components without change, and none provided a means to track the positions of nodes in the source text. A list of these parsers and a brief description of each follows, but please note that I have not revised this analysis since before this package was written. Please let me know if there are any errors.