In most cases, we'd like to believe you have no preprocessing to do in Zephyr. Steps like unzipping files and placing them in the correct HDFS location SHOULD be managed in more robust systems built for this purpose, like OODT or Nutch (or Oozie, or even just bash scripts). However, sometimes our data is anything but clean. Sometimes our data needs to be organized a little better, to be cleaned up solely in terms of its adherence to basic CSV or XML rules. A variety of tools exist for jobs like these, such as DataWrangler or OptiWrangle. Rather than write very complex parsers, it would be far nicer if we could offload this data cleanup and data enhancement step to another tool that handles it better. Preprocessors are brand new and still need a lot of work, but it is our hope that by integrating DataWrangler as a preprocessor, we can avoid complex custom parsers in favor of a generalized, configurable CsvParser - which means less code for us to maintain.
Our parser API is very simple, though it's possible this simplicity may cause issues for far more complex parsing
needs (if so, please contact us!). The parser interface is as follows:
public interface Parser {
    // The success type is the List<Pair<String, String>> of raw field name/value pairs
    // handed to the Schema mapping step; the byte[] holds the raw original data for
    // anything that could not be parsed.
    public ProcessingResult<List<Pair<String, String>>, byte[]> parse() throws IOException;
    public Parser newInstance(InputStream inputStream);
}
The method public Parser newInstance(InputStream inputStream) is a factory method that will create a new Parser object with the same configuration information as the original Parser. As you can see, it requires an InputStream to be provided for a new Parser to be created; this Parser then will operate over that InputStream and only that InputStream.
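To make that contract concrete, here is a minimal sketch of how a Parser implementation might honor newInstance(): the class name, its delimiter configuration, and the elided parse() body are illustrative assumptions, not actual Zephyr code.

public class DelimitedParser implements Parser {
    private final String delimiter;          // configuration carried into every new instance
    private final InputStream inputStream;   // the one InputStream this instance owns

    public DelimitedParser(String delimiter) {
        this(delimiter, null);
    }

    private DelimitedParser(String delimiter, InputStream inputStream) {
        this.delimiter = delimiter;
        this.inputStream = inputStream;
    }

    @Override
    public Parser newInstance(InputStream inputStream) {
        // Same configuration information as the original Parser, bound to a new stream.
        return new DelimitedParser(delimiter, inputStream);
    }

    @Override
    public ProcessingResult<List<Pair<String, String>>, byte[]> parse() throws IOException {
        // Actual parsing elided from this sketch; see the ProcessingResult discussion below.
        throw new UnsupportedOperationException("parsing elided in this sketch");
    }
}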
As you can see from the ProcessingResult<List<Pair<String, String>>, byte[]> parse() method, it returns a custom, parameterized object called ProcessingResult. ProcessingResult is a pretty simple class; the first parameter defines the type of data that comes out if the parse() method was successful, the second parameter defines the "raw original data" that failed our processing step, and hidden inside the object is a Throwable - the reason why our parse call may have failed. If parse() cannot be completed because we are finished with our InputStream, we simply return null.
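The class described above might look something like the following sketch; the accessor and factory names here are assumptions for illustration, not necessarily those of the real ProcessingResult.

public class ProcessingResult<T, R> {
    private final T processedData;   // populated when parse() succeeds
    private final R rawData;         // the "raw original data" kept when processing fails
    private final Throwable cause;   // why the parse call failed, if it did

    private ProcessingResult(T processedData, R rawData, Throwable cause) {
        this.processedData = processedData;
        this.rawData = rawData;
        this.cause = cause;
    }

    public static <T, R> ProcessingResult<T, R> success(T processedData) {
        return new ProcessingResult<T, R>(processedData, null, null);
    }

    public static <T, R> ProcessingResult<T, R> failure(R rawData, Throwable cause) {
        return new ProcessingResult<T, R>(null, rawData, cause);
    }

    public boolean wasProcessedCorrectly() {
        return cause == null;
    }

    public T getProcessedData() { return processedData; }
    public R getRawData() { return rawData; }
    public Throwable getCause() { return cause; }
}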
A Schema is our Java class that defines what data we expect to be provided from the parser (in List<Pair<String, String>> form) and what should happen to each of those incoming fields: pre-validation, normalization, post-validation, and canonicalization.
Vital Note: Our schemas allow you to create *multiple* schema entries for the same incoming field. You would use this capability if you had a single data field that needed to be split out into N fields, each with their own validation routines and normalization procedures (an example would be if you wanted to explicitly write out the year, month, and day of a YYYY-MM-DD formatted field, or split a single latitude/longitude field into separate fields).
Pre-validation is most often simply a not-empty check. We ultimately use it to avoid running 0..N normalizers over data that doesn't even exist - wasted processing time. However, you could write your own custom Validator that implements the following interface, to do whatever you wish!
public interface Validator {
    boolean isValid(final String value);
}
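As an example, here is a hedged sketch of a custom Validator that accepts only values parseable as a decimal latitude; the class name is illustrative and not part of Zephyr.

public class LatitudeValidator implements Validator {
    @Override
    public boolean isValid(final String value) {
        if (value == null || value.isEmpty()) {
            return false;
        }
        try {
            final double latitude = Double.parseDouble(value);
            return latitude >= -90.0 && latitude <= 90.0;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}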
As we mentioned above, we can support 0..N normalizers. Some data we want to take as it is - but other data, most commonly things like "DateTime", comes in a variety of formats. Rather than write date conversion routines into each of our analytics, it makes far more sense to normalize our DateTime into a standard format (like ISO 8601). The same holds true for things like converting DMS to Decimal Degrees, or converting all distances into miles. The reason we support 0..N is the hope that you can chain Normalizers together to achieve the desired functionality instead of writing a custom normalizer for every new field. "Generally, when it makes sense to generalize, do, but not a moment before. For the most part."
Normalizers, like validators, have a pretty simple contract - configuration information for Normalizers (and
Validators) would go into their constructor in their Schema definition:
public interface Normalizer {
    String normalize(String value) throws NormalizationException;
}
Vital Note: If a field fails Normalization with a NormalizationException, it will not be added to the resulting data structure - and if it is a required field, this may result in the dropping of the row.
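Below is a hedged sketch of a Normalizer that rewrites a few known date formats into ISO 8601. The class name, the constructor-supplied input patterns, and the assumption that NormalizationException has a String constructor are illustrative; only the Normalizer contract above comes from Zephyr.

public class Iso8601DateNormalizer implements Normalizer {
    // Per the contract above, configuration arrives through the constructor in the Schema definition.
    private final String[] inputPatterns;

    public Iso8601DateNormalizer(final String... inputPatterns) {
        this.inputPatterns = inputPatterns;
    }

    @Override
    public String normalize(String value) throws NormalizationException {
        for (String pattern : inputPatterns) {
            try {
                java.text.SimpleDateFormat inputFormat = new java.text.SimpleDateFormat(pattern);
                inputFormat.setLenient(false);
                java.util.Date parsed = inputFormat.parse(value);
                return new java.text.SimpleDateFormat("yyyy-MM-dd").format(parsed);
            } catch (java.text.ParseException e) {
                // Not this format; try the next configured pattern.
            }
        }
        // Per the Vital Note above, throwing keeps this field out of the resulting data structure.
        throw new NormalizationException("Unrecognized date format: " + value);
    }
}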
After pre-validation and after our normalization routines, we have the post-validation step. It may very well be the exact same check as the pre-validation routine - ultimately, it is meant to ensure that your normalization routines didn't produce data far from what you actually expected. It's a double check against human error!
Canonicalization is the step where we turn our incoming data's name from, say, "text", into "textual-data-element". Or "lat" into "latitude". Or "email" into "//person//contact//email". Ultimately, the finalized nomenclature is up to you and your organization. You may want to go all out and back your canonicalization with an ontology - or you may be fine with a taxonomy. Or you may be okay with a laissez-faire, ad-hoc, anything-goes process. Because we want to support the more rigorous approaches, even the ad-hoc process must go through the CatalogService lookup phase. Currently, our default, provided CatalogService behaves as a simple pass-through. It is expected that you will implement a different CatalogService implementation, possibly backed by a more rigorous, peer-reviewed, curated system, for better interoperability between groups in your organization.
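As a hedged sketch only (the real CatalogService interface and its method names may differ), a curated replacement for the pass-through default might look like this, falling back to pass-through behavior when no mapping is known:

public class MapBackedCatalogService implements CatalogService {
    private final java.util.Map<String, String> canonicalNames;

    public MapBackedCatalogService(final java.util.Map<String, String> canonicalNames) {
        this.canonicalNames = canonicalNames;
    }

    // Assumed single-method lookup contract: incoming field name in, canonical name out.
    public String lookup(final String incomingName) {
        if (canonicalNames.containsKey(incomingName)) {
            return canonicalNames.get(incomingName);
        }
        return incomingName;   // pass-through behavior, like the provided default
    }
}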
In the entirety of this Schema mapping step, we have been operating over a List<Pair<String, String>>; the end result of the step is a Record.
The Record is simply a collection of Entries that represents a record, event, or row. It is Iterable over its Entries, so you can walk a Record with a simple for-each loop, as sketched below.
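For example (a hedged illustration: getLabel() and getValue() are assumed Entry accessor names, not confirmed parts of the Zephyr API):

for (Entry entry : record) {
    System.out.println(entry.getLabel() + " = " + entry.getValue());
}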
Enrichment is any activity where we take one or more fields of a Record and perform a micro analytic activity upon them. Enrichment is row/record/event specific - an Enricher isn't judging other rows/records/events against the one it is enriching; it is enriching a Record with the data that Record already contains. These enrichers are chained in sequence and should not expect thrown exceptions to cease further enrichment activities. Enrichers should not be seen as providing necessary fields, but instead as trying to make the Record more useful for further analytics, if they can. Consider them a "nicety".
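The chaining semantics might be driven by a loop along these lines (a hedged sketch: the enrich(Record) method shape is an assumption, not the actual Zephyr plumbing):

for (Enricher enricher : enrichers) {
    try {
        enricher.enrich(record);
    } catch (Exception e) {
        // A failed enricher is a lost nicety, not a lost record:
        // note the failure and let the remaining enrichers run.
        System.err.println("Enricher " + enricher.getClass().getSimpleName() + " failed: " + e.getMessage());
    }
}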
Output is an interesting step. Right now, we provide a few capabilities in Zephyr Core for turning a Record into a "Hive Table Formatted" file, with Entries reordered to fit our outgoing table structure and null fields written out (generally speaking, Records don't often contain empty fields unless you specifically ask for them by using the AlwaysValidValidator; they're extra data and unnecessary waste, especially in platforms that operate over network IO instead of in memory). However, it's also possible that you may want to write to Accumulo or some other system - or even output your data to HDFS in a completely different file format. In the case of file formats, we have you covered with our org.zephyr.output.formatter.OutputFormatter interface, which is as follows:
public interface OutputFormatter {
    byte[] formatRecord(Record record);
}
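For instance, a hedged sketch of a formatter that writes each Record as a tab-delimited line might look like the following; Entry's getValue() accessor is an assumption about the surrounding API rather than a confirmed detail:

public class TabDelimitedOutputFormatter implements OutputFormatter {
    @Override
    public byte[] formatRecord(Record record) {
        StringBuilder line = new StringBuilder();
        boolean first = true;
        for (Entry entry : record) {   // Record is Iterable over its Entries
            if (!first) {
                line.append('\t');
            }
            line.append(entry.getValue() == null ? "" : entry.getValue());
            first = false;
        }
        line.append('\n');
        return line.toString().getBytes(java.nio.charset.StandardCharsets.UTF_8);
    }
}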
Currently, most of our focus has been on ensuring we can output data to HDFS. Our currently supported platforms, MapReduce and Standalone, both expect us to be dealing in byte arrays for our output. However, there may come a time when we need to support a more abstracted view - in essence, we're asking you to be cognizant that this portion of our API may change. It, like our Canonicalization / NamingService, needs to be more robust.