com.ebay.erl.mobius.core.builder
Class TSVDatasetBuilder

java.lang.Object
  extended by com.ebay.erl.mobius.core.builder.AbstractDatasetBuilder<TSVDatasetBuilder>
      extended by com.ebay.erl.mobius.core.builder.TSVDatasetBuilder

public class TSVDatasetBuilder
extends AbstractDatasetBuilder<TSVDatasetBuilder>

Represents text-based and line-oriented files on HDFS.

Each line is split on the delimiter (tab by default), and the resulting values are assigned the schema supplied when the builder is created.

If a line has fewer values than the schema has columns, the missing columns are assigned a null value.

If a line has more values than the schema has columns, the extra values are put into the tuple under the name IDX_$i, where $i starts from the length of the given schema.
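
The column-assignment rules above can be sketched in plain Java. This is only an illustration, not the Mobius TSVMapper: the class TsvLineSketch and its parse method are hypothetical names, and treating the delimiter as a literal string and keeping trailing empty values are assumptions this page does not confirm.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative stand-in only: this is not the Mobius TSVMapper.
final class TsvLineSketch {

    // Split one line on a literal delimiter and name the values after the schema.
    static Map<String, String> parse(String line, String delimiter, String[] schema) {
        // Assumption: the delimiter is literal text; limit -1 keeps trailing empty values.
        String[] values = line.split(Pattern.quote(delimiter), -1);
        Map<String, String> tuple = new LinkedHashMap<>();

        // Schema columns: use the value if the line has one, otherwise null.
        for (int i = 0; i < schema.length; i++) {
            tuple.put(schema[i], i < values.length ? values[i] : null);
        }
        // Extra values: named IDX_$i, with $i starting at the schema length.
        for (int i = schema.length; i < values.length; i++) {
            tuple.put("IDX_" + i, values[i]);
        }
        return tuple;
    }

    public static void main(String[] args) {
        String[] schema = {"ID", "NAME", "PRICE"};
        // Fewer values than columns: prints {ID=1, NAME=book, PRICE=null}
        System.out.println(parse("1\tbook", "\t", schema));
        // More values than columns: prints {ID=1, NAME=book, PRICE=9.99, IDX_3=USD, IDX_4=new}
        System.out.println(parse("1\tbook\t9.99\tUSD\tnew", "\t", schema));
    }
}
```

With a three-column schema, the first extra value is named IDX_3 because $i starts at the schema length rather than at zero.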

This product is licensed under the Apache License, Version 2.0, available at http://www.apache.org/licenses/LICENSE-2.0. This product contains portions derived from Apache Hadoop, which is licensed under the Apache License, Version 2.0, available at http://hadoop.apache.org. © 2007 – 2012 eBay Inc., Evan Chiu, Woody Zhou, Neel Sundaresan


Field Summary
 
Fields inherited from class com.ebay.erl.mobius.core.builder.AbstractDatasetBuilder
computedColumns, datasetName, mobiusJob
 
Constructor Summary
protected TSVDatasetBuilder(MobiusJob job, java.lang.String datasetName)
           
 
Method Summary
 Dataset buildFromPreviousJob(org.apache.hadoop.mapred.JobConf prevJob, java.lang.Class<? extends org.apache.hadoop.mapred.FileOutputFormat> prevJobOutputFormat, java.lang.String[] schema)
          Called by the Mobius engine to build a dataset from a previous Mobius job; users should not call this method.
protected  Dataset newDataset(java.lang.String datasetName)
          Create a new Dataset; the returned Dataset has no state at all (no paths, constraints, etc.).
static TSVDatasetBuilder newInstance(MobiusJob job, java.lang.String name, java.lang.String[] schema)
          Create a new instance of TSVDatasetBuilder to build a text-based dataset.
 TSVDatasetBuilder setDelimiter(java.lang.String delimiter)
          Specify the delimiter for the underlying text file.
 TSVDatasetBuilder setMapper(java.lang.Class<? extends TSVMapper> mapper)
          Change the mapper implementation (the default is TSVMapper); call this method when the parsing logic in TSVMapper doesn't meet your requirements.
 
Methods inherited from class com.ebay.erl.mobius.core.builder.AbstractDatasetBuilder
addComuptedColumn, addInputPath, addInputPath, build, checkTouchFile, constraint, getDataset, setSchema
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TSVDatasetBuilder

protected TSVDatasetBuilder(MobiusJob job,
                            java.lang.String datasetName)
                     throws java.io.IOException
Throws:
java.io.IOException

Method Detail

newInstance

public static TSVDatasetBuilder newInstance(MobiusJob job,
                                            java.lang.String name,
                                            java.lang.String[] schema)
                                     throws java.io.IOException
Create a new instance of TSVDatasetBuilder to build a text-based dataset.

By default, TSVMapper reads the underlying text file, splits each line on tab, and converts it into a Tuple with the given schema. See TSVMapper for more detail.

Parameters:
job - the Mobius job that contains the analysis flow.
name - the name of the dataset to be built.
schema - the schema of the underlying dataset.
Returns:
a new TSVDatasetBuilder instance.
Throws:
java.io.IOException

setDelimiter

public TSVDatasetBuilder setDelimiter(java.lang.String delimiter)
                               throws java.io.IOException
Specify the delimiter for the underlying text file.

The default delimiter is tab.

Parameters:
delimiter - the delimiter used to split each line of the text file.
Returns:
the TSVDatasetBuilder itself.
Throws:
java.io.IOException
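
Because setDelimiter returns the builder itself, calls can be chained. The sketch below uses hypothetical stand-in classes (SketchBuilder, TsvSketchBuilder, ChainingSketch), not the Mobius classes, to show one way a self-typed base class such as AbstractDatasetBuilder<TSVDatasetBuilder> can keep the concrete builder type across inherited calls. That inherited methods like addInputPath return the concrete builder is an assumption; this page does not list their return types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; these are not the Mobius classes.
abstract class SketchBuilder<T extends SketchBuilder<T>> {
    protected final List<String> inputPaths = new ArrayList<>();

    // Subclasses return themselves so inherited methods keep the concrete type.
    protected abstract T self();

    T addInputPath(String path) {
        inputPaths.add(path);
        return self();
    }
}

final class TsvSketchBuilder extends SketchBuilder<TsvSketchBuilder> {
    private String delimiter = "\t"; // tab by default, mirroring the documented default

    @Override
    protected TsvSketchBuilder self() {
        return this;
    }

    TsvSketchBuilder setDelimiter(String delimiter) {
        this.delimiter = delimiter;
        return this; // returning the builder itself enables chaining
    }

    String describe() {
        return "delimiter=[" + delimiter + "] paths=" + inputPaths;
    }
}

final class ChainingSketch {
    public static void main(String[] args) {
        String summary = new TsvSketchBuilder()
                .addInputPath("/data/orders") // inherited, still typed as TsvSketchBuilder
                .setDelimiter(",")            // so the subclass method remains reachable
                .describe();
        System.out.println(summary); // prints delimiter=[,] paths=[/data/orders]
    }
}
```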

newDataset

protected Dataset newDataset(java.lang.String datasetName)
Description copied from class: AbstractDatasetBuilder
Create a new Dataset; the returned Dataset has no state at all (no paths, constraints, etc.).

Specified by:
newDataset in class AbstractDatasetBuilder<TSVDatasetBuilder>

setMapper

public TSVDatasetBuilder setMapper(java.lang.Class<? extends TSVMapper> mapper)
Change the mapper implementation (the default is TSVMapper); call this method when the parsing logic in TSVMapper doesn't meet your requirements.

Parameters:
mapper - the TSVMapper subclass to use in place of the default mapper.
Returns:
the TSVDatasetBuilder itself.

buildFromPreviousJob

public Dataset buildFromPreviousJob(org.apache.hadoop.mapred.JobConf prevJob,
                                    java.lang.Class<? extends org.apache.hadoop.mapred.FileOutputFormat> prevJobOutputFormat,
                                    java.lang.String[] schema)
                             throws java.io.IOException
Called by the Mobius engine to build a dataset from a previous Mobius job; users should not call this method.

Overrides:
buildFromPreviousJob in class AbstractDatasetBuilder<TSVDatasetBuilder>
Throws:
java.io.IOException