com.ebay.erl.mobius.core.builder
Class Dataset

java.lang.Object
  extended by com.ebay.erl.mobius.core.builder.Dataset
All Implemented Interfaces:
java.io.Serializable
Direct Known Subclasses:
SeqFileDataset, TSVDataset

public class Dataset
extends java.lang.Object
implements java.io.Serializable

Represents a type of data on the Hadoop cluster.

A dataset contains the following information:

An instance of Dataset is built by an implementation of AbstractDatasetBuilder.

This product is licensed under the Apache License, Version 2.0, available at http://www.apache.org/licenses/LICENSE-2.0. This product contains portions derived from Apache Hadoop, which is licensed under the Apache License, Version 2.0, available at http://hadoop.apache.org. © 2007 – 2012 eBay Inc., Evan Chiu, Woody Zhou, Neel Sundaresan

See Also:
Serialized Form

Field Summary
protected  java.util.ArrayList<ComputedColumns> computedColumns
          Stores the user-defined ComputedColumns of this dataset, if any.
protected  org.apache.hadoop.conf.Configuration conf
          Hadoop configuration.
protected  java.lang.Class<? extends org.apache.hadoop.mapred.InputFormat> input_format
          The InputFormat of this dataset, so Hadoop knows how to read this dataset.
protected  MobiusJob job
          The Mobius job that contains the analysis flow.
protected  java.lang.Class<? extends AbstractMobiusMapper> mapper
          The corresponding AbstractMobiusMapper implementation, which parses the records of this dataset into Tuple.
protected  java.lang.String name
          The name of this dataset.
protected  java.util.LinkedHashSet<java.lang.String> schema
          The schema of this Dataset, using LinkedHashSet to preserve the schema order.
protected  TupleCriterion tupleConstraint
          The tuple constraint.
 
Constructor Summary
protected Dataset(MobiusJob job, java.lang.String name)
           
 
Method Summary
protected  void addComputedColumn(ComputedColumns aComputedColumn)
          Add a ComputedColumns to this dataset.
 org.apache.hadoop.mapred.JobConf createJobConf(int jobSequenceNumber)
          Create a Hadoop JobConf that represents this dataset.
 boolean equals(java.lang.Object obj)
          Returns true only if obj is an instance of Dataset and its name, input format, mapper, and schema are all equal to those of this dataset.
 java.lang.String getDatasetID(int jobSequenceNumber)
          Get the ID for this dataset.
 java.lang.Class<? extends org.apache.hadoop.mapred.InputFormat> getInputFormat()
          Get the InputFormat of this dataset.
 java.util.List<org.apache.hadoop.fs.Path> getInputs()
          Get the input paths of this dataset.
 java.lang.Class<? extends AbstractMobiusMapper> getMapper()
          Get the AbstractMobiusMapper of this dataset.
 java.lang.String getName()
          Get the name of this dataset.
protected  java.util.LinkedHashSet<java.lang.String> getSchema()
          Get the schema of this Dataset.
 int hashCode()
           
protected  void initialize()
          The initializer; it is called every time a new Dataset instance is created by an AbstractDatasetBuilder.
 Dataset orderBy(java.lang.Class<? extends org.apache.hadoop.mapred.FileOutputFormat> outputformat, Sorter... sorters)
          Sort this Dataset by the given sorters.
 Dataset orderBy(org.apache.hadoop.fs.Path output, java.lang.Class<? extends org.apache.hadoop.mapred.FileOutputFormat> outputformat, Sorter... sorters)
          Sort this Dataset by the given sorters.
 Dataset orderBy(org.apache.hadoop.fs.Path output, Sorter... sorters)
          Sort this Dataset by the given sorters.
 Dataset orderBy(Sorter... sorters)
          Sort this Dataset by the given sorters.
protected  void setInputFormat(java.lang.Class<? extends org.apache.hadoop.mapred.InputFormat> input_format)
          Specifies the InputFormat of this dataset.
protected  void setMapper(java.lang.Class<? extends AbstractMobiusMapper> mapper)
          Set the AbstractMobiusMapper for this dataset.
protected  void setSchema(java.lang.String... schema)
          Specifies the schema of this dataset.
 java.lang.String toString()
          Returns a string containing the name of this dataset and its schema.
protected  void validate()
          Validates that this dataset has all the required parameters.
 boolean withinSchema(java.lang.String aColumn)
          Checks whether the given column is defined in this dataset.
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

input_format

protected java.lang.Class<? extends org.apache.hadoop.mapred.InputFormat> input_format
The InputFormat of this dataset, so Hadoop knows how to read this dataset.


mapper

protected java.lang.Class<? extends AbstractMobiusMapper> mapper
The corresponding AbstractMobiusMapper implementation, which parses the records of this dataset into Tuple.


tupleConstraint

protected TupleCriterion tupleConstraint
The tuple constraint.

If specified, only tuples that pass this constraint will be emitted.
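
As a rough illustration of the behavior described above, the following sketch filters a list of tuples through an optional constraint. It does not use the Mobius TupleCriterion API: the class name, the use of java.util.function.Predicate as the constraint, and the Map-based tuple are all assumptions made only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in, not part of Mobius: a tuple is modeled as a Map and
// the constraint as a Predicate, purely to illustrate "only passing tuples are emitted".
final class TupleConstraintSketch {

    // Returns the tuples that would be emitted. A null constraint means
    // "no constraint specified", so every tuple is emitted.
    static List<Map<String, Object>> emit(List<Map<String, Object>> tuples,
                                          Predicate<Map<String, Object>> constraint) {
        List<Map<String, Object>> emitted = new ArrayList<>();
        for (Map<String, Object> tuple : tuples) {
            if (constraint == null || constraint.test(tuple)) {
                emitted.add(tuple);
            }
        }
        return emitted;
    }
}
```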


conf

protected transient org.apache.hadoop.conf.Configuration conf
Hadoop configuration.


computedColumns

protected transient java.util.ArrayList<ComputedColumns> computedColumns
Stores the user-defined ComputedColumns of this dataset, if any.


schema

protected java.util.LinkedHashSet<java.lang.String> schema
The schema of this Dataset, using LinkedHashSet to preserve the schema order.


name

protected java.lang.String name
The name of this dataset.


job

protected transient MobiusJob job
The Mobius job that contains the analysis flow.

Constructor Detail

Dataset

protected Dataset(MobiusJob job,
                  java.lang.String name)
Method Detail

getSchema

protected java.util.LinkedHashSet<java.lang.String> getSchema()
Get the schema of this Dataset.

The returned set is a LinkedHashSet, so its iteration order is the order in which the columns were inserted.
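
The ordering guarantee comes from java.util.LinkedHashSet itself. The sketch below shows it with the standard library only; the class name, helper methods, and column names are illustrative and not part of Mobius.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Illustrative only: shows why a LinkedHashSet suits an ordered schema.
final class SchemaOrderSketch {

    // Collects column names in the order given; a repeated name keeps its first position.
    static LinkedHashSet<String> schemaOf(String... columns) {
        return new LinkedHashSet<>(Arrays.asList(columns));
    }

    // Copies the schema into a list, following the set's iteration (insertion) order.
    static List<String> columnOrder(LinkedHashSet<String> schema) {
        return new ArrayList<>(schema);
    }

    public static void main(String[] args) {
        LinkedHashSet<String> schema = schemaOf("USER_ID", "ITEM_ID", "PRICE", "USER_ID");
        System.out.println(columnOrder(schema)); // prints [USER_ID, ITEM_ID, PRICE]
    }
}
```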


createJobConf

public org.apache.hadoop.mapred.JobConf createJobConf(int jobSequenceNumber)
                                               throws java.io.IOException
Create a Hadoop JobConf that represents this dataset.

This method is called by Mobius.

Throws:
java.io.IOException

getDatasetID

public java.lang.String getDatasetID(int jobSequenceNumber)
Get the ID for this dataset.

A dataset ID is composed of a two-digit integer (derived from the jobSequenceNumber) and the name of the dataset.

This method is used by the Mobius engine only.
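
A minimal sketch of how such an ID could be formed is shown below. The exact layout Mobius uses is not stated here, so the underscore separator, the zero padding, and the class and method names are assumptions for illustration only.

```java
import java.util.Locale;

// Hypothetical helper, not the Mobius implementation.
final class DatasetIdSketch {

    // Zero-pads the sequence number to two digits and appends the dataset name.
    // The "_" separator is an assumption; the real format may differ.
    static String datasetId(int jobSequenceNumber, String name) {
        return String.format(Locale.ROOT, "%02d_%s", jobSequenceNumber, name);
    }

    public static void main(String[] args) {
        System.out.println(datasetId(3, "orders")); // prints 03_orders
    }
}
```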


setSchema

protected void setSchema(java.lang.String... schema)
Specifies the schema of this dataset.


addComputedColumn

protected void addComputedColumn(ComputedColumns aComputedColumn)
Add a ComputedColumns to this dataset.

This method is called by an implementation of AbstractDatasetBuilder.


initialize

protected void initialize()
The initializer; it is called every time a new Dataset instance is created by an AbstractDatasetBuilder.


setInputFormat

protected void setInputFormat(java.lang.Class<? extends org.apache.hadoop.mapred.InputFormat> input_format)
Specifies the InputFormat of this dataset. This method is called by the corresponding implementation of AbstractDatasetBuilder.


setMapper

protected void setMapper(java.lang.Class<? extends AbstractMobiusMapper> mapper)
Set the AbstractMobiusMapper for this dataset.

This method is called by the corresponding implementation of AbstractDatasetBuilder.


getInputs

public java.util.List<org.apache.hadoop.fs.Path> getInputs()
Get the input paths of this dataset.

Paths are specified by the user during the dataset building process.


getMapper

public java.lang.Class<? extends AbstractMobiusMapper> getMapper()
Get the AbstractMobiusMapper of this dataset.


getInputFormat

public java.lang.Class<? extends org.apache.hadoop.mapred.InputFormat> getInputFormat()
Get the InputFormat of this dataset.


withinSchema

public boolean withinSchema(java.lang.String aColumn)
Checks whether the given column is defined in this dataset.

Parameters:
aColumn - the name of a column.
Returns:
true if aColumn is defined in this dataset (case-insensitive), false otherwise.
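
The case-insensitive lookup can be pictured with the sketch below. It is not the Mobius source: the class name, the static helper, and the linear scan with String.equalsIgnoreCase are assumptions chosen to keep the example short.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative only: one possible way to test schema membership without regard to case.
final class WithinSchemaSketch {

    // Returns true if any schema column matches aColumn, ignoring case.
    static boolean withinSchema(Set<String> schema, String aColumn) {
        for (String column : schema) {
            if (column.equalsIgnoreCase(aColumn)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> schema = new LinkedHashSet<>(Arrays.asList("USER_ID", "PRICE"));
        System.out.println(withinSchema(schema, "user_id")); // prints true
        System.out.println(withinSchema(schema, "country")); // prints false
    }
}
```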

validate

protected void validate()
Validates that this dataset has all the required parameters.


getName

public java.lang.String getName()
Get the name of this dataset.

The name of a dataset is specified during the dataset building process.


toString

public java.lang.String toString()
Returns a string containing the name of this dataset and its schema.

Overrides:
toString in class java.lang.Object

equals

public boolean equals(java.lang.Object obj)
Returns true only if obj is an instance of Dataset and its name, input format, mapper, and schema are all equal to those of this dataset. Returns false otherwise.

Overrides:
equals in class java.lang.Object
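
The equality rule above, together with a matching hashCode, might look like the following sketch. It is a stand-in class, not the Dataset source: the class name and constructor are hypothetical, and plain Class<?> fields replace the Hadoop and Mobius types so the example compiles on its own.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Objects;

// Hypothetical stand-in for Dataset, comparing the four attributes named above.
final class DatasetEqualitySketch {
    private final String name;
    private final Class<?> inputFormat; // stands in for the InputFormat class
    private final Class<?> mapper;      // stands in for the AbstractMobiusMapper class
    private final LinkedHashSet<String> schema;

    DatasetEqualitySketch(String name, Class<?> inputFormat, Class<?> mapper, String... schema) {
        this.name = name;
        this.inputFormat = inputFormat;
        this.mapper = mapper;
        this.schema = new LinkedHashSet<>(Arrays.asList(schema));
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof DatasetEqualitySketch)) return false;
        DatasetEqualitySketch other = (DatasetEqualitySketch) obj;
        // Note: Set equality ignores order, so only the set of column names is compared here.
        return Objects.equals(name, other.name)
                && Objects.equals(inputFormat, other.inputFormat)
                && Objects.equals(mapper, other.mapper)
                && Objects.equals(schema, other.schema);
    }

    @Override
    public int hashCode() {
        // Uses the same four attributes as equals, so equal objects share a hash code.
        return Objects.hash(name, inputFormat, mapper, schema);
    }
}
```

Deriving hashCode from the same attributes as equals keeps the two methods consistent, which matters when such objects are used as keys in hash-based collections.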

orderBy

public Dataset orderBy(Sorter... sorters)
                throws java.io.IOException
Sort this Dataset by the given sorters.

Throws:
java.io.IOException

orderBy

public Dataset orderBy(java.lang.Class<? extends org.apache.hadoop.mapred.FileOutputFormat> outputformat,
                       Sorter... sorters)
                throws java.io.IOException
Sort this Dataset by the given sorters.

Throws:
java.io.IOException

orderBy

public Dataset orderBy(org.apache.hadoop.fs.Path output,
                       Sorter... sorters)
                throws java.io.IOException
Sort this Dataset by the given sorters.

Throws:
java.io.IOException

orderBy

public Dataset orderBy(org.apache.hadoop.fs.Path output,
                       java.lang.Class<? extends org.apache.hadoop.mapred.FileOutputFormat> outputformat,
                       Sorter... sorters)
                throws java.io.IOException
Sort this Dataset by the given sorters.

Throws:
java.io.IOException

hashCode

public int hashCode()
Overrides:
hashCode in class java.lang.Object