gov.llnl.ontology.text.hbase
Class GzipTarInputFormat
java.lang.Object
org.apache.hadoop.mapreduce.InputFormat<K,V>
org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
gov.llnl.ontology.text.hbase.GzipTarInputFormat
public class GzipTarInputFormat
- extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
A FileInputFormat for gzipped tarballs in which each internal file holds the
data for a single document. The input file or files given to the job must be
plain text and list one gzipped tarball path per line. Each entry in a gzipped
tarball is treated as a single document.
- Author:
- Keith Stevens
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat:
addInputPath, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, isSplitable, listStatus, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize
Methods inherited from class java.lang.Object:
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
GzipTarInputFormat
public GzipTarInputFormat()
createRecordReader
public org.apache.hadoop.mapreduce.RecordReader createRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
org.apache.hadoop.mapreduce.TaskAttemptContext context)
throws IOException,
InterruptedException
- Returns a GzipTarInputFormat.GzipTarRecordReader. The record reader returns
each file in the tarball as one record.
- Specified by:
createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
- Throws:
IOException
InterruptedException
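The per-entry behaviour described for createRecordReader can be illustrated with a minimal sketch that uses only the Java standard library. It is not the source of GzipTarRecordReader, which this page does not show. The class TarGzSketch and its methods readEntries and writeArchive are hypothetical names, and the header parsing assumes plain ustar-style entries with names shorter than 100 bytes. writeArchive exists only to build sample input; it leaves the header checksum empty, so standard tar tools may reject its output.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration; not the actual GzipTarRecordReader.
public class TarGzSketch {
    // Tar headers are 512 bytes and entry data is padded to 512-byte blocks.
    private static final int BLOCK = 512;

    /** Reads each regular file in a gzipped tar stream into a name -> text map. */
    public static Map<String, String> readEntries(InputStream gzippedTar) throws IOException {
        Map<String, String> documents = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new GZIPInputStream(gzippedTar))) {
            byte[] header = new byte[BLOCK];
            while (true) {
                in.readFully(header);
                if (header[0] == 0) {
                    break; // a header starting with NUL is treated as end of archive
                }
                String name = field(header, 0, 100);
                int size = Integer.parseInt(field(header, 124, 12), 8); // size is octal text
                byte[] body = new byte[size];
                in.readFully(body);
                in.readFully(new byte[(BLOCK - size % BLOCK) % BLOCK]); // skip block padding
                byte type = header[156];
                if (type == '0' || type == 0) { // keep regular files only
                    documents.put(name, new String(body, StandardCharsets.UTF_8));
                }
            }
        }
        return documents;
    }

    /** Extracts a NUL-terminated ASCII header field and trims surrounding spaces. */
    private static String field(byte[] header, int offset, int length) {
        int end = offset;
        while (end < offset + length && header[end] != 0) {
            end++;
        }
        return new String(header, offset, end - offset, StandardCharsets.US_ASCII).trim();
    }

    /** Sample-data helper: writes a simplified gzipped tar (no checksum, mode, or owner). */
    public static byte[] writeArchive(Map<String, String> docs) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            for (Map.Entry<String, String> doc : docs.entrySet()) {
                byte[] body = doc.getValue().getBytes(StandardCharsets.UTF_8);
                byte[] header = new byte[BLOCK];
                put(header, 0, doc.getKey());
                put(header, 124, String.format("%011o", body.length));
                header[156] = '0';
                out.write(header);
                out.write(body);
                out.write(new byte[(BLOCK - body.length % BLOCK) % BLOCK]);
            }
            out.write(new byte[2 * BLOCK]); // two zero blocks close the archive
        }
        return bytes.toByteArray();
    }

    private static void put(byte[] header, int offset, String value) {
        byte[] raw = value.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(raw, 0, header, offset, raw.length);
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc1.xml", "<doc>first</doc>");
        docs.put("doc2.xml", "<doc>second</doc>");
        byte[] archive = writeArchive(docs);
        for (Map.Entry<String, String> e
                : readEntries(new ByteArrayInputStream(archive)).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The sketch buffers every entry in memory for brevity. A record reader would more plausibly hand entries to the framework one at a time, which the loop structure above allows with little change.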
getSplits
public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
throws IOException
- Returns a List of FileSplits. Each FileSplit is a gzipped tarball of XML
documents. Each file in the tarball should contain a single document.
- Overrides:
getSplits in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
- Throws:
IOException
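Taken together with the class description, getSplits appears to map each line of a plain-text manifest to one split covering a whole tarball. The following is a minimal standard-library sketch of that mapping, not the actual implementation. ManifestSketch, TarballSplit, and splitsFor are hypothetical names, TarballSplit is only a stand-in for Hadoop's FileSplit, the paths are invented, and skipping blank lines is an assumption rather than documented behaviour.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not the actual getSplits implementation.
public class ManifestSketch {
    /** Stand-in for a FileSplit: one whole gzipped tarball, identified by its path. */
    public static final class TarballSplit {
        public final String path;

        TarballSplit(String path) {
            this.path = path;
        }
    }

    /** Produces one split per non-blank manifest line (blank-line skipping is assumed). */
    public static List<TarballSplit> splitsFor(Path manifest) throws IOException {
        List<TarballSplit> splits = new ArrayList<>();
        for (String line : Files.readAllLines(manifest, StandardCharsets.UTF_8)) {
            String path = line.trim();
            if (!path.isEmpty()) {
                splits.add(new TarballSplit(path));
            }
        }
        return splits;
    }

    public static void main(String[] args) throws IOException {
        // Invented example paths; a real manifest would list existing tarballs.
        Path manifest = Files.createTempFile("tarballs", ".txt");
        Files.write(manifest,
                Arrays.asList("/data/part-0.tar.gz", "/data/part-1.tar.gz"),
                StandardCharsets.UTF_8);
        for (TarballSplit split : splitsFor(manifest)) {
            System.out.println(split.path);
        }
        Files.delete(manifest);
    }
}
```

Treating each tarball as a single unit is consistent with gzip streams not being splittable at arbitrary offsets, so a tarball cannot usefully be divided among several tasks.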
Copyright © 2010-2011. All Rights Reserved.