gov.llnl.ontology.text.hbase
Class GzipTarInputFormat
java.lang.Object
org.apache.hadoop.mapreduce.InputFormat<K,V>
org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
gov.llnl.ontology.text.hbase.GzipTarInputFormat
public class GzipTarInputFormat
extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
A FileInputFormat for gzipped tarballs in which each internal file holds the
data for a single document. The file or files given to the job as input must
be plain text and list the path of one gzipped tarball per line. Each entry in
a listed tarball is treated as a single document.
- Author:
- Keith Stevens
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
addInputPath, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, isSplitable, listStatus, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
GzipTarInputFormat
public GzipTarInputFormat()
createRecordReader
public org.apache.hadoop.mapreduce.RecordReader createRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
org.apache.hadoop.mapreduce.TaskAttemptContext context)
throws IOException,
InterruptedException
- Returns a GzipTarInputFormat.GzipTarRecordReader. The record reader returns
each tarred file as one record.
- Specified by:
createRecordReader
in class org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
- Throws:
IOException
InterruptedException
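The internals of GzipTarRecordReader are not documented on this page. As a rough illustration of what "return each tarred file" involves, the following JDK-only sketch walks the entries of a gzipped tarball and collects each regular file as one document. The class name TarGzDocumentsSketch and its helper methods are hypothetical and not part of this package. The sketch assumes short entry names and UTF-8 text, does not verify header checksums, and ignores long-name extensions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch: one document per entry of a gzipped tarball. */
public class TarGzDocumentsSketch {
    private static final int BLOCK = 512; // tar archives are made of 512-byte blocks

    /** Reads every regular-file entry into an ordered name -> text map. */
    public static Map<String, String> readDocuments(InputStream gzippedTar) throws IOException {
        Map<String, String> docs = new LinkedHashMap<>();
        DataInputStream in = new DataInputStream(new GZIPInputStream(gzippedTar));
        byte[] header = new byte[BLOCK];
        while (true) {
            try {
                in.readFully(header);
            } catch (EOFException e) {
                break; // stream ended without an end-of-archive marker
            }
            if (header[0] == 0) {
                break; // a zero-filled block marks the end of the archive
            }
            String name = field(header, 0, 100);                           // entry name
            int size = Integer.parseInt(field(header, 124, 12).trim(), 8); // octal size
            byte type = header[156];                                       // type flag
            byte[] body = new byte[size];
            in.readFully(body);
            in.readFully(new byte[(BLOCK - size % BLOCK) % BLOCK]); // skip block padding
            if (type == '0' || type == 0) { // regular files only
                docs.put(name, new String(body, StandardCharsets.UTF_8));
            }
        }
        return docs;
    }

    /** Decodes a NUL-terminated ASCII header field. */
    private static String field(byte[] block, int offset, int length) {
        int end = offset;
        while (end < offset + length && block[end] != 0) {
            end++;
        }
        return new String(block, offset, end - offset, StandardCharsets.US_ASCII);
    }

    /** Demo fixture: builds one minimal tar entry (header block plus padded body). */
    public static byte[] entry(String name, String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[BLOCK + (body.length + BLOCK - 1) / BLOCK * BLOCK];
        put(out, 0, name);
        put(out, 124, String.format("%011o", body.length));
        out[156] = '0';
        Arrays.fill(out, 148, 156, (byte) ' '); // checksum field counts as spaces
        int sum = 0;
        for (int i = 0; i < BLOCK; i++) {
            sum += out[i] & 0xff;
        }
        put(out, 148, String.format("%06o", sum));
        out[154] = 0; // checksum layout: six octal digits, NUL, space
        System.arraycopy(body, 0, out, BLOCK, body.length);
        return out;
    }

    private static void put(byte[] block, int offset, String s) {
        byte[] b = s.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(b, 0, block, offset, b.length);
    }

    /** Demo fixture: concatenates entries, appends the end marker, and gzips. */
    public static byte[] gzip(byte[]... entries) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            for (byte[] e : entries) {
                gz.write(e);
            }
            gz.write(new byte[2 * BLOCK]); // two zero blocks end the archive
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] archive = gzip(entry("a.xml", "<doc>first</doc>"),
                              entry("b.xml", "<doc>second</doc>"));
        readDocuments(new ByteArrayInputStream(archive))
            .forEach((name, text) -> System.out.println(name + " -> " + text));
    }
}
```

In the real class the records are typed as ImmutableBytesWritable keys and Text values; what the key holds is not stated here, so the sketch simply uses the entry name.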
getSplits
public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
throws IOException
- Returns a List of FileSplits. Each FileSplit is a gzipped tarball of XML
documents. Each tarred file should contain a single document.
- Overrides:
getSplits
in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
- Throws:
IOException
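The one-path-per-line input convention behind getSplits can be sketched with the JDK alone. The example below turns a manifest into one tarball path per line, which is the unit that getSplits reports as a FileSplit. The class name ManifestSplitSketch and the sample paths are hypothetical, and skipping blank lines is an assumption rather than documented behaviour of this class.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: one split per tarball path listed in a manifest. */
public class ManifestSplitSketch {

    /** Returns the tarball paths in manifest order, one per non-blank line. */
    public static List<String> tarballPaths(Reader manifest) throws IOException {
        List<String> paths = new ArrayList<>();
        try (BufferedReader lines = new BufferedReader(manifest)) {
            String line;
            while ((line = lines.readLine()) != null) {
                String path = line.trim();
                if (!path.isEmpty()) { // assumption: blank lines are ignored
                    paths.add(path);
                }
            }
        }
        return paths;
    }

    public static void main(String[] args) throws IOException {
        // Sample manifest content; the paths are made up for illustration.
        String manifest = "/corpus/part-000.tar.gz\n/corpus/part-001.tar.gz\n";
        for (String path : tarballPaths(new StringReader(manifest))) {
            System.out.println("split for " + path);
        }
    }
}
```

In a Hadoop job, each such path would presumably be wrapped in a FileSplit so that a single task handles a whole tarball; a gzip stream cannot be read from an arbitrary offset, so dividing one tarball among several tasks would not work.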
Copyright © 2010-2011. All Rights Reserved.