gov.llnl.ontology.text.hbase
Class GzipXmlInputFormat
java.lang.Object
org.apache.hadoop.mapreduce.InputFormat<K,V>
org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
gov.llnl.ontology.text.hbase.GzipXmlInputFormat
public class GzipXmlInputFormat
- extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
A FileInputFormat for XML files that are gzipped. Before starting a job, call setXMLTags(org.apache.hadoop.mapreduce.Job, java.lang.String) to specify the text of the document-delimiting tags.
- Author:
- Keith Stevens
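The idea of delimiting documents by tag inside gzipped input can be illustrated with a minimal, standard-library-only sketch. This is not the implementation of GzipXmlInputFormat or GzipXmlRecordReader; the class name GzipXmlSketch, both method names, and the choice to pass a bare tag name (rather than the full tag text) are assumptions made purely for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; not part of gov.llnl.ontology.text.hbase.
final class GzipXmlSketch {

    /** Gzip-compresses a string in memory, standing in for a gzipped input file. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip stream is closed (and therefore finished) at this point.
        return bytes.toByteArray();
    }

    /**
     * Decompresses the input and returns every span from an opening
     * <tag> to the next closing </tag>, tags included.
     * Assumption: tags carry no attributes and are not nested.
     */
    static List<String> extractDocuments(byte[] gzipped, String tag) {
        String xml;
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            xml = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        List<String> docs = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = xml.indexOf(open, from);
            if (start < 0) {
                break; // no further opening tag
            }
            int end = xml.indexOf(close, start + open.length());
            if (end < 0) {
                break; // unterminated document; stop rather than guess
            }
            docs.add(xml.substring(start, end + close.length()));
            from = end + close.length();
        }
        return docs;
    }

    public static void main(String[] args) {
        byte[] input = gzip("<doc>first</doc>\n<doc>second</doc>");
        for (String doc : extractDocuments(input, "doc")) {
            System.out.println(doc);
        }
    }
}
```

The sketch reads the whole decompressed stream into memory for brevity; a real record reader would more plausibly scan the stream incrementally and emit one record per document.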
Method Summary
org.apache.hadoop.mapreduce.RecordReader | createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context) | Returns a GzipXmlRecordReader.
List<org.apache.hadoop.mapreduce.InputSplit> | getSplits(org.apache.hadoop.mapreduce.JobContext context) | Returns a List of FileSplits.
static void | setXMLTags(org.apache.hadoop.mapreduce.Job job, String delimiter) |
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
addInputPath, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, isSplitable, listStatus, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
GzipXmlInputFormat
public GzipXmlInputFormat()
setXMLTags
public static void setXMLTags(org.apache.hadoop.mapreduce.Job job,
String delimiter)
createRecordReader
public org.apache.hadoop.mapreduce.RecordReader createRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
org.apache.hadoop.mapreduce.TaskAttemptContext context)
throws IOException,
InterruptedException
- Returns a GzipXmlRecordReader. The record reader will return each tarred file.
- Specified by:
createRecordReader
in class org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
- Throws:
IOException
InterruptedException
getSplits
public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
throws IOException
- Returns a List of FileSplits. Each FileSplit will be a gzipped tarball of XML documents. Each tarred file should contain a single document.
- Overrides:
getSplits
in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
- Throws:
IOException
Copyright © 2010-2011. All Rights Reserved.