gov.llnl.ontology.text.hbase
Class GzipTarInputFormat

java.lang.Object
  extended by org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
      extended by org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
          extended by gov.llnl.ontology.text.hbase.GzipTarInputFormat

public class GzipTarInputFormat
extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>

A FileInputFormat for processing gzipped tarballs in which each internal file holds the data for a single document. It assumes that the input file, or files, are in raw text format, with one path to a gzipped tarball per line. Every entry in each gzipped tarball is treated as a single document. See the usage sketch below.

Author:
Keith Stevens
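
The following is a minimal job-configuration sketch showing how this input format might be wired into a MapReduce job. The listing path "/data/tarball-list.txt", the job name, and the pass-through mapper are illustrative assumptions, not part of this API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

import gov.llnl.ontology.text.hbase.GzipTarInputFormat;

public class GzipTarJobSetup {

    /** Trivial mapper illustrating the record types delivered by GzipTarInputFormat. */
    public static class PassThroughMapper
            extends Mapper<ImmutableBytesWritable, Text, ImmutableBytesWritable, Text> {
        // The inherited identity map() is sufficient for this sketch.
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "process gzipped tarballs");
        job.setJarByClass(GzipTarJobSetup.class);

        // Each tar entry inside each listed tarball becomes one map record.
        job.setInputFormatClass(GzipTarInputFormat.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0);

        // The input is a plain text file listing one gzipped tarball path per line;
        // "/data/tarball-list.txt" is an illustrative path, not part of the API.
        FileInputFormat.addInputPath(job, new Path("/data/tarball-list.txt"));

        // Discard output for this sketch; a real job would write somewhere useful.
        job.setOutputFormatClass(NullOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}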

Nested Class Summary
 class GzipTarInputFormat.GzipTarRecordReader
          A RecordReader for processing gzipped tarballs of document files.
 
Constructor Summary
GzipTarInputFormat()
           
 
Method Summary
 org.apache.hadoop.mapreduce.RecordReader createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)
          Returns a GzipTarInputFormat.GzipTarRecordReader.
 List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
          Returns a List of FileSplits.
 
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
addInputPath, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, isSplitable, listStatus, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

GzipTarInputFormat

public GzipTarInputFormat()
Method Detail

createRecordReader

public org.apache.hadoop.mapreduce.RecordReader createRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
                                                                   org.apache.hadoop.mapreduce.TaskAttemptContext context)
                                                            throws IOException,
                                                                   InterruptedException
Returns a GzipTarInputFormat.GzipTarRecordReader. The record reader emits one record for each file stored in the gzipped tarball.

Specified by:
createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
Throws:
IOException
InterruptedException
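
Because the format is parameterized as <ImmutableBytesWritable, Text>, a mapper consuming its records would look roughly like the sketch below. The assumption that the key bytes identify the tarred file, and the chosen output types, are illustrative only.

import java.io.IOException;

import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Hypothetical mapper showing the record types produced by GzipTarInputFormat:
 * one (ImmutableBytesWritable, Text) pair per file inside each gzipped tarball.
 */
public class DocumentLengthMapper
        extends Mapper<ImmutableBytesWritable, Text, Text, IntWritable> {

    @Override
    public void map(ImmutableBytesWritable key, Text document, Context context)
            throws IOException, InterruptedException {
        // The key is assumed to identify the tarred file; the value holds its full text.
        String docId = new String(key.get(), key.getOffset(), key.getLength());
        context.write(new Text(docId), new IntWritable(document.getLength()));
    }
}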

getSplits

public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
                                                       throws IOException
Returns a List of FileSplits. Each FileSplit corresponds to a single gzipped tarball of XML documents, and each tarred file within it should contain a single document.

Overrides:
getSplits in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
Throws:
IOException
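
As an illustration of the behavior described above (one split per gzipped tarball listed in the input file), the sketch below builds a FileSplit per listed path. It is not the actual implementation; the class and method names are hypothetical.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/** Illustrative helper: one FileSplit per tarball path listed in a text file. */
public class TarballSplitSketch {

    public static List<InputSplit> splitsFromListing(Configuration conf, Path listing)
            throws IOException {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        FileSystem fs = listing.getFileSystem(conf);
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(listing)));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                Path tarball = new Path(line.trim());
                long length = fs.getFileStatus(tarball).getLen();
                // Each gzipped tarball becomes one split covering the whole file.
                splits.add(new FileSplit(tarball, 0, length, new String[0]));
            }
        } finally {
            reader.close();
        }
        return splits;
    }
}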


Copyright © 2010-2011. All Rights Reserved.