gov.llnl.ontology.mapreduce.table
Class TrinidadTable

java.lang.Object
  extended by gov.llnl.ontology.mapreduce.table.TrinidadTable
All Implemented Interfaces:
CorpusTable, GenericTable
Direct Known Subclasses:
NYT03Table

public class TrinidadTable
extends Object
implements CorpusTable

Author:
Keith Stevens

Field Summary
static String ALL_CORPORA
          A marker to request all corpora types when scanning.
static String ANNOTATION_CF
          The column family for the document annotations.
static String ANNOTATION_SENTENCE
          The column qualifier for the sentence level document annotations.
static String ANNOTATION_TOKEN
          The column qualifier for the token level document annotations.
static String CATEGORY_COLUMN
          The column name for categories that a document may fall under, if any.
static String DOC_ID
          The column name for the document id.
static String DOC_KEY
          The column name for the document key.
static String LABEL_CF
          The column family for word list labels associated with each document.
static String META_CF
          The column family for word list labels associated with each document.
static String SENSE_SENTENCE_PREFIX
          The column qualifier prefix for sentence level word sense annotations.
static String SENSE_TOKEN_PREFIX
          The column qualifier prefix for token level word sense annotations.
static String SOURCE_CF
          The column family for source related columns.
static String SOURCE_ID
          The column qualifier for the corpus id.
static String SOURCE_IDCOL
          The full column qualifier for the corpus id.
static String SOURCE_NAME
          The column qualifier for the corpus source name.
static String SOURCE_NAMECOL
          The full column qualifier for the corpus source name.
static String TABLE_NAME
          The official table name.
static String TEXT_CF
          The column family for the text columns.
static String TEXT_ORIGINAL
          The column qualifier for the original document text.
static String TEXT_ORIGINAL_COL
          The full column qualifier for the original document text.
static String TEXT_RAW
          The column qualifier for the cleaned document text.
static String TEXT_RAW_COL
          The full column qualifier for the cleaned document text.
static String TEXT_TITLE
          The column qualifier for the document title.
static String TEXT_TITLE_COL
          The full column qualifier for the document title.
static String TEXT_TYPE
          The column qualifier for the text type.
static String TEXT_TYPE_COL
          The full column qualifier for the text type.
static String XML_MIME_TYPE
          Stores the text type of any document.
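The summary above distinguishes a "column qualifier" (for example TEXT_RAW) from a "full column qualifier" (for example TEXT_RAW_COL). The constant values are not shown on this page, so the following is only a sketch of the usual HBase "family:qualifier" naming convention that the _COL constants presumably follow. The ColumnNames helper and the placeholder values "text" and "raw" are assumptions made for illustration and are not part of TrinidadTable.

```java
// Hypothetical helper for illustration only; not part of TrinidadTable.
final class ColumnNames {
    private ColumnNames() {}

    /** Joins a column family and a qualifier using the "family:qualifier" convention. */
    static String fullColumn(String family, String qualifier) {
        return family + ":" + qualifier;
    }

    /** Splits a full column name back into {family, qualifier} at the first ':'. */
    static String[] split(String fullColumn) {
        int sep = fullColumn.indexOf(':');
        if (sep < 0) {
            throw new IllegalArgumentException("no family separator in: " + fullColumn);
        }
        return new String[] { fullColumn.substring(0, sep), fullColumn.substring(sep + 1) };
    }

    public static void main(String[] args) {
        // "text" and "raw" are placeholder values, not the real TEXT_CF / TEXT_RAW constants.
        String full = fullColumn("text", "raw");
        String[] parts = split(full);
        System.out.println(full + " = family '" + parts[0] + "' + qualifier '" + parts[1] + "'");
    }
}
```

If the assumption holds, TEXT_RAW_COL would correspond to fullColumn(TEXT_CF, TEXT_RAW); the Constant Field Values page is the place to confirm the actual strings.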
 
Constructor Summary
TrinidadTable()
          Creates a new TrinidadTable that uses the default .
 
Method Summary
 void close()
          Closes the connection to the document reader.
 void createTable()
          Creates a new instance of the HTable represented by this GenericTable.
 void createTable(org.apache.hadoop.hbase.client.HConnection connector)
          Creates a new instance of the HTable represented by this GenericTable.
 Document document(org.apache.hadoop.hbase.client.Result row)
          Returns the Document associated with this row.
 Set<String> getCategories(org.apache.hadoop.hbase.client.Result row)
          Returns the set of categories associated with the document in row.
 String getLabel(org.apache.hadoop.hbase.client.Result row, String labelName)
          Returns the label associated with column labelName inside of row.
 Iterator<org.apache.hadoop.hbase.client.Result> iterator(org.apache.hadoop.hbase.client.Scan scan)
          Returns an iterator over all of the rows accessible from this GenericTable.
 void markRowAsProcessed(org.apache.hadoop.hbase.io.ImmutableBytesWritable key, org.apache.hadoop.hbase.client.Result row)
          Marks the row indexed by key as having been processed.
 void put(Document document)
          Stores the text of Document in this CorpusTable.
 void put(org.apache.hadoop.hbase.io.ImmutableBytesWritable key, List<Sentence> sentences)
          Stores the List of Sentences in this table.
 void putCategories(org.apache.hadoop.hbase.io.ImmutableBytesWritable key, Set<String> categories)
          Stores the categories associated with the document indexed by key.
 void putLabel(org.apache.hadoop.hbase.io.ImmutableBytesWritable key, String labelName, String labelValue)
          Stores the labelValue in the column specified by labelName in the row indexed by key.
 void putSenses(org.apache.hadoop.hbase.io.ImmutableBytesWritable key, List<Sentence> sentences, String senseLabel)
          Stores the List of Sentences containing only word senses in this table.
 List<Sentence> sentences(org.apache.hadoop.hbase.client.Result row)
          Returns the List of Sentences stored in row.
 void setupScan(org.apache.hadoop.hbase.client.Scan scan)
          Initializes a Scan such that it will request whatever columns and column families are necessary for processing, as determined by the table type.
 void setupScan(org.apache.hadoop.hbase.client.Scan scan, String corpusName)
          Initializes a Scan such that it will request the columns and column families that are necessary for extracting the raw document text, dependency trees, and document source information from the specified corpusName.
 boolean shouldProcessRow(org.apache.hadoop.hbase.client.Result row)
          Returns true if the given row should be processed.
 String sourceCorpus(org.apache.hadoop.hbase.client.Result row)
          Returns the source corpus that this row contains.
 org.apache.hadoop.hbase.client.HTable table()
          Returns the HTable instance attached to this GenericTable.
 String tableName()
          Returns the name of the HBase Table that this GenericTable represents.
 String text(org.apache.hadoop.hbase.client.Result row)
          Returns the cleaned text stored by the given row.
 String textSource(org.apache.hadoop.hbase.client.Result row)
          Returns the raw document text stored in row.
 String title(org.apache.hadoop.hbase.client.Result row)
          Returns the title of the document stored in row.
 List<Sentence> wordSenses(org.apache.hadoop.hbase.client.Result row, String senseLabel)
          Returns the List of Sentences stored in row that correspond to the word senses created with senseLabel.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

XML_MIME_TYPE

public static final String XML_MIME_TYPE
Stores the text type of any document.

See Also:
Constant Field Values

ALL_CORPORA

public static final String ALL_CORPORA
A marker to request all corpora types when scanning.

See Also:
Constant Field Values

TABLE_NAME

public static final String TABLE_NAME
The official table name.

See Also:
Constant Field Values

SOURCE_CF

public static final String SOURCE_CF
The column family for source related columns.

See Also:
Constant Field Values

SOURCE_NAME

public static final String SOURCE_NAME
The column qualifier for the corpus source name.

See Also:
Constant Field Values

SOURCE_NAMECOL

public static final String SOURCE_NAMECOL
The full column qualifier for the corpus source name.

See Also:
Constant Field Values

SOURCE_ID

public static final String SOURCE_ID
The column qualifier for the corpus id.

See Also:
Constant Field Values

SOURCE_IDCOL

public static final String SOURCE_IDCOL
The full column qualifier for the corpus id.

See Also:
Constant Field Values

TEXT_CF

public static final String TEXT_CF
The column family for the text columns.

See Also:
Constant Field Values

TEXT_ORIGINAL

public static final String TEXT_ORIGINAL
The column qualifier for the original document text.

See Also:
Constant Field Values

TEXT_ORIGINAL_COL

public static final String TEXT_ORIGINAL_COL
The full column qualifier for the original document text.

See Also:
Constant Field Values

TEXT_TYPE

public static final String TEXT_TYPE
The column qualifier for the text type.

See Also:
Constant Field Values

TEXT_TYPE_COL

public static final String TEXT_TYPE_COL
The full column qualifier for the text type.

See Also:
Constant Field Values

TEXT_RAW

public static final String TEXT_RAW
The column qualifier for the cleaned document text.

See Also:
Constant Field Values

TEXT_RAW_COL

public static final String TEXT_RAW_COL
The full column qualifier for the cleaned document text.

See Also:
Constant Field Values

TEXT_TITLE

public static final String TEXT_TITLE
The column qualifier for the document title.

See Also:
Constant Field Values

TEXT_TITLE_COL

public static final String TEXT_TITLE_COL
The full column qualifier for the document title.

See Also:
Constant Field Values

ANNOTATION_CF

public static final String ANNOTATION_CF
The column family for the document annotations.

See Also:
Constant Field Values

ANNOTATION_SENTENCE

public static final String ANNOTATION_SENTENCE
The column qualifier for the sentence level document annotations.

See Also:
Constant Field Values

ANNOTATION_TOKEN

public static final String ANNOTATION_TOKEN
The column qualifier for the token level document annotations.

See Also:
Constant Field Values

SENSE_SENTENCE_PREFIX

public static final String SENSE_SENTENCE_PREFIX
The column qualifier prefix for sentence level word sense annotations.

See Also:
Constant Field Values

SENSE_TOKEN_PREFIX

public static final String SENSE_TOKEN_PREFIX
The column qualifier prefix for token level word sense annotations.

See Also:
Constant Field Values

LABEL_CF

public static final String LABEL_CF
The column family for word list labels associated with each document.

See Also:
Constant Field Values

META_CF

public static final String META_CF
The column family for word list labels associated with each document.

See Also:
Constant Field Values

CATEGORY_COLUMN

public static final String CATEGORY_COLUMN
The column name for categories that a document may fall under, if any.

See Also:
Constant Field Values

DOC_KEY

public static final String DOC_KEY
The column name for the document key.

See Also:
Constant Field Values

DOC_ID

public static final String DOC_ID
The column name for the document id.

See Also:
Constant Field Values

Constructor Detail

TrinidadTable

public TrinidadTable()
Creates a new TrinidadTable that uses the default .

Method Detail

createTable

public void createTable()
Creates a new instance of the HTable represented by this GenericTable.

Specified by:
createTable in interface GenericTable

createTable

public void createTable(org.apache.hadoop.hbase.client.HConnection connector)
Creates a new instance of the HTable represented by this GenericTable.

Specified by:
createTable in interface GenericTable

setupScan

public void setupScan(org.apache.hadoop.hbase.client.Scan scan)
Initializes a Scan such that it will request whatever columns and column families are necessary for processing, as determined by the table type. This method will only be called once per job.

Specified by:
setupScan in interface GenericTable

setupScan

public void setupScan(org.apache.hadoop.hbase.client.Scan scan,
                      String corpusName)
Initializes a Scan such that it will request the columns and column families that are necessary for extracting the raw document text, dependency trees, and document source information from the specified corpusName.

Specified by:
setupScan in interface GenericTable

iterator

public Iterator<org.apache.hadoop.hbase.client.Result> iterator(org.apache.hadoop.hbase.client.Scan scan)
Returns an iterator over all of the rows accessible from this GenericTable.

Specified by:
iterator in interface GenericTable

tableName

public String tableName()
Returns the name of the HBase Table that this GenericTable represents.

Specified by:
tableName in interface GenericTable

table

public org.apache.hadoop.hbase.client.HTable table()
Returns the HTable instance attached to this GenericTable.

Specified by:
table in interface GenericTable

text

public String text(org.apache.hadoop.hbase.client.Result row)
Returns the cleaned text stored by the given row.

Specified by:
text in interface CorpusTable

textSource

public String textSource(org.apache.hadoop.hbase.client.Result row)
Returns the raw document text stored in row.

Specified by:
textSource in interface CorpusTable

sourceCorpus

public String sourceCorpus(org.apache.hadoop.hbase.client.Result row)
Returns the source corpus that this row contains.

Specified by:
sourceCorpus in interface CorpusTable

title

public String title(org.apache.hadoop.hbase.client.Result row)
Returns the title of the document stored in row.

Specified by:
title in interface CorpusTable

sentences

public List<Sentence> sentences(org.apache.hadoop.hbase.client.Result row)
Returns the List of Sentences stored in row. This call will include all annotations requested in the setup call to GenericTable.setupScan(org.apache.hadoop.hbase.client.Scan).

Specified by:
sentences in interface CorpusTable

wordSenses

public List<Sentence> wordSenses(org.apache.hadoop.hbase.client.Result row,
                                 String senseLabel)
Returns the List of Sentences stored in row that correspond to the word senses created with senseLabel.

Specified by:
wordSenses in interface CorpusTable

document

public Document document(org.apache.hadoop.hbase.client.Result row)
Returns the Document associated with this row.

Specified by:
document in interface CorpusTable

put

public void put(Document document)
Stores the text of Document in this CorpusTable.

Specified by:
put in interface CorpusTable

put

public void put(org.apache.hadoop.hbase.io.ImmutableBytesWritable key,
                List<Sentence> sentences)
Stores the List of Sentences in this table. Implementations are free to store this List as a complete object or as a separate set of smaller Annotations.

Specified by:
put in interface CorpusTable

putSenses

public void putSenses(org.apache.hadoop.hbase.io.ImmutableBytesWritable key,
                      List<Sentence> sentences,
                      String senseLabel)
Stores the List of Sentences containing only word senses in this table.

Specified by:
putSenses in interface CorpusTable
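Both putSenses and wordSenses take a senseLabel, and the class defines SENSE_SENTENCE_PREFIX and SENSE_TOKEN_PREFIX as column qualifier prefixes. One plausible reading is that each sense labeling is stored under a qualifier formed from a prefix plus the label, so that several labelings can coexist in one row. The sketch below illustrates that idea only; the SenseColumns helper, the placeholder prefix strings, and the direct concatenation without a separator are all assumptions, since this page does not document the actual layout.

```java
// Hypothetical helper for illustration only; not part of TrinidadTable.
final class SenseColumns {
    private SenseColumns() {}

    /** Builds a column qualifier by appending the sense label to a qualifier prefix. */
    static String qualifierFor(String prefix, String senseLabel) {
        if (senseLabel == null || senseLabel.isEmpty()) {
            throw new IllegalArgumentException("senseLabel must be non-empty");
        }
        return prefix + senseLabel;
    }

    public static void main(String[] args) {
        // Placeholder prefixes; the real SENSE_*_PREFIX values are not shown on this page.
        String sentencePrefix = "sense-sentence-";
        String tokenPrefix = "sense-token-";
        // "hmm" is an arbitrary example label.
        System.out.println(qualifierFor(sentencePrefix, "hmm"));
        System.out.println(qualifierFor(tokenPrefix, "hmm"));
    }
}
```

Under this reading, a later call to wordSenses with the same senseLabel would read back exactly the columns that putSenses wrote for that label.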

putCategories

public void putCategories(org.apache.hadoop.hbase.io.ImmutableBytesWritable key,
                          Set<String> categories)
Stores the categories associated with the document indexed by key.

Specified by:
putCategories in interface CorpusTable

getCategories

public Set<String> getCategories(org.apache.hadoop.hbase.client.Result row)
Returns the set of categories associated with the document in row.

Specified by:
getCategories in interface CorpusTable

putLabel

public void putLabel(org.apache.hadoop.hbase.io.ImmutableBytesWritable key,
                     String labelName,
                     String labelValue)
Stores the labelValue in the column specified by labelName in the row indexed by key.

Specified by:
putLabel in interface CorpusTable

getLabel

public String getLabel(org.apache.hadoop.hbase.client.Result row,
                       String labelName)
Returns the label associated with column labelName inside of row.

Specified by:
getLabel in interface CorpusTable

shouldProcessRow

public boolean shouldProcessRow(org.apache.hadoop.hbase.client.Result row)
Returns true if the given row should be processed.

Specified by:
shouldProcessRow in interface CorpusTable

markRowAsProcessed

public void markRowAsProcessed(org.apache.hadoop.hbase.io.ImmutableBytesWritable key,
                               org.apache.hadoop.hbase.client.Result row)
Marks the row indexed by key as having been processed.

Specified by:
markRowAsProcessed in interface CorpusTable
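shouldProcessRow and markRowAsProcessed suggest a check-then-mark pattern: skip rows that were already handled, do the work, then record the row as done. The following is a minimal sketch of that pattern using in-memory stand-ins so it can run without HBase. RowTracker, InMemoryRowTracker, and ProcessingLoop are hypothetical names, String replaces the ImmutableBytesWritable key and Result row types, and the in-memory set is an assumption, because how TrinidadTable actually records processed rows is not documented here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/** Stand-in for the two row-tracking methods; String replaces the HBase key and row types. */
interface RowTracker {
    boolean shouldProcessRow(String rowKey);
    void markRowAsProcessed(String rowKey);
}

/** In-memory tracker used only for this sketch. */
final class InMemoryRowTracker implements RowTracker {
    private final Set<String> processed = new HashSet<>();

    @Override
    public boolean shouldProcessRow(String rowKey) {
        return !processed.contains(rowKey);
    }

    @Override
    public void markRowAsProcessed(String rowKey) {
        processed.add(rowKey);
    }
}

final class ProcessingLoop {
    private ProcessingLoop() {}

    /** Handles each unprocessed row once and returns the keys that were handled. */
    static List<String> run(Iterator<String> rowKeys, RowTracker tracker) {
        List<String> handled = new ArrayList<>();
        while (rowKeys.hasNext()) {
            String key = rowKeys.next();
            if (!tracker.shouldProcessRow(key)) {
                continue; // already handled, possibly by an earlier run
            }
            handled.add(key); // real per-row work would go here
            tracker.markRowAsProcessed(key);
        }
        return handled;
    }

    public static void main(String[] args) {
        RowTracker tracker = new InMemoryRowTracker();
        System.out.println(run(Arrays.asList("doc1", "doc2", "doc1").iterator(), tracker));
        System.out.println(run(Arrays.asList("doc2", "doc3").iterator(), tracker));
    }
}
```

Marking the row only after the work succeeds means a failed row is retried on the next pass, at the cost of possibly repeating work if the failure happens between the two steps.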

close

public void close()
Closes the connection to the document reader.

Specified by:
close in interface GenericTable


Copyright © 2010-2011. All Rights Reserved.