gov.llnl.ontology.mapreduce.table
Interface CorpusTable

All Superinterfaces:
GenericTable
All Known Implementing Classes:
NYT03Table, TrinidadTable

public interface CorpusTable
extends GenericTable

An interface for interacting with a document-based HBase table. The HBase table should have at least three key values for each row: the raw document text, the name of the corpus from which the text came, and a dependency parse tree. This interface gives all extraction code a fixed way to access these data values. Each data piece must be extractable from a Result instance. Each Result must also refer to only one document, from a single source.

All implementations should have a no-argument constructor, since the DocumentReaders are often instantiated through reflection. Implementations of all methods, except for setupScan, should also be stateless and thread-safe. The accessor methods will be called from multiple threads in no particular order.
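
The no-argument constructor requirement can be illustrated with a minimal sketch. It does not use the real CorpusTable (which needs HBase on the classpath); TableLike, DemoTable, and load are hypothetical stand-ins, and the loading code in this library may differ.

```java
public class ReflectiveLoaderDemo {

    // Hypothetical stand-in for CorpusTable; not part of the documented API.
    interface TableLike {
        String tableName();
    }

    // An implementation with an implicit public no-argument constructor.
    public static class DemoTable implements TableLike {
        public String tableName() {
            return "demo";
        }
    }

    // Creates an instance from a class name alone. This only works when the
    // class exposes a constructor that takes no arguments.
    static TableLike load(String className) {
        try {
            Class<?> clazz = Class.forName(className);
            return (TableLike) clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        TableLike table = load(DemoTable.class.getName());
        System.out.println(table.tableName()); // prints "demo"
    }
}
```

If DemoTable declared only a constructor with parameters, getDeclaredConstructor() would throw NoSuchMethodException, which is why reflection-based loading needs the no-argument form.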

Author:
Keith Stevens

Method Summary
 Document document(org.apache.hadoop.hbase.client.Result row)
          Returns the Document associated with this row.
 Set<String> getCategories(org.apache.hadoop.hbase.client.Result row)
          Returns the set of categories associated with the document in row.
 String getLabel(org.apache.hadoop.hbase.client.Result row, String labelName)
          Returns the label associated with column labelName inside of row.
 void markRowAsProcessed(org.apache.hadoop.hbase.io.ImmutableBytesWritable key, org.apache.hadoop.hbase.client.Result row)
          Marks the row indexed by key as having been processed.
 void put(Document document)
          Stores the text of Document in this CorpusTable.
 void put(org.apache.hadoop.hbase.io.ImmutableBytesWritable key, List<Sentence> sentences)
          Stores the List of Sentences in this table.
 void putCategories(org.apache.hadoop.hbase.io.ImmutableBytesWritable key, Set<String> categories)
          Stores the categories associated with the document indexed by key.
 void putLabel(org.apache.hadoop.hbase.io.ImmutableBytesWritable key, String labelName, String labelValue)
          Stores the labelValue in the column specified by labelName in the row indexed by key.
 void putSenses(org.apache.hadoop.hbase.io.ImmutableBytesWritable key, List<Sentence> senses, String senseLabel)
          Stores the List of Sentences containing only word senses in this table.
 List<Sentence> sentences(org.apache.hadoop.hbase.client.Result row)
          Returns the List of Sentences stored in row.
 boolean shouldProcessRow(org.apache.hadoop.hbase.client.Result row)
          Returns true if the given row should be processed.
 String sourceCorpus(org.apache.hadoop.hbase.client.Result row)
          Returns the name of the source corpus for the document in this row.
 String text(org.apache.hadoop.hbase.client.Result row)
          Returns the cleaned text stored by the given row.
 String textSource(org.apache.hadoop.hbase.client.Result row)
          Returns the raw document text stored in row.
 String title(org.apache.hadoop.hbase.client.Result row)
          Returns the title of the document stored in row.
 List<Sentence> wordSenses(org.apache.hadoop.hbase.client.Result row, String labelName)
          Returns the List of Sentences stored in row that correspond to the word senses created with labelName.
 
Methods inherited from interface gov.llnl.ontology.mapreduce.table.GenericTable
close, createTable, createTable, iterator, setupScan, setupScan, table, tableName
 

Method Detail

text

String text(org.apache.hadoop.hbase.client.Result row)
Returns the cleaned text stored by the given row.


textSource

String textSource(org.apache.hadoop.hbase.client.Result row)
Returns the raw document text stored in row.


title

String title(org.apache.hadoop.hbase.client.Result row)
Returns the title of the document stored in row.


sourceCorpus

String sourceCorpus(org.apache.hadoop.hbase.client.Result row)
Returns the name of the source corpus for the document in this row.


sentences

List<Sentence> sentences(org.apache.hadoop.hbase.client.Result row)
Returns the List of Sentences stored in row. This call will include all annotations requested in the setup call to GenericTable.setupScan(org.apache.hadoop.hbase.client.Scan).


wordSenses

List<Sentence> wordSenses(org.apache.hadoop.hbase.client.Result row,
                          String labelName)
Returns the List of Sentences stored in row that correspond to the word senses created with labelName.


document

Document document(org.apache.hadoop.hbase.client.Result row)
Returns the Document associated with this row.


put

void put(Document document)
Stores the text of Document in this CorpusTable.


put

void put(org.apache.hadoop.hbase.io.ImmutableBytesWritable key,
         List<Sentence> sentences)
Stores the List of Sentences in this table. Implementations are welcome to store this List as a complete object or as a separate set of smaller Annotations.


putSenses

void putSenses(org.apache.hadoop.hbase.io.ImmutableBytesWritable key,
               List<Sentence> senses,
               String senseLabel)
Stores the List of Sentences containing only word senses in this table.


putLabel

void putLabel(org.apache.hadoop.hbase.io.ImmutableBytesWritable key,
              String labelName,
              String labelValue)
Stores the labelValue in the column specified by labelName in the row indexed by key.


getLabel

String getLabel(org.apache.hadoop.hbase.client.Result row,
                String labelName)
Returns the label associated with column labelName inside of row.
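
The signatures of putLabel and getLabel are asymmetric: writes are addressed by row key, while reads operate on a row that has already been fetched. A minimal in-memory sketch of that shape follows. It uses plain maps instead of HBase, and every name in it (LabelRoundTripDemo, row, the String key type) is a hypothetical stand-in, not the library's implementation. Returning null for a missing label is a choice made by this sketch; the interface does not say what happens in that case.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class LabelRoundTripDemo {

    // In-memory stand-in for the table: row key -> (column name -> value).
    private final Map<String, Map<String, String>> rows = new HashMap<>();

    // Analogue of putLabel: the write is addressed by row key.
    void putLabel(String key, String labelName, String labelValue) {
        rows.computeIfAbsent(key, k -> new HashMap<>()).put(labelName, labelValue);
    }

    // Stand-in for fetching a row, the role a Result plays in the real API.
    Map<String, String> row(String key) {
        return rows.getOrDefault(key, Collections.emptyMap());
    }

    // Analogue of getLabel: the read works on an already fetched row.
    String getLabel(Map<String, String> row, String labelName) {
        return row.get(labelName);
    }

    public static void main(String[] args) {
        LabelRoundTripDemo table = new LabelRoundTripDemo();
        table.putLabel("doc1", "topic", "sports");
        System.out.println(table.getLabel(table.row("doc1"), "topic")); // prints "sports"
    }
}
```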


putCategories

void putCategories(org.apache.hadoop.hbase.io.ImmutableBytesWritable key,
                   Set<String> categories)
Stores the categories associated with the document indexed by key.


getCategories

Set<String> getCategories(org.apache.hadoop.hbase.client.Result row)
Returns the set of categories associated with the document in row.


shouldProcessRow

boolean shouldProcessRow(org.apache.hadoop.hbase.client.Result row)
Returns true if the given row should be processed.


markRowAsProcessed

void markRowAsProcessed(org.apache.hadoop.hbase.io.ImmutableBytesWritable key,
                        org.apache.hadoop.hbase.client.Result row)
Marks the row indexed by key as having been processed.
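
One plausible way to pair shouldProcessRow with markRowAsProcessed is to skip rows that an earlier pass has already handled. The sketch below shows that pattern with an in-memory map in place of HBase. The "processed" column, the String keys, and the runPass helper are assumptions made for illustration; actual implementations may decide which rows to process by different criteria.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ProcessedMarkerDemo {

    // Hypothetical column name used to flag finished rows.
    static final String PROCESSED = "processed";

    // In-memory stand-in for the table: row key -> (column name -> value).
    final Map<String, Map<String, String>> rows = new LinkedHashMap<>();

    // Analogue of shouldProcessRow: true when the row carries no flag yet.
    boolean shouldProcessRow(Map<String, String> row) {
        return !row.containsKey(PROCESSED);
    }

    // Analogue of markRowAsProcessed: sets the flag on the row at key.
    void markRowAsProcessed(String key) {
        rows.get(key).put(PROCESSED, "true");
    }

    // Visits each unprocessed row once and returns the keys it handled.
    List<String> runPass() {
        List<String> handled = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : rows.entrySet()) {
            if (shouldProcessRow(entry.getValue())) {
                handled.add(entry.getKey()); // real extraction work would go here
                markRowAsProcessed(entry.getKey());
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        ProcessedMarkerDemo table = new ProcessedMarkerDemo();
        table.rows.put("doc1", new HashMap<>());
        table.rows.put("doc2", new HashMap<>());
        System.out.println(table.runPass()); // prints [doc1, doc2]
        System.out.println(table.runPass()); // prints []
    }
}
```

Under this reading, a job that is restarted picks up only the rows that were not yet marked.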



Copyright © 2010-2011. All Rights Reserved.