gov.llnl.ontology.text.corpora
Class UkWacDocumentReader

java.lang.Object
  extended by gov.llnl.ontology.text.corpora.UkWacDocumentReader
All Implemented Interfaces:
DocumentReader
Direct Known Subclasses:
WackypediaDocumentReader

public class UkWacDocumentReader
extends Object
implements DocumentReader

A DocumentReader for the parsed UkWac corpus. Documents are expected to be delimited by <text> tags in an XML format. Each sentence is expected to be in the UkWac CoNLL format. The URL in the id attribute of the <text> tag is the key for each document, and its hash value is the document's id.
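
To make the expected input concrete, here is a minimal, self-contained sketch of how such a document might be split into a key, an id, and tokens. It is not this class's implementation. The names UkWacFormatSketch, ParsedDoc, and parse are hypothetical, the six tab-separated columns (word, lemma, POS, index, head, relation) are an assumption about the UkWac CoNLL layout, and String.hashCode() is only one possible reading of "hash value".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not the library's actual parser. */
final class UkWacFormatSketch {

    /** Minimal parse result: the URL key, a numeric id, and the word forms. */
    static final class ParsedDoc {
        final String key;
        final long id;
        final List<String> tokens;

        ParsedDoc(String key, long id, List<String> tokens) {
            this.key = key;
            this.id = id;
            this.tokens = tokens;
        }
    }

    // Matches the opening tag and captures the value of its id attribute.
    private static final Pattern TEXT_OPEN =
            Pattern.compile("<text\\s+id=\"([^\"]*)\"\\s*>");

    static ParsedDoc parse(String doc) {
        String key = null;
        List<String> tokens = new ArrayList<>();
        for (String rawLine : doc.split("\n")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue;
            }
            Matcher m = TEXT_OPEN.matcher(line);
            if (m.matches()) {
                key = m.group(1);
                continue;
            }
            // Lines without tabs that start with '<' are treated as markup
            // (for example </text>, <s>, </s>) and skipped.
            if (line.startsWith("<") && !line.contains("\t")) {
                continue;
            }
            // Assumed column layout: word, lemma, POS, index, head, relation.
            tokens.add(line.split("\t")[0]);
        }
        if (key == null) {
            throw new IllegalArgumentException("missing <text id=\"...\"> tag");
        }
        // Assumption: the id is derived with String.hashCode().
        return new ParsedDoc(key, key.hashCode(), tokens);
    }

    public static void main(String[] args) {
        String sample = "<text id=\"ukwac:http://example.org/a\">\n"
                + "<s>\n"
                + "The\tthe\tDT\t1\t2\tNMOD\n"
                + "cat\tcat\tNN\t2\t0\tROOT\n"
                + "</s>\n"
                + "</text>";
        ParsedDoc parsed = parse(sample);
        System.out.println(parsed.key + " " + parsed.tokens);
    }
}
```

The sample URL and token lines are invented for illustration and are not taken from the corpus.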

This class is not thread safe.
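
One common way to use a reader that is not thread safe from several threads is to give each thread its own instance. The sketch below shows that pattern with ThreadLocal. StatefulReader is a hypothetical stand-in with mutable internal state, used so the example compiles on its own; it is not UkWacDocumentReader, and whether this pattern suits a given pipeline is an assumption to verify.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

/** Hypothetical illustration of per-thread reader instances. */
final class PerThreadReaderSketch {

    /** Stand-in for a reader that reuses an internal buffer between calls. */
    static final class StatefulReader {
        private final StringBuilder buffer = new StringBuilder();

        String read(String doc) {
            buffer.setLength(0);          // mutable state: unsafe if shared
            buffer.append(doc.trim());
            return buffer.toString();
        }
    }

    // Each thread lazily creates and then keeps its own reader.
    private static final ThreadLocal<StatefulReader> READER =
            ThreadLocal.withInitial(StatefulReader::new);

    static StatefulReader reader() {
        return READER.get();
    }

    /** Starts the given number of threads and counts distinct reader instances. */
    static int distinctReaders(int threads) {
        Set<StatefulReader> seen = Collections.synchronizedSet(
                Collections.newSetFromMap(
                        new IdentityHashMap<StatefulReader, Boolean>()));
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                StatefulReader r = reader();
                r.read(" some document ");
                seen.add(r);
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctReaders(4));
    }
}
```

Creating a new reader inside each task is an equally simple alternative when construction is cheap.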

Author:
Keith Stevens

Field Summary
static String CORPUS_NAME
           
 
Constructor Summary
UkWacDocumentReader()
           
 
Method Summary
 String corpusName()
          Returns CORPUS_NAME.
 Document readDocument(String doc)
          Returns a Document represented by the given string.
 Document readDocument(String doc, String corpusName)
          Returns a Document represented by the given string and uses corpusName as the corpus name for the returned Document.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

CORPUS_NAME

public static final String CORPUS_NAME
See Also:
Constant Field Values
Constructor Detail

UkWacDocumentReader

public UkWacDocumentReader()
Method Detail

corpusName

public String corpusName()
Returns CORPUS_NAME.


readDocument

public Document readDocument(String doc)
Returns a Document represented by the given string.

Specified by:
readDocument in interface DocumentReader

readDocument

public Document readDocument(String doc,
                             String corpusName)
Returns a Document represented by the given string and uses corpusName as the corpus name for the returned Document.

Specified by:
readDocument in interface DocumentReader


Copyright © 2010-2011. All Rights Reserved.