gov.llnl.ontology.text.corpora
Class UkWacDocumentReader
java.lang.Object
gov.llnl.ontology.text.corpora.UkWacDocumentReader
- All Implemented Interfaces:
- DocumentReader
- Direct Known Subclasses:
- WackypediaDocumentReader
public class UkWacDocumentReader
- extends Object
- implements DocumentReader
A DocumentReader for the parsed UkWac corpus. Documents are expected
to be delimited with "<text>" tags in an XML format. Each sentence is
expected to be in the UkWac CoNLL format. The URL in the id
attribute of the text tag is the key and text, and its hash value is the id
for each document.
This class is not thread safe.
- Author:
- Keith Stevens
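The input layout described above can be illustrated with a small, self-contained sketch. It is not the library's parser: the class UkWacFormatSketch, the holder ParsedDoc, the exact tag shape (<text id="URL"> ... </text>), the tab-separated token rows with the word form in the first column, and the use of String.hashCode() as the "hash value" are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UkWacFormatSketch {
    // Assumed document shape: <text id="URL"> ... </text>
    private static final Pattern TEXT_BLOCK =
        Pattern.compile("<text id=\"([^\"]*)\">(.*?)</text>", Pattern.DOTALL);

    /** Hypothetical holder for the pieces pulled out of one document. */
    public static final class ParsedDoc {
        public final String key;          // URL taken from the id attribute
        public final long id;             // assumed here to be String.hashCode() of the key
        public final List<String> tokens; // first column of each token row

        ParsedDoc(String key, long id, List<String> tokens) {
            this.key = key;
            this.id = id;
            this.tokens = tokens;
        }
    }

    /** Splits a corpus string into documents and collects their word forms. */
    public static List<ParsedDoc> parse(String corpus) {
        List<ParsedDoc> docs = new ArrayList<>();
        Matcher m = TEXT_BLOCK.matcher(corpus);
        while (m.find()) {
            String key = m.group(1);
            List<String> tokens = new ArrayList<>();
            for (String line : m.group(2).split("\n")) {
                line = line.trim();
                // Skip blank lines and markup lines such as <s> and </s>.
                if (line.isEmpty() || line.startsWith("<")) {
                    continue;
                }
                // Assumed CoNLL-style row: the first tab-separated column is the word form.
                tokens.add(line.split("\t")[0]);
            }
            docs.add(new ParsedDoc(key, key.hashCode(), tokens));
        }
        return docs;
    }

    public static void main(String[] args) {
        String sample = "<text id=\"http://example.org/a\">\n"
            + "<s>\n"
            + "The\tthe\tDT\t1\t2\tNMOD\n"
            + "cat\tcat\tNN\t2\t0\tROOT\n"
            + "</s>\n"
            + "</text>\n";
        for (ParsedDoc d : parse(sample)) {
            System.out.println(d.key + " " + d.id + " " + d.tokens);
        }
    }
}
```

In real use, the string between the opening and closing text tags would be handed to readDocument rather than parsed by hand.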
Methods inherited from class java.lang.Object:
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
CORPUS_NAME
public static final String CORPUS_NAME
- See Also:
- Constant Field Values
UkWacDocumentReader
public UkWacDocumentReader()
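Because the class is documented as not thread safe and exposes a public no-argument constructor, one possible pattern is to give each worker thread its own reader instance. The sketch below shows that pattern with ThreadLocal. StubReader is a hypothetical stand-in, not the real class, and PerThreadReaderSketch and readAll are illustrative names; with the library on the classpath, StubReader::new would presumably be swapped for UkWacDocumentReader::new.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerThreadReaderSketch {
    /** Hypothetical stand-in for a reader that keeps mutable state between calls. */
    public static final class StubReader {
        private final StringBuilder scratch = new StringBuilder();

        public String read(String doc) {
            // Reusing the buffer is what makes sharing one instance across threads unsafe.
            scratch.setLength(0);
            scratch.append(doc.trim());
            return scratch.toString();
        }
    }

    // Each thread lazily creates and then reuses its own reader.
    private static final ThreadLocal<StubReader> READER =
        ThreadLocal.withInitial(StubReader::new);

    /** Reads every raw document on a fixed pool, preserving input order. */
    public static List<String> readAll(List<String> rawDocs, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String raw : rawDocs) {
                Callable<String> task = () -> READER.get().read(raw);
                pending.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pending) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> raw = new ArrayList<>();
        raw.add(" first document ");
        raw.add("second document ");
        System.out.println(readAll(raw, 2));
    }
}
```

Creating a fresh reader per task would also work; the ThreadLocal variant simply avoids repeated construction.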
corpusName
public String corpusName()
- Returns CORPUS_NAME.
readDocument
public Document readDocument(String doc)
- Returns a Document represented by the given string.
- Specified by:
readDocument in interface DocumentReader
readDocument
public Document readDocument(String doc,
String corpusName)
- Returns a Document represented by the given string and uses corpusName as the corpus name for the returned Document.
- Specified by:
readDocument in interface DocumentReader
Copyright © 2010-2011. All Rights Reserved.