gov.llnl.ontology.text.corpora
Class UkWacDocumentReader
java.lang.Object
gov.llnl.ontology.text.corpora.UkWacDocumentReader
- All Implemented Interfaces:
- DocumentReader
- Direct Known Subclasses:
- WackypediaDocumentReader
public class UkWacDocumentReader
- extends Object
- implements DocumentReader
A DocumentReader for the parsed UkWac corpus. Documents are expected
to be delimited with "<text>" tags in an XML format. Each sentence is
expected to be in the UkWac CoNLL format. The URL in the id
attribute of text is the key and text, and its hash value is the id
for each document.
This is not thread safe.
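The key and id described above can be illustrated with a small, self-contained sketch. It is not the implementation used by this class: the class name UkWacHeaderSketch, the regular expression, and the use of String.hashCode() as the "hash value" are all assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only; not the UkWacDocumentReader implementation. */
final class UkWacHeaderSketch {

    // Assumed shape of the opening tag: <text id="...">
    private static final Pattern TEXT_ID =
        Pattern.compile("<text\\s+id=\"([^\"]*)\"");

    /** Returns the id attribute value, or null if no <text id="..."> tag is found. */
    static String extractKey(String doc) {
        Matcher m = TEXT_ID.matcher(doc);
        return m.find() ? m.group(1) : null;
    }

    /** Assumption: the document id is String.hashCode() of the key. */
    static int idFor(String key) {
        return key.hashCode();
    }

    public static void main(String[] args) {
        // Hypothetical document; sentence content is omitted.
        String doc = "<text id=\"http://example.org/page\">\n</text>";
        String key = extractKey(doc);
        System.out.println(key + " -> " + idFor(key));
    }
}
```

Because the id is derived from the key, two documents with the same URL would receive the same id under this assumption.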
- Author:
- Keith Stevens
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
CORPUS_NAME
public static final String CORPUS_NAME
- See Also:
- Constant Field Values
UkWacDocumentReader
public UkWacDocumentReader()
corpusName
public String corpusName()
- Returns CORPUS_NAME.
readDocument
public Document readDocument(String doc)
- Returns a Document represented by the given string.
- Specified by:
- readDocument in interface DocumentReader
readDocument
public Document readDocument(String doc,
String corpusName)
- Returns a Document represented by the given string, using corpusName as the corpus name for the returned Document.
- Specified by:
- readDocument in interface DocumentReader
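The sketch below shows one way the two readDocument overloads might be used together, and how a caller could work around the lack of thread safety by keeping one reader per thread. Everything in it is a hypothetical stand-in: the Reader and Doc classes are not the real UkWacDocumentReader and Document, the CORPUS_NAME value is a placeholder, and the fallback from the one-argument overload to corpusName() is an assumption, not documented behavior.

```java
/** Hypothetical stand-ins; the real types live in gov.llnl.ontology.text. */
final class ReaderUsageSketch {

    /** Stand-in for Document: records only the corpus name and raw text. */
    static final class Doc {
        final String corpusName;
        final String rawText;

        Doc(String corpusName, String rawText) {
            this.corpusName = corpusName;
            this.rawText = rawText;
        }
    }

    /** Stand-in reader mirroring the three public methods listed on this page. */
    static class Reader {
        // Placeholder; the real constant value is not shown on this page.
        static final String CORPUS_NAME = "example-corpus";

        String corpusName() {
            return CORPUS_NAME;
        }

        // Assumption: the one-argument overload falls back to corpusName().
        Doc readDocument(String doc) {
            return readDocument(doc, corpusName());
        }

        Doc readDocument(String doc, String corpusName) {
            return new Doc(corpusName, doc);
        }
    }

    // One reader per thread, since the real class is documented as not thread safe.
    static final ThreadLocal<Reader> READER = ThreadLocal.withInitial(Reader::new);

    public static void main(String[] args) {
        Reader reader = READER.get();
        Doc byDefault = reader.readDocument("<text id=\"http://example.org/\"></text>");
        Doc renamed = reader.readDocument("<text id=\"http://example.org/\"></text>", "my-corpus");
        System.out.println(byDefault.corpusName);
        System.out.println(renamed.corpusName);
    }
}
```

A ThreadLocal is only one option; creating a new reader inside each worker task would serve the same purpose.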
Copyright © 2010-2011. All Rights Reserved.