SDARTS
java.lang.Object
  |
  +--edu.columbia.cs.sdarts.backend.doc.lucene.DocumentEnum
An abstract class that enables the
LuceneSearchEngine
to extract the com.lucene.document.Documents from a collection,
when it is constructing an index.
Due to the high memory overhead of loading, parsing, and creating Lucene
Documents, these documents are only extracted in batches.
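The user of this class should do the following:
1. Initialize it with a DocConfig, by calling initialize(DocConfig).
2. Repeatedly call the getDocuments() method,
until the isEmpty() method returns true.
The class writes output to stdout as it runs.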
All of the above functionality, including the batching, working out which files to access, and some post-processing (see below), is built into this class.
The abstract portion is the
createDocument()
method, which creates one Lucene Document from one file.
This portion varies depending on the format of the underlying document.
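The split between built-in batching and a format-specific abstract method can be pictured with a small, self-contained sketch. It is not the SDARTS source: the classes BatchingEnumSketch and UpperCaseEnumSketch, the List-of-names input, and the batch size of 2 are all hypothetical stand-ins, used so that the example runs without Lucene or a DocConfig.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the batching pattern described above; not the SDARTS code.
abstract class BatchingEnumSketch<D> {
    // Illustrative value only; the real BATCH_SIZE is not stated in this page.
    static final int BATCH_SIZE = 2;

    private Iterator<String> pending;

    // Plays the role of initialize(DocConfig): remember which files to read.
    final void initialize(List<String> fileNames) {
        this.pending = new ArrayList<>(fileNames).iterator();
    }

    // True once every file has been handed out in some batch.
    final boolean isEmpty() {
        return pending == null || !pending.hasNext();
    }

    // Returns up to BATCH_SIZE documents, or null when nothing is left.
    final List<D> getDocuments() {
        if (isEmpty()) {
            return null;
        }
        List<D> batch = new ArrayList<>();
        while (batch.size() < BATCH_SIZE && pending.hasNext()) {
            batch.add(createDocument(pending.next()));
        }
        return batch;
    }

    // The format-specific step, left to subclasses.
    abstract D createDocument(String fileName);
}

// Trivial subclass: the "document" is just the upper-cased file name.
class UpperCaseEnumSketch extends BatchingEnumSketch<String> {
    @Override
    String createDocument(String fileName) {
        return fileName.toUpperCase(Locale.ROOT);
    }

    // Drains the enumeration batch by batch and returns the size of each batch.
    static List<Integer> drain(List<String> fileNames) {
        UpperCaseEnumSketch e = new UpperCaseEnumSketch();
        e.initialize(fileNames);
        List<Integer> sizes = new ArrayList<>();
        while (!e.isEmpty()) {
            sizes.add(e.getDocuments().size());
        }
        return sizes;
    }
}
```

With three input names and a batch size of 2, drain returns the sizes 2 and 1. Keeping getDocuments() final in the base class means a subclass can change how one document is built but not how batches are formed.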
This class also includes some helper methods for a developer implementing
createDocument(): the parseDate(String) method,
which turns an incoming String into a format that Lucene
can understand, and the makeValue(String) method, which
replaces XML-special characters such as <, >, and & with their
encoded substitutes. Use these methods frequently inside
an implementation of createDocument().
Writing an implementation of this method is non-trivial and requires knowledge of Lucene itself; visit the Lucene web site to learn how to program with Lucene.
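To make the date helper concrete, here is a minimal sketch of turning a date String into a number. It rests on two assumptions that this page does not confirm: that the input is an ISO-8601 date (yyyy-MM-dd) and that the numeric form is milliseconds since the epoch at UTC midnight. The class DateHelperSketch and the method toEpochMillis are hypothetical names; the real parseDate(String) may accept other formats and throws BackEndException instead.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeParseException;

// Hypothetical illustration of a String-to-long date conversion; not the SDARTS parseDate().
class DateHelperSketch {
    // Assumption: ISO-8601 input, result in epoch milliseconds at UTC midnight.
    static long toEpochMillis(String dateString) {
        try {
            return LocalDate.parse(dateString.trim())
                    .atStartOfDay(ZoneOffset.UTC)
                    .toInstant()
                    .toEpochMilli();
        } catch (DateTimeParseException e) {
            // The real helper reports failures with BackEndException.
            throw new IllegalArgumentException("Unparseable date: " + dateString, e);
        }
    }
}
```

A numeric value of this kind sorts in date order, which is the usual reason for converting dates before indexing them.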
The post-processing specifies default values for certain fields, if they
were not specified in the DocConfig used:
DocConfig, plus
a "/", plus the filename
| Field Summary | |
| static int | BATCH_SIZE - The maximum number of documents to be returned in a batch. |
| Constructor Summary |
| DocumentEnum() |
| Method Summary | |
| abstract com.lucene.document.Document | createDocument(java.io.File file, org.omg.CORBA.IntHolder storeTokenCountHere) - The abstract method specifying how an incoming file is actually parsed into a Lucene Document. |
| DocConfig | getDocConfig() - Return the DocConfig used for initializing. |
| com.lucene.document.Document[] | getDocuments() - Load, parse, and return a batch of Lucene Documents from the underlying collection. |
| void | initialize(DocConfig docConfig) - Create a new DocumentEnum and initialize it with a DocConfig, which tells it how to parse the documents. |
| boolean | isEmpty() - Whether the DocumentEnum has run out of Lucene Documents to return. |
| java.lang.String | makeValue(java.lang.String val) - A helper method for implementations of createDocument(). |
| long | parseDate(java.lang.String dateString) - A helper method for implementations of createDocument(). |
| Methods inherited from class java.lang.Object |
|
| Field Detail |
public static final int BATCH_SIZE
| Constructor Detail |
public DocumentEnum()
| Method Detail |
public final void initialize(DocConfig docConfig)
throws BackEndException
Create a new DocumentEnum and initialize it with
a DocConfig, which tells it how to parse the documents.
Parameters:
docConfig - the DocConfig with which to initialize
public boolean isEmpty()
Whether the DocumentEnum has run out of Lucene
Documents to return.
Returns:
whether the DocumentEnum has run out of Lucene
Documents to return.
public DocConfig getDocConfig()
Return the DocConfig used for initializing.
Returns:
the DocConfig used for initializing
public final com.lucene.document.Document[] getDocuments()
throws BackEndException
Load, parse, and return a batch of Lucene Documents
from the underlying collection.
Returns:
a batch of Lucene Documents, or
null if the DocumentEnum has run out
of Documents.
Throws:
BackEndException - if something goes wrong
public abstract com.lucene.document.Document createDocument(java.io.File file,
org.omg.CORBA.IntHolder storeTokenCountHere)
throws BackEndException
The abstract method specifying how an incoming file is actually parsed into a Lucene
Document. An implementation of
this method ought to make use of the parseDate()
and makeValue() methods.
Parameters:
file - the File to turn into a Lucene
Document
storeTokenCountHere - an OUT parameter; an implementor of this
method should write the number of tokens in the file into the
value field of this IntHolder
Returns:
the Lucene Document generated from the file
Throws:
BackEndException - if something goes wrong
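The OUT-parameter convention used by storeTokenCountHere can be shown in isolation. The sketch below uses a hypothetical IntHolderSketch class with the same shape as org.omg.CORBA.IntHolder (a public int field named value), because the CORBA classes are no longer bundled with recent JDKs. Counting tokens by splitting on whitespace is an assumption made for illustration; how a real implementation defines a token depends on its own analysis of the file.

```java
// Hypothetical stand-in with the same shape as org.omg.CORBA.IntHolder.
class IntHolderSketch {
    public int value;
}

// Hypothetical illustration of reporting a token count through an OUT parameter.
class TokenCountSketch {
    // Returns the trimmed text and writes the whitespace-separated token
    // count into the holder, mirroring the storeTokenCountHere convention.
    static String readText(String rawText, IntHolderSketch storeTokenCountHere) {
        String trimmed = rawText.trim();
        storeTokenCountHere.value = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        return trimmed;
    }
}
```

The caller creates the holder, passes it in, and reads its value field afterwards; the method's return value stays free for the document itself.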
public final long parseDate(java.lang.String dateString)
throws BackEndException
A helper method for implementations of createDocument().
Turns an incoming String into a numerical format that
Lucene can understand. This should be applied to all Strings
found in a file being parsed that are going into date fields.
Parameters:
dateString - the String storing some kind of date
public java.lang.String makeValue(java.lang.String val)
A helper method for implementations of createDocument().
Builds a String from the incoming parameter,
and replaces <, >, and & characters with their entity names
so they do not form illegal XML expressions. Apply this to nearly
every String you read from the file!
Parameters:
val - the String to clean up
Returns:
a String whose illegal XML characters have been
replaced by encodings.
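The kind of replacement that makeValue() describes can be sketched as follows. This is not the SDARTS implementation: XmlEscapeSketch and escape are hypothetical names, and the sketch covers only the three characters named above (<, >, and &).

```java
// Hypothetical illustration of XML entity escaping; not the SDARTS makeValue().
class XmlEscapeSketch {
    // Replaces <, >, and & with their entity names in a single pass.
    static String escape(String val) {
        StringBuilder out = new StringBuilder(val.length());
        for (int i = 0; i < val.length(); i++) {
            char c = val.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }
}
```

Scanning the input once, character by character, avoids the double-escaping that can occur with chained replacements, where an & introduced by an earlier substitution is escaped again by a later one.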