SDARTS

edu.columbia.cs.sdarts.backend.doc.lucene
Class DocumentEnum

java.lang.Object
  |
  +--edu.columbia.cs.sdarts.backend.doc.lucene.DocumentEnum
Direct Known Subclasses:
TextDocumentEnum, XMLDocumentEnum

public abstract class DocumentEnum
extends java.lang.Object

An abstract class that enables the LuceneSearchEngine to extract com.lucene.document.Document objects from a collection while it is constructing an index.

Due to the high memory overhead of loading, parsing, and creating Lucene Documents, these documents are only extracted in batches. The user of this class should do the following:

This class reports its progress to stdout as it runs.

All of the above functionality, including the batching, understanding which files to access, and some postprocessing (see below) is built into this class.
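
The batching contract can be sketched as follows. This is a minimal illustration inferred from the method summary (isEmpty() and getDocuments(), which returns null once the collection is exhausted), not the SDARTS implementation. BatchSourceSketch and BatchLoopSketch are hypothetical stand-ins so the example compiles on its own, Strings stand in for Lucene Documents, and the BATCH_SIZE value of 3 is assumed purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for DocumentEnum. The real class returns
// com.lucene.document.Document[] and may throw BackEndException.
class BatchSourceSketch {
    static final int BATCH_SIZE = 3; // assumed value, for illustration only
    private final int total;
    private int next = 0;

    BatchSourceSketch(int total) {
        this.total = total;
    }

    // True once every document has been handed out.
    boolean isEmpty() {
        return next >= total;
    }

    // Returns the next batch of at most BATCH_SIZE items, or null when exhausted.
    String[] getDocuments() {
        if (isEmpty()) {
            return null;
        }
        int n = Math.min(BATCH_SIZE, total - next);
        String[] batch = new String[n];
        for (int i = 0; i < n; i++) {
            batch[i] = "doc-" + next;
            next++;
        }
        return batch;
    }
}

class BatchLoopSketch {
    // Drains the source batch by batch, as an indexer might.
    static List<String> drain(BatchSourceSketch source) {
        List<String> indexed = new ArrayList<>();
        while (!source.isEmpty()) {
            String[] batch = source.getDocuments();
            if (batch == null) {
                break; // defensive: null also signals exhaustion
            }
            for (String doc : batch) {
                indexed.add(doc); // a real caller would add the Document to the index here
            }
        }
        return indexed;
    }
}
```

Only one batch is held in memory at a time, which is the point of extracting documents in batches.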

The abstract portion is the createDocument() method, which creates one Lucene Document from one file. This portion varies depending on the format of the underlying document. This class also includes two helper methods for a developer implementing createDocument(): the parseDate(String) method, which turns an incoming String into a format that Lucene can understand, and the makeValue(String) method, which replaces characters that are illegal in XML, such as <, >, and &, with their encoded substitutes. You should use these methods frequently inside an implementation of createDocument().

Writing an implementation of this method is non-trivial and requires knowledge of Lucene itself. Visit the Lucene web site to learn how to program with Lucene.

The post-processing specifies default values for certain fields, if they were not specified in the DocConfig used:


Field Summary
static int BATCH_SIZE
          The maximum number of documents to be returned in a batch.
 
Constructor Summary
DocumentEnum()
           
 
Method Summary
abstract  com.lucene.document.Document createDocument(java.io.File file, org.omg.CORBA.IntHolder storeTokenCountHere)
          The abstract method specifying how an incoming file is actually parsed into a Lucene Document.
 DocConfig getDocConfig()
          Returns the DocConfig used for initialization.
 com.lucene.document.Document[] getDocuments()
          Load, parse, and return a batch of Lucene Documents from the underlying collection.
 void initialize(DocConfig docConfig)
          Initialize this DocumentEnum with a DocConfig, which tells it how to parse the documents.
 boolean isEmpty()
          Whether the DocumentEnum has run out of Lucene Documents to return.
 java.lang.String makeValue(java.lang.String val)
          A helper method for implementations of createDocument().
 long parseDate(java.lang.String dateString)
          A helper method for implementations of createDocument().
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

BATCH_SIZE

public static final int BATCH_SIZE
The maximum number of documents to be returned in a batch. This value is currently fixed, but it may be determined dynamically in a later release.
Constructor Detail

DocumentEnum

public DocumentEnum()
Method Detail

initialize

public final void initialize(DocConfig docConfig)
                      throws BackEndException
Initialize this DocumentEnum with a DocConfig, which tells it how to parse the documents.
Parameters:
docConfig - the DocConfig with which to initialize

isEmpty

public boolean isEmpty()
Whether the DocumentEnum has run out of Lucene Documents to return.
Returns:
whether the DocumentEnum has run out of Lucene Documents to return.

getDocConfig

public DocConfig getDocConfig()
Returns the DocConfig used for initialization.
Returns:
the DocConfig used for initialization

getDocuments

public final com.lucene.document.Document[] getDocuments()
                                                  throws BackEndException
Load, parse, and return a batch of Lucene Documents from the underlying collection.
Returns:
another batch of Lucene Documents, or null if the DocumentEnum has run out of Documents.
Throws:
BackEndException - if something goes wrong

createDocument

public abstract com.lucene.document.Document createDocument(java.io.File file,
                                                            org.omg.CORBA.IntHolder storeTokenCountHere)
                                                     throws BackEndException
The abstract method specifying how an incoming file is actually parsed into a Lucene Document. An implementation of this method ought to make use of the parseDate() and makeValue() methods.
Parameters:
file - the File to turn into a Lucene Document
storeTokenCountHere - an OUT parameter; an implementor of this method should write the number of tokens in the file into the value field of this IntHolder
Returns:
a Lucene Document generated from the file
Throws:
BackEndException - if something goes wrong
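
The OUT-parameter convention can be illustrated with a small sketch. This is not an SDARTS implementation: IntHolderSketch is a hypothetical stand-in for org.omg.CORBA.IntHolder (which exposes a public int field named value), a Map stands in for the Lucene Document, the field name "body" is invented, and tokens are assumed to be whitespace-separated words. A real implementation would throw BackEndException rather than an unchecked exception.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for org.omg.CORBA.IntHolder.
class IntHolderSketch {
    public int value;
}

class CreateDocumentSketch {
    // Stand-in for createDocument(File, IntHolder): returns the parsed
    // "document" and writes the token count into the holder's value field.
    static Map<String, String> createDocument(File file, IntHolderSketch storeTokenCountHere) {
        String text;
        try {
            text = new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
        } catch (IOException e) {
            // The real method would report this as a BackEndException.
            throw new UncheckedIOException(e);
        }

        // Assumed token definition: whitespace-separated words.
        String trimmed = text.trim();
        storeTokenCountHere.value = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;

        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("body", text); // hypothetical field name
        return doc;
    }

    // Writes a small temporary file, parses it, and returns the reported token count.
    static int demo() {
        try {
            File tmp = File.createTempFile("sdarts-sketch", ".txt");
            tmp.deleteOnExit();
            Files.write(tmp.toPath(), "one two  three\nfour".getBytes(StandardCharsets.UTF_8));
            IntHolderSketch holder = new IntHolderSketch();
            createDocument(tmp, holder);
            return holder.value;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The holder lets one call hand back two results: the document as the return value and the token count through the holder's field.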

parseDate

public final long parseDate(java.lang.String dateString)
                     throws BackEndException
A helper method for implementations of createDocument(). Turns an incoming String into a numerical format that Lucene can understand. This should be applied to all Strings found in a file being parsed that are going into date fields.
Parameters:
dateString - the String storing some kind of date
Returns:
a numerical encoding of this that Lucene can understand
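
As an illustration of turning a date String into a number, here is a minimal sketch. The date formats that parseDate() actually accepts and the numeric encoding it returns are not documented on this page, so both are assumptions here: the input is taken to be an ISO-8601 calendar date, and the output is milliseconds since the Unix epoch at midnight UTC. ParseDateSketch is a hypothetical name, and the real method reports failures as BackEndException.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeParseException;

class ParseDateSketch {
    // Assumed input: ISO-8601 calendar dates such as "2001-06-15".
    // Assumed output: milliseconds since the Unix epoch, at midnight UTC.
    static long parseDate(String dateString) {
        try {
            return LocalDate.parse(dateString.trim())
                    .atStartOfDay(ZoneOffset.UTC)
                    .toInstant()
                    .toEpochMilli();
        } catch (DateTimeParseException e) {
            // Stand-in for the BackEndException thrown by the real method.
            throw new IllegalArgumentException("Unparseable date: " + dateString, e);
        }
    }
}
```

A numeric encoding of this kind keeps chronological order, so dates can be compared as plain numbers.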

makeValue

public java.lang.String makeValue(java.lang.String val)
A helper method for implementations of createDocument(). Builds a String from the incoming parameter, replacing <, >, and & characters with their entity names so that they do not form illegal XML expressions. Apply this to nearly every String you read from the file!
Parameters:
val - the String to clean up
Returns:
a String in which the characters that are illegal in XML have been replaced by their entity encodings.
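
The replacement described above can be sketched as a single pass over the input. This is an illustrative version only, assuming the standard entity names &lt;, &gt;, and &amp;. MakeValueSketch and escape are hypothetical names, and the real makeValue() may handle additional cases.

```java
class MakeValueSketch {
    // Replaces <, >, and & with their XML entity names.
    static String escape(String val) {
        StringBuilder out = new StringBuilder(val.length());
        for (int i = 0; i < val.length(); i++) {
            char c = val.charAt(i);
            switch (c) {
                case '&':
                    out.append("&amp;");
                    break;
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }
}
```

Working character by character avoids the double-escaping that chained string replacements cause when & is not handled first.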
