Overview
The SDARTS Automatic Content Summary Extraction application creates estimated content summaries for SDARTS remote web sources (see paper). The application is J2EE compliant and can be installed on any J2EE-compliant application server. We provide a web application archive ("war" file) to be installed as described below. R, a language and environment for statistical computing and graphics, is required but not provided.
We use a document sampling approach that employs a small number of short, topically focused query probes. These probes are contained in the hierarchy directory included in the distribution. The application performs the sampling, normalizes the results, and places an SDARTS content summary in the appropriate configuration directory of the SDARTS server.
Installation
The distribution
The distribution consists of a directory csextraction containing:
Installation Instructions
Deploy the csextraction.war file in accordance with the instructions for your application server.
In the root directory of the deployed application you will find the csextraction.properties file. Edit the file to reflect the locations of the various directories:
#This contains the configuration parameters for the Automatic Content Summary Extraction
#All of the paths should be absolute.
#The path to the SDARTS server configuration directory
defaultconfigpath=e:/documents/sdarts/config/
#The path where this application will store intermediate profiles
#and find the patch file to run R-Project
workingpath = E:/Documents/working/
#The url for the query interface of the SDARTS server
sdartsurl=http://db.cs.columbia.edu:8080/sdarts
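Because every path in this file must be absolute, it can help to check the file before restarting the server. The following is a minimal sketch of such a check, not code from the distribution: the class ExtractionConfigCheck and its methods are hypothetical, and the "absolute" test is a simple assumption (a leading "/" or a drive letter such as "E:/") chosen so that it behaves the same on Windows and Unix.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper; not part of the SDARTS distribution. */
final class ExtractionConfigCheck {

    /** True for Unix-style ("/x") or Windows drive-letter ("E:/x") paths. */
    static boolean looksAbsolute(String path) {
        return path.startsWith("/") || path.matches("^[A-Za-z]:[/\\\\].*");
    }

    /** Parses properties text and fails fast if a path key is missing or relative. */
    static Properties loadAndCheck(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Only the two path keys shown in the listing above are checked here.
        for (String key : new String[] {"defaultconfigpath", "workingpath"}) {
            String value = props.getProperty(key);
            if (value == null || !looksAbsolute(value.trim())) {
                throw new IllegalArgumentException(
                        key + " must be an absolute path, got: " + value);
            }
        }
        return props;
    }
}
```

Note that java.util.Properties treats a backslash as an escape character, so forward slashes (as in the sample values) are the safer choice for Windows paths.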
Specify the full path to the hierarchy file, including the file name and not just its directory. This permits you to incorporate alternative classifiers.
#The full path to the hierarchy file
#The individual classifier files should be in
#a directory named classifiers rooted in the same directory as the hierarchy file
hierarchyfile =E:/Documents/hierarchy/hierarchy-svm.txt
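The relationship between the hierarchy file and the classifiers directory can be sketched as follows. This is an illustration under assumptions, not the application's own code: the class ClassifierLocator is hypothetical, and it only derives the expected directory from the hierarchyfile value without checking that it exists.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper; not part of the SDARTS distribution. */
final class ClassifierLocator {

    /** Returns the "classifiers" directory that sits next to the hierarchy file. */
    static Path classifiersDir(String hierarchyFile) {
        Path file = Paths.get(hierarchyFile.trim());
        Path parent = file.getParent();
        if (parent == null) {
            // A bare file name carries no directory to look in.
            throw new IllegalArgumentException(
                    "hierarchyfile must include a directory: " + hierarchyFile);
        }
        return parent.resolve("classifiers");
    }
}
```

With the sample value above, the classifier files would be expected in E:/Documents/hierarchy/classifiers.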
The doc sample size is the number of articles retrieved and fully analyzed for each query. We have achieved good results with 4. Increasing this number increases memory consumption as well as the time required to create a summary.
#The number of articles retrieved and fully analyzed for each query in the extraction process
docsamplesize=4
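To get a rough feel for how docsamplesize drives cost, note that the application retrieves at most docsamplesize articles per query, so the total number of articles analyzed is bounded by the number of query probes times docsamplesize. The sketch below makes that arithmetic explicit. The class SampleSizeEstimate is hypothetical, and the probe count is an assumed input, since the number of probes depends on the hierarchy file in use.

```java
/** Hypothetical helper; not part of the SDARTS distribution. */
final class SampleSizeEstimate {

    /** Parses the docsamplesize value and rejects non-positive numbers. */
    static int parseDocSampleSize(String raw) {
        int n = Integer.parseInt(raw.trim());
        if (n < 1) {
            throw new IllegalArgumentException("docsamplesize must be >= 1, got: " + n);
        }
        return n;
    }

    /** Upper bound on articles retrieved: one batch of docSampleSize per probe. */
    static long maxDocumentsRetrieved(int probeCount, int docSampleSize) {
        return (long) probeCount * docSampleSize;
    }
}
```

For example, with an assumed 100 probes, raising docsamplesize from 4 to 8 raises the bound from 400 to 800 articles.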
The iswindows setting selects the scripts used to run the R-Project software. If it is true, the application executes the .bat files in the working directory; otherwise it uses the .sh files there.
#Indicates whether the system is Windows (true). False indicates Linux/Unix
iswindows=true
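The choice between the two kinds of script can be pictured with the sketch below. It is an assumption about how such a switch might be written, not the application's actual code: the class RScriptCommand is hypothetical, as is the script base name passed in, because the names of the shipped .bat and .sh files are not listed here.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; not part of the SDARTS distribution. */
final class RScriptCommand {

    /** Interprets the iswindows property; anything other than "true" means Linux/Unix. */
    static boolean isWindows(String rawValue) {
        return Boolean.parseBoolean(rawValue == null ? null : rawValue.trim());
    }

    /** Builds the command line for the platform-specific script. */
    static List<String> commandFor(boolean isWindows, String scriptBaseName) {
        return isWindows
                ? Arrays.asList("cmd", "/c", scriptBaseName + ".bat")
                : Arrays.asList("sh", scriptBaseName + ".sh");
    }
}
```

A command built this way could be passed to java.lang.ProcessBuilder, with the process directory set to the workingpath location so that the script is found there.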
After you restart your server, the application should be ready to use.