Parallel

Scott Alexander
Advisor: Dr. Nancy McCracken

A 1994 NPAC REU Project


Abstract

A parallel system for finding and indexing key terms in a document is described. With the constantly increasing amount and availability of information, a reliable and efficient method of document retrieval is quickly becoming a necessity for research. DR-LINK is one such document retrieval tool; it returns a list of documents relevant to a user's query. One way that this is accomplished is by comparing the key terms in the query to the key terms in the documents, which are identified by a process called indexing. It is shown that the efficiency of indexing terms is increased by developing a parallel implementation of the existing code. Finally, an addition to the indexing module that will give a better word-based representation of the text is discussed.

The Document Retrieval System DR-LINK


The Process of Indexing

Indexing is the process of identifying key terms in the text and storing them in a structure that gives a better word-based representation of the text. There are three steps in the indexing process:
  1. Token Extraction
  2. Sorting and Merging
  3. TFIDF Value Computation
From the TFIDF values, a vector can be created for each document. The relevance of a document to a query is then determined from the relationship between the query's vector and the document's vector.
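The three indexing steps and the vector comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the DR-LINK code: the tokenizer is a plain regular expression, the weight is the common textbook form tf * log(N / df), and the "relationship" between vectors is assumed to be cosine similarity, which the text does not specify. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def extract_tokens(text):
    """Step 1: split text into lowercase word tokens (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Steps 2 and 3: merge per-document counts, then compute TFIDF weights.

    docs maps a document id to its raw text. Returns (weights, idf), where
    weights maps each document id to a {term: weight} vector.
    """
    # Step 2: count terms per document, then merge the sorted distinct
    # terms into document frequencies (documents containing each term).
    term_counts = {d: Counter(extract_tokens(t)) for d, t in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(sorted(counts))  # one increment per distinct term

    # Step 3: weight = tf * log(N / df). This formula is an assumption.
    n_docs = len(docs)
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    weights = {
        d: {term: tf * idf[term] for term, tf in counts.items()}
        for d, counts in term_counts.items()
    }
    return weights, idf


def query_vector(query, idf):
    """Weight query terms with the collection's idf; unseen terms get 0."""
    counts = Counter(extract_tokens(query))
    return {term: tf * idf.get(term, 0.0) for term, tf in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With this sketch, a query is scored against every document by calling `cosine(query_vector(query, idf), weights[doc_id])` and ranking the documents by the result.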

The Parallel Implementation

  1. Token Extraction
  2. Sorting and Merging
  3. TFIDF Value Computation
The parallel implementation was ported to the IBM SP-1 using EUI message passing. The key issues in this parallel implementation are load balancing and parallel file I/O. If these two factors are controlled, the parallel version should show marked improvement over the sequential version.
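To make the load-balancing concern concrete, here is a minimal Python sketch of one way documents could be divided among processors so that each receives a similar amount of text. It is an assumption for illustration only: the text does not say how the work was actually partitioned, no EUI calls are shown, and the function name and the greedy largest-first heuristic are hypothetical choices.

```python
import heapq


def balance_documents(doc_sizes, n_workers):
    """Assign documents to workers so per-worker totals are roughly even.

    doc_sizes maps a document id to its size (for example, bytes).
    Greedy heuristic: visit documents from largest to smallest and give
    each one to the worker with the smallest load so far.
    Returns (assignment, loads), both indexed by worker number.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")

    # Min-heap of (current load, worker number).
    heap = [(0, w) for w in range(n_workers)]
    assignment = [[] for _ in range(n_workers)]

    # Ties on size are broken by document id to keep the result stable.
    ordered = sorted(doc_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for doc_id, size in ordered:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(doc_id)
        heapq.heappush(heap, (load + size, worker))

    loads = [0] * n_workers
    for load, worker in heap:
        loads[worker] = load
    return assignment, loads
```

Each worker would then run token extraction only on its own list of documents. Splitting by total size rather than by document count matters when document lengths vary widely, since the slowest worker determines the overall running time.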

Head Modifier Constructs

Click here to see results of token extraction on a sample of text.

Scott L. Alexander: Research Apprentice, 1994 NPAC REU Program; email: salexand@npac.syr.edu. Dual Major in Computer Science and Mathematics, St. Bonaventure University; email: alexande@sbu.edu.

Nancy J. McCracken: Project Leader, NPAC, Syracuse University; email: njm@npac.syr.edu.