Scott Alexander
Advisor: Dr. Nancy McCracken
Abstract
A parallel system for finding and indexing key terms in a document is
described. With the constantly increasing amount and availabilily of
information, a reliable and efficient method of document retrieval is quickly
becoming a necessity for research. DR-LINK is one such document retrieving
tool that is able to return a list of relevant documents to a user's query.
One way that this is accomplished is by comparing the key terms in the query
to the key terms in the documents, a process called indexing. It is shown
that the efficiency of indexing terms in increased by developing a parallel
implementation of the existing code. Finally, an addition to the indexing
module which will give a better word-based representation of the text is
discussed.
The Document Retrieval System DR-LINK
- DR-LINK = Document Retrieval using Linguistic Knowledge
- Developed by the School of Information Studies at Syracuse University
- Processes and tests documents and queries as formatted and distributed
by the Text Retrieval Conferences (TREC) project which is funded by
the Advanced Research Projects Agency (ARPA)
- DR-LINK's goal: to represent and match documents and queries at the
various linguistic levels at which human language conveys meaning
- Processes documents and queries in stages; it is not a single-step full
parse
- Has separate modules which can process and represent the text in different
ways
- This allows DR-LINK to detect concepts and relations in the text at the
different levels of human language
- Indexing detects the lexical, or word, relations between query and
document
The Process of Indexing
The process of identifying key terms in the text and storing these key terms
in some structure in order to get a better word-based representation of the
text is called indexing . There are three steps in the indexing
process:
- Token Extraction
- The text is scanned for key terms, and there are four types of lexical
entities extracted as key terms:
- Proper Nouns
- Proper Noun categories (i.e. Bill Clinton --> person )
- Complex Nominals -- phrases (i.e. information processing)
- Single Words -- nouns, verbs, and adjectives
- The name of the term, the name of the document the term occurs in, and
the number of times the term occurs in that document are stored in an
index table.
- Sorting and Merging
- The index table is sorted alphabetically, and terms with the same name
are merged into a single data structure.
- TFIDF Value Computation
- A TFIDF (Term Frequency / Inverted Document Frequency) value is computed
for each term. This value will be a "weight" for a term in a particular
document.
A vector can be created from this information for each document. The
relevance of a document to a query will be determined from the relationship
between the query's vector and a document's vector.
The Parallel Implementation
- Token Extraction
- Each processor will process a block of documents in a file
- Each processor will write an extracted token, the document the token is
found in, and the number of times the term occurs in that document to
an output file
- There will be 16 output files for each processor separated alphabetically
- Sorting and Merging
- Each processor takes a block of files to sort
- Token extraction writes to 16 files alphabetically. These 16 output files
should give a uniform distribution of terms among them: Go.tfd Nn.tfd
Yd.tfd as.tfd cg.tfd d_.tfd ex.tfd gn.tfd in.tfd m_.tfd nt.tfd pp.tfd
re.tfd so.tfd to.tfd zz.tfd
- N processors can sort these files sequentially, and the run-time
should be N times faster than the run-time of a sequential
sort on one large output file
- Terms with the same name are then merged into a variant record for each
unique term
- TFIDF Value Computation
- Again, each node will process a group of files that result from merging.
- This part takes two passes to compute the TFIDF value for each term
- A variant record hold the information on each term: name of the term,
number of documents the term occurs in, name of each document the term
occurs in, and TFIDF value for the term in each document
The parallel implementation was ported to the IBM SP-1 using EUI message
passing. The key issues in this parallel implementation are load balancing
and parallel file I/O. If these two factors are controlled, the parallel
version should show marked improvement over the sequential version.
Head Modifier Constructs
- Phrases give the best word-based representation of the text. Single
words are not accurate enough.
- There is more than one way to extract a meaningful term from the text.
Consider these examples:
- orchestra director
- director of the orchestra
Both phrases contain a form of director + orchestra
- In the present indexing module, example 1 is extracted as a key term, but
in example two director and orchestra are extracted as
single terms. By extracting a key term from constructs such as example
2, it was speculated that a better word-based representation of the text
would result.
- Results of this implementation:
- test data: 2 days of Wall Street Journals
- number of articles: 86
- head modifier constructs extracted: 636
- meaningful head modifier constructs extracted: 481 (75.6%)
- meaningful head modifiers per article: 5.59
- complex nominals, proper nouns, and proper noun categories extracted: 3874
- complex nominals, proper nouns, and proper noun categories per article: 45.05
- If there are 15 complex nominals and 15 proper nouns extracted per
article, and these are the most meaningful key terms extracted, then
by adding almost 6 more meaningful terms per article from this head
modifier construct would surely give a better word-based representation
of the text.
Click here to see results of token extraction
on a sample of text .
Scott L. Alexander ; Research Apprentice, 1994 NPAC REU Program;
email: salexand@npac.syr.edu. Dual Major in Computer Science and Mathematics,
St. Bonaventure University; email: alexande@sbu.edu.
Nancy J. McCracken: Project Leader, NPAC, Syracuse University;
email: njm@npac.syr.edu.