You are here

An intelligent editor for natural language processing of unrestricted text

Download pdf | Full Screen View

Date Issued:
1999
Abstract/Description:
University of Central Florida College of Arts and Sciences Thesis; The understanding of natural language by computational methods has been a continuing and elusive problem in artificial intelligence. In recent years there has been a resurgence in natural language processing research. Much of this work has been on empirical or corpus-based methods which use a data-driven approach to train systems on large amounts of real language data. Using corpus-based methods, the performance of part-of-speech (POS) taggers, which assign to the individual words of a sentence their appropriate part of speech category (e.g., noun, verb, preposition), now rivals human performance levels, achieving accuracies exceeding 95%. Such taggers have proved useful as preprocessors for such tasks as parsing, speech synthesis, and information retrieval. Parsing remains, however, a difficult problem, even with the benefit of POS tagging. Moveover, as sentence length increases, there is a corresponding combinatorial explosing of alternative possible parses. Consider the following sentence from a New York Times online article: After Salinas was arrested for murder in 1995 and lawyers for the bank had begun monitoring his accounts, his personal banker in New York quietly advised Salinas' wife to move the money elsewhere, apparently without the consent of the legal department. To facilitate the parsing and other tasks, we would like to decompose this sentence into the following three shorter sentences which, taken together, convey the same meaning as the original: 1. Salinas was arrested for murder in 1995. 2. Lawyers for the bank had begun monitoring his accounts. 3. His personal banker in New York quietly advised Salinas' wife to move the money elsewhere, apparently without the consent of the legal department. This study investigates the development of heuristics for decomposing such long sentences into sets of shorter sentences without affecting the meaning of the original sentences. Without parsing or semantic analysis, heuristic rules were developed based on: (1) the output of a POS tagger (Brill's tagger); (2) the punctuation contained in the input sentences; and (3) the words themselves. The heuristic algorithms were implemented in an intelligent editor program which first augmented the POS tags and assigned tags to punctuation, and then tested the rules against a corpus of 25 New York Times online articles containing approximately 1,200 sentences and over 32,000 words, with good results. Recommendations are made for improving the algorithms and for continuing this line of research.
Title: An intelligent editor for natural language processing of unrestricted text.
42 views
14 downloads
Name(s): Glinos, Demetrios George, Author
Gomez, Fernando, Committee Chair
Arts and Sciences, Degree Grantor
Type of Resource: text
Date Issued: 1999
Publisher: University of Central Florida
Language(s): English
Abstract/Description: University of Central Florida College of Arts and Sciences Thesis; The understanding of natural language by computational methods has been a continuing and elusive problem in artificial intelligence. In recent years there has been a resurgence in natural language processing research. Much of this work has been on empirical or corpus-based methods which use a data-driven approach to train systems on large amounts of real language data. Using corpus-based methods, the performance of part-of-speech (POS) taggers, which assign to the individual words of a sentence their appropriate part of speech category (e.g., noun, verb, preposition), now rivals human performance levels, achieving accuracies exceeding 95%. Such taggers have proved useful as preprocessors for such tasks as parsing, speech synthesis, and information retrieval. Parsing remains, however, a difficult problem, even with the benefit of POS tagging. Moveover, as sentence length increases, there is a corresponding combinatorial explosing of alternative possible parses. Consider the following sentence from a New York Times online article: After Salinas was arrested for murder in 1995 and lawyers for the bank had begun monitoring his accounts, his personal banker in New York quietly advised Salinas' wife to move the money elsewhere, apparently without the consent of the legal department. To facilitate the parsing and other tasks, we would like to decompose this sentence into the following three shorter sentences which, taken together, convey the same meaning as the original: 1. Salinas was arrested for murder in 1995. 2. Lawyers for the bank had begun monitoring his accounts. 3. His personal banker in New York quietly advised Salinas' wife to move the money elsewhere, apparently without the consent of the legal department. This study investigates the development of heuristics for decomposing such long sentences into sets of shorter sentences without affecting the meaning of the original sentences. Without parsing or semantic analysis, heuristic rules were developed based on: (1) the output of a POS tagger (Brill's tagger); (2) the punctuation contained in the input sentences; and (3) the words themselves. The heuristic algorithms were implemented in an intelligent editor program which first augmented the POS tags and assigned tags to punctuation, and then tested the rules against a corpus of 25 New York Times online articles containing approximately 1,200 sentences and over 32,000 words, with good results. Recommendations are made for improving the algorithms and for continuing this line of research.
Identifier: CFR0008181 (IID), ucf:53055 (fedora)
Note(s): 1999-08-01
M.S.
Computer Science
Masters
This record was generated from author submitted information.
Electronically reproduced by the University of Central Florida from a book held in the John C. Hitt Library at the University of Central Florida, Orlando.
Subject(s): Arts and Sciences -- Dissertations
Academic
Computational linguistics
Dissertations
Academic -- Arts and Sciences
Artificial intelligence
Persistent Link to This Record: http://purl.flvc.org/ucf/fd/CFR0008181
Restrictions on Access: public
Host Institution: UCF

In Collections