An intelligent editor for natural language processing of unrestricted text
- Date Issued:
- 1999
- Abstract/Description:
- University of Central Florida College of Arts and Sciences Thesis; The understanding of natural language by computational methods has been a continuing and elusive problem in artificial intelligence. In recent years there has been a resurgence in natural language processing research. Much of this work has been on empirical or corpus-based methods which use a data-driven approach to train systems on large amounts of real language data. Using corpus-based methods, the performance of part-of-speech (POS) taggers, which assign to the individual words of a sentence their appropriate part-of-speech category (e.g., noun, verb, preposition), now rivals human performance levels, achieving accuracies exceeding 95%. Such taggers have proved useful as preprocessors for such tasks as parsing, speech synthesis, and information retrieval. Parsing remains, however, a difficult problem, even with the benefit of POS tagging. Moreover, as sentence length increases, there is a corresponding combinatorial explosion of alternative possible parses. Consider the following sentence from a New York Times online article: After Salinas was arrested for murder in 1995 and lawyers for the bank had begun monitoring his accounts, his personal banker in New York quietly advised Salinas' wife to move the money elsewhere, apparently without the consent of the legal department. To facilitate parsing and other tasks, we would like to decompose this sentence into the following three shorter sentences which, taken together, convey the same meaning as the original: 1. Salinas was arrested for murder in 1995. 2. Lawyers for the bank had begun monitoring his accounts. 3. His personal banker in New York quietly advised Salinas' wife to move the money elsewhere, apparently without the consent of the legal department. This study investigates the development of heuristics for decomposing such long sentences into sets of shorter sentences without affecting the meaning of the original sentences. 
Without parsing or semantic analysis, heuristic rules were developed based on: (1) the output of a POS tagger (Brill's tagger); (2) the punctuation contained in the input sentences; and (3) the words themselves. The heuristic algorithms were implemented in an intelligent editor program which first augmented the POS tags and assigned tags to punctuation, and then tested the rules against a corpus of 25 New York Times online articles containing approximately 1,200 sentences and over 32,000 words, with good results. Recommendations are made for improving the algorithms and for continuing this line of research.
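The kind of decomposition described in the abstract can be illustrated with a small sketch. It is not the thesis's rule set, which this record does not reproduce. It assumes hand-assigned Penn Treebank-style tags, a shortened version of the example sentence, and two hypothetical rules: split a leading subordinate clause at its comma, and split coordinated clauses at a conjunction when both sides contain a finite verb. All names (`SUBORDINATORS`, `decompose`, `render`) are invented for illustration.

```python
# Illustrative sketch only: these rules are assumptions, not the thesis's heuristics.
SUBORDINATORS = {"after", "before", "when", "while", "because", "although"}
FINITE_VERB_TAGS = {"VBD", "VBZ", "VBP", "MD"}  # Penn Treebank-style tags


def has_finite_verb(tokens):
    """True if any (word, tag) pair carries a finite-verb tag."""
    return any(tag in FINITE_VERB_TAGS for _, tag in tokens)


def split_on_coordination(tokens):
    """Split at a coordinating conjunction (CC) when both sides have a finite verb."""
    for i, (_, tag) in enumerate(tokens):
        if tag == "CC" and has_finite_verb(tokens[:i]) and has_finite_verb(tokens[i + 1:]):
            return [tokens[:i]] + split_on_coordination(tokens[i + 1:])
    return [tokens]


def decompose(tokens):
    """Return a list of clauses (token lists) for one tagged sentence."""
    if tokens and tokens[-1][1] == ".":
        tokens = tokens[:-1]  # drop sentence-final punctuation
    if tokens and tokens[0][0].lower() in SUBORDINATORS:
        for i, (_, tag) in enumerate(tokens):
            # Use the first comma that separates two clauses with finite verbs.
            if tag == "," and has_finite_verb(tokens[1:i]) and has_finite_verb(tokens[i + 1:]):
                return split_on_coordination(tokens[1:i]) + [tokens[i + 1:]]
    return [tokens]


def render(clause):
    """Naive detokenizer: join words, capitalize the first, add a period."""
    text = " ".join(word for word, _ in clause)
    return text[0].upper() + text[1:] + "."


# Shortened example sentence with hand-assigned tags in word/TAG form.
RAW = (
    "After/IN Salinas/NNP was/VBD arrested/VBN for/IN murder/NN in/IN 1995/CD "
    "and/CC lawyers/NNS for/IN the/DT bank/NN had/VBD begun/VBN monitoring/VBG "
    "his/PRP$ accounts/NNS ,/, his/PRP$ banker/NN advised/VBD his/PRP$ wife/NN "
    "to/TO move/VB the/DT money/NN ./."
)
TAGGED = [tuple(item.rsplit("/", 1)) for item in RAW.split()]

sentences = [render(clause) for clause in decompose(TAGGED)]
for sentence in sentences:
    print(sentence)
```

Under these assumptions the sketch yields three short sentences that mirror the decomposition in the abstract. A practical system would need many more rules, and real tagger output would contain errors that hand-assigned tags do not.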
Title: | An intelligent editor for natural language processing of unrestricted text. |
---|---|---|
Name(s): |
Glinos, Demetrios George, Author; Gomez, Fernando, Committee Chair; Arts and Sciences, Degree Grantor |
|
Type of Resource: | text | |
Date Issued: | 1999 | |
Publisher: | University of Central Florida | |
Language(s): | English | |
Identifier: | CFR0008181 (IID), ucf:53055 (fedora) | |
Note(s): |
1999-08-01; M.S.; Computer Science; Masters; This record was generated from author-submitted information. Electronically reproduced by the University of Central Florida from a book held in the John C. Hitt Library at the University of Central Florida, Orlando. |
|
Subject(s): |
Arts and Sciences -- Dissertations, Academic; Computational linguistics; Dissertations, Academic -- Arts and Sciences; Artificial intelligence |
|
Persistent Link to This Record: | http://purl.flvc.org/ucf/fd/CFR0008181 | |
Restrictions on Access: | public | |
Host Institution: | UCF |