NLP - natural language regex-collections?

http://qs321.pair.com?node_id=399831

erix has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I am going to make a regex collection for capturing specific (English) language constructs. These can then be used to parse/index/search texts. If such a regex collection is large and general enough, it should be possible to collect and organise the constructs without knowing the precise form of the text beforehand. My experience with science-like articles (which are the target) is that the text and style are often repetitive, almost monotonous (not meant negatively here).

My question is: would something like a Natural Language regex collection already be in existence? I know Regexp::Common &c, but they all seem to be very much more specialized than what I was hoping to find.

I'd be thankful for pointers or further ideas.
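To make the idea concrete, here is a minimal sketch of what such a collection might look like: a hash of named, precompiled patterns plus one routine that reports which constructs occur in a text. The construct names (`measurement`, `comparison`, `citation`), the patterns and the `find_constructs` helper are all assumptions made up for illustration; they do not come from any existing module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical collection: each name maps to a compiled pattern.
# Names and patterns are illustrative only, not an existing library.
my %construct = (
    measurement => qr/\b(\d+(?:\.\d+)?)\s*(mg|ml|mm|kg)\b/i,
    comparison  => qr/\b(higher|lower|greater|smaller)\s+than\b/i,
    citation    => qr/\(([A-Z][a-z]+(?:\s+et\s+al\.)?),?\s+(\d{4})\)/,
);

# Return a list of [name, matched text] pairs found in $text,
# grouped by construct name in alphabetical order.
sub find_constructs {
    my ($text) = @_;
    my @hits;
    for my $name (sort keys %construct) {
        while ($text =~ /$construct{$name}/g) {
            # @- and @+ hold the start and end offsets of the last match
            push @hits, [ $name, substr($text, $-[0], $+[0] - $-[0]) ];
        }
    }
    return @hits;
}

my $sample = 'Samples received 5 mg of the drug; '
           . 'yield was higher than reported (Smith et al., 2003).';
printf "%-12s %s\n", @$_ for find_constructs($sample);
```

Keeping the patterns in a hash keyed by name means new constructs can be added without touching the matching code, which is roughly how Regexp::Common organises its patterns as well.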

Re: NLP - natural language regex-collections?
by Zaxo (Archbishop) on Oct 16, 2004 at 22:10 UTC

Take a look at the Lingua namespace on CPAN.

After Compline,
Zaxo

Re: NLP - natural language regex-collections?
by perlcapt (Pilgrim) on Oct 17, 2004 at 00:54 UTC

The problem which they solved was interpretation of free form text into logical relationships of key words. Essentially a thesaurus that worked from many to one. The variety of logical statements that might be recognized were written with the key words. The free text was parsed into key words.

This was amazingly effective. Uncanny for the users. The implementation is simple in Perl, using its text-parsing power and hashes. I'll dig around and see what Perl I have for this.

Update:

Re^2: NLP - natural language regex-collections?

by erix (Prior) on Oct 17, 2004 at 07:21 UTC

I

Re: NLP - natural language regex-collections?
by kvale (Monsignor) on Oct 17, 2004 at 00:58 UTC

Your best bet is probably to study some example scientific prose that you are interested in and identify a small set of patterns that work for you. Then distill regexes to fit those and only those.

Most information retrieval algorithms focus on keywords and that may be good enough for your app; consider this option first. Keywords are much easier to parse than phrases or sentences. They are the simplest if you want to get something up and running quickly.
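As a rough illustration of the keyword route, the sketch below counts the words of a text after dropping stop words and returns the most frequent ones. The tiny stop-word list and the `keyword_counts` and `top_keywords` names are assumptions for the example; a real application would want a fuller list (Lingua::EN::StopWords is one candidate).

```perl
use strict;
use warnings;

# Tiny illustrative stop-word list; a real one would be much longer.
my %stop = map { $_ => 1 } qw(a an and are in is of on the to was were with);

# Count lower-cased words that are not stop words and have 3+ letters.
sub keyword_counts {
    my ($text) = @_;
    my %count;
    for my $word (lc($text) =~ /[a-z]+(?:-[a-z]+)*/g) {
        next if $stop{$word} or length($word) < 3;
        $count{$word}++;
    }
    return \%count;
}

# Return the $n most frequent keywords, ties broken alphabetically.
sub top_keywords {
    my ($text, $n) = @_;
    my $count  = keyword_counts($text);
    my @sorted = sort { $count->{$b} <=> $count->{$a} or $a cmp $b }
                 keys %$count;
    $n = @sorted if $n > @sorted;
    return @sorted[0 .. $n - 1];
}

print join(", ", top_keywords(
    "The enzyme binds the substrate; the enzyme is then released.", 3)), "\n";
```

Nothing here looks at phrases or sentence structure, which is exactly why it is quick to get running.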

There is a branch of computational linguistics called text summarization and there is quite a bit of work in the machine intelligence community devoted to extracting essential content automatically from text. These programs are big, expensive and many man-years of work in the making.

-Mark

Re^2: NLP - natural language regex-collections?

by erix (Prior) on Oct 17, 2004 at 07:55 UTC

~~at some point~~

Re: NLP - natural language regex-collections?
by pmtolk (Acolyte) on Oct 17, 2004 at 10:15 UTC

I think you might benefit from stemming: http://www.comp.lancs.ac.uk/computing/research/stemming/general/index.htm and http://www.perlmonks.org/?node_id=175245
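For readers unfamiliar with the idea, stemming reduces inflected forms to a common stem so that "binds" and "binding" index under the same key. The sketch below is a deliberately crude suffix stripper, written only to show the principle. It is not the Porter or Lancaster algorithm, and the suffix list and the `crude_stem` and `group_by_stem` names are assumptions for this example.

```perl
use strict;
use warnings;

# Crude suffix stripping, for illustration only. This is NOT the Porter
# or Lancaster algorithm; it just removes a few common English endings.
sub crude_stem {
    my ($word) = @_;
    $word = lc $word;
    # Longer suffixes are tried first; at least three letters must remain.
    for my $suffix (qw(ations ation ings ing edly ed es s)) {
        return $1 if $word =~ /^(.{3,})\Q$suffix\E$/;
    }
    return $word;
}

# Group the given words by their crude stem.
sub group_by_stem {
    my %group;
    push @{ $group{ crude_stem($_) } }, $_ for @_;
    return \%group;
}

my $groups = group_by_stem(qw(observed observing observations binds binding));
print "$_: @{ $groups->{$_} }\n" for sort keys %$groups;
```

A real stemmer handles many cases this one gets wrong (for example, "measure" and "measured" end up with different stems here), so treat it as a demonstration rather than something to index with.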

Re^2: NLP - natural language regex-collections?

by erix (Prior) on Oct 17, 2004 at 11:54 UTC

Re: NLP - natural language regex-collections?
by dragonchild (Archbishop) on Oct 17, 2004 at 14:49 UTC

X-Prize: Natural Language Processing

Being right, does not endow the right to be rude; politeness costs nothing.
Being unknowing, is not the same as being stupid.
Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Re: NLP - natural language regex-collections?
by hsmyers (Canon) on Oct 17, 2004 at 14:36 UTC

The General Inquirer: A Computer Approach to Content Analysis

Computers and the Humanities

--hsm

"Never try to teach a pig to sing...it wastes your time and it annoys the pig."

Re: NLP - natural language regex-collections?
by perlcapt (Pilgrim) on Oct 17, 2004 at 17:34 UTC

A thought: it might be possible to get some first-take synonym solutions by parsing the text returned from a dictionary site.

#!/usr/bin/perl -w
use strict;
use warnings;

my $thesaurusPath = "";
my $thesaurusFile = "thesaurus.nav";
my $ignore = '//'; # key to ignore in the thesaurus file
my $thesaurus = {};
my $newWords = {};

# read in the thesaurus word list
open my $cmds, '<', "${thesaurusPath}${thesaurusFile}" or die
    "cannot open \"$thesaurusFile\" for reading: $!";
while(my $line = <$cmds>){
    chomp($line);
    # remove comments and leading and trailing space
    $line =~ s/\s*\#.*$//;
    $line =~ s/^\s+//;
    $line =~ s/\s+$//;
    next if not length($line);

    # uppercase only
    $line =~ tr/a-z/A-Z/;
    
    # break the list apart on runs of whitespace and stash it
    # (splitting on a single \s would yield empty "words" between
    # consecutive spaces or tabs)
    my @words = split(/\s+/,$line);
        
    # key word to which the others resolve is $words[0], the first one
    for(@words) {
        $thesaurus->{$_} = $words[0];
    }
}
close $cmds;

# now let's see what results from the rewriting
print "> ";
while(<>) {
    
    my @result = ();
    my @words = ();

    chomp;
    
    tr/a-z/A-Z/; # uppercase only

    last if /^\s*$/; # end if no input
    
    @words = split;
    for(@words){
        if(not defined $thesaurus->{$_}){
            # increment or create entry for new word
            if(defined $newWords->{$_}) { ++ $newWords->{$_}; }
            else { $newWords->{$_} = 1; }
            
            push(@result,"?$_?"); # flag it in output
        }else{
            next if $thesaurus->{$_} eq $ignore;
            push(@result,$thesaurus->{$_});
        }
    }
    print join(" ",@result),"\n";
    print "> ";
}

print "These words were not recognized:\n";
for (keys %$newWords) {
    print "$_\t\t$newWords->{$_}\n";
}
exit;
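The script expects a thesaurus.nav file that was not posted. Judging from the parsing code, each line holds whitespace-separated words, the first being the key word the others resolve to, with `#` starting a comment and `//` as the key for words to ignore. The following self-contained sketch uses a made-up in-memory thesaurus in that format; the sample entries and the `build_thesaurus` and `rewrite` names are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical thesaurus contents; the real thesaurus.nav was not posted.
my $sample_thesaurus = <<'END';
# canonical word first, then the words that resolve to it
SHOW DISPLAY LIST PRINT
DELETE REMOVE ERASE
// THE A AN PLEASE   # words to drop from the output
END

# Build the many-to-one map: every word on a line maps to the first word.
sub build_thesaurus {
    my ($text) = @_;
    my %map;
    for my $line (split /\n/, $text) {
        $line =~ s/\s*#.*$//;               # strip comments
        my @words = split ' ', uc $line;    # split ' ' skips leading space
        next unless @words;
        $map{$_} = $words[0] for @words;
    }
    return \%map;
}

# Rewrite a sentence to canonical key words, flagging unknown words.
sub rewrite {
    my ($map, $sentence) = @_;
    my @out;
    for my $word (split ' ', uc $sentence) {
        my $key = $map->{$word};
        if    (!defined $key) { push @out, "?$word?" }   # not in thesaurus
        elsif ($key ne '//')  { push @out, $key }        # '//' means ignore
    }
    return join ' ', @out;
}

my $map = build_thesaurus($sample_thesaurus);
print rewrite($map, 'please list the files'), "\n";
```

With this sample, "please" and "the" are dropped, "list" resolves to SHOW, and "files" is flagged as unknown because no line of the thesaurus mentions it.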

Re: NLP - natural language regex-collections? - Lingua
by erix (Prior) on Oct 19, 2004 at 18:18 UTC

I looked at all the Lingua module docs to find the ones that can be useful in the context of this thread: parsing or generating (English) language constructs.

I have excluded all modules for languages other than English, French, or German.

One module stands out from the others: Lingua::LinkParser is a wrapper for the LINK parser (downloadable here, code included), a parser written in C whose API the Perl module uses. I haven't used the wrapper yet, but I did install the parser itself; it compiled without problems on win2k with vc6. It has a shell that makes it easy to get started, and the parsing seems very advanced (first impression).

This is a work in progress; I'll continue adding to it, as these and other modules are examined. (Regexp::, Parser::, etc. will follow)

Language level

Lingua::Ident Statistical language identification

Lingua::Identify Language identification

Lingua::Preferred Pick a language based on user's preferences

Phrase/sentence/syntax level

Lingua::CollinsParser Head-driven syntactic sentence parser

Lingua::CollinsParser::Node Syntax tree node

Lingua::Conjunction Convert lists into conjunctions

Lingua::EN::Sentence Module for splitting text into sentences.

Lingua::EN::Splitter Split text into words, paragraphs, segments, and tiles

Lingua::EN::Squeeze Shorten english text for Pagers/GSM phones

Lingua::LinkParser Link Grammar Parser by Sleator, Temperley and Lafferty at CMU

Lingua::LinkParser::Definitions Extension providing text definitions for link types

Lingua::LinkParser::Dictionary

Lingua::LinkParser::Linkage

Lingua::LinkParser::Linkage::Sublinkage

Lingua::LinkParser::Linkage::Sublinkage::Link

Lingua::LinkParser::Linkage::Word

Lingua::LinkParser::MatchPath Match paths in linkage diagrams

Lingua::LinkParser::MatchPath::BuildSM

Lingua::LinkParser::MatchPath::Lex

Lingua::LinkParser::MatchPath::Parser

Lingua::LinkParser::MatchPath::SM

Lingua::LinkParser::MatchPath::SMContext

Lingua::LinkParser::Sentence

Lingua::LinkParser::Simple Perl extension for Link Parser - incomplete access to API

Lingua::EN::Segmenter Subdivide texts into passages that represent subtopics

Lingua::EN::Segmenter::Baseline Segment text randomly for baseline purposes

Lingua::EN::Segmenter::Evaluator Evaluate a segmenting method

Lingua::EN::Segmenter::TextTiling Segment text using the TextTiling method

Lingua::EN::Summarize::Filters Helper functions for the Summarize module

Lingua::EN::Summarize A simple tool for summarizing bodies of English text.

Lingua::EN::Tagger Part-of-speech tagger for English natural language processing.

Word level

Lingua::DE::ASCII Perl extension to convert german umlauts to and from ascii

Lingua::EN::StopWords Typical stop words for an English corpus

Lingua::EN::AddressGrammar grammar tree for Lingua::EN::AddressParse

Lingua::EN::AddressParse Manipulate geographical addresses

Lingua::EN::Dict BETA Version of XML english dictionary storage.

Lingua::EN::Fathom Readability measurements for English text

Lingua::EN::FindNumber Locate (written) numbers in English text

Lingua::EN::Gender Inflect pronouns for gender

Lingua::EN::Hyphenate Syllable based hyphenation

Lingua::EN::Infinitive Find infinitive of a conjugated word

Lingua::EN::Inflect English sing->plur, a/an, nums, participles

Lingua::EN::Inflect::Number Force number of words to singular or plural

Lingua::EN::Keywords Automatically extracts keywords from text

Lingua::EN::Tagger Part-of-speech tagger for English natural language processing.

Lingua::EN::Syllable Estimate syllable count in words

Lingua::EN::VerbTense

Lingua::Ispell Interface to the Ispell spellchecker

Lingua::LA::Stemmer Stemmer for Latin

Lingua::Lexicon::IDP OOP methods for Internet Dictionary Project

Human names

Lingua::EN::MatchNames Smart matching for human names

Lingua::EN::Nickname Genealogical nickname matching(Peggy=Midge)

Lingua::EN::NameCase Convert NAMES and names to Correct Case

Lingua::EN::Namegame Converts name to verse as in Name Game song

Lingua::EN::NamedEntity Basic Named Entity Extraction algorithm

Lingua::EN::NameGrammar grammar tree for Lingua::EN::NameParse

Lingua::EN::NameLookup a simple dictionary search and manipulation class.

Lingua::EN::NameParse Manipulate persons name

Numbers

Lingua::31337 P3RL M0DU1E 7O c0NVer7 7ext 7O C0o1 741k

Lingua::DE::Num2Word positive number to text convertor for german. Output

Lingua::DE::Sentence Perl extension for tokenizing german texts into their sentences.

Lingua::EN::Nums2Words

Lingua::EN::Numbers Converts numeric values into their English string equivalents.

Lingua::EN::WordsToNumbers convert numbers written in English to actual numbers

Lingua::EN::Numbers::Easy Hash access to Lingua::EN::Numbers objects.

Lingua::EN::Numbers::Ordinate go from cardinal (53) to ordinal (53rd)

Lingua::EN::Numericalize Replaces English descriptions of numbers with numerals

Lingua::EN::Words2Nums convert English text to numbers

Lingua::FR::Nums2Words Converts numbers to French words

Lingua::Num2Word wrapper for number to text conversion modules of

Lingua::Alignment stuff. I think it does alignment of two texts in different languages.

Lingua::Alignment

Lingua::AlignmentEval

Lingua::AlignmentSet handle a word-aligned bilingual corpus

Lingua::AlignmentSlice

Lingua::Features stuff. I think it is a framework for language description (completely 'meta'; no implementation)

Lingua::Features Natural languages features

Lingua::Features::Feature Feature object for Lingua::Features

Lingua::Features::FeatureType FeatureType object for Lingua::Features

Lingua::Features::Library Features library object for Lingua::Features

Lingua::Features::Structure Structure object for Lingua::Features

Lingua::Features::StructureType StructureType object for Lingua::Features

Lingua::Features::Tag Tag object for Lingua::Features

Lingua::Features::Type Type object for Lingua::Features

Lingua::Features::Value Value object for Lingua::Features

Other stuff (not useful for above-mentioned purpose):

Re: NLP - natural language regex-collections?
by allolex (Curate) on Oct 19, 2004 at 22:56 UTC

Hi Eric. You might consider looking into Andrei Mikheev's article on text segmentation in Handbook of Computational Linguistics and the chapter on parsing in the same book.

If you can give me some concrete examples of what you are looking to do, I might be able to scare up some info for you. I have to say that regular expressions are often not the best way to deal with linguistic data. Perl is also a bit slow for heavy parsing and segmenting -- especially if you use Parse::RecDescent ;) -- but it's definitely a good place to start.

@INBOOK{mikheev2002text,
  chapter = {10},
  pages = {201-218},
  title = {Text Segmentation},
  publisher = {Oxford University Press},
  year = {2002},
  editor = {Ruslan Mitkov},
  author = {Andrei Mikheev},
  address = {Oxford},
}

@BOOK{mitkov2002handbook,
  title = {Handbook of Computational Linguistics},
  publisher = {Oxford University Press},
  year = {2002},
  editor = {Ruslan Mitkov},
}

--
Damon Allen Davison
http://www.allolex.net

Re: NLP - natural language regex-collections?
by mattr (Curate) on Oct 20, 2004 at 15:58 UTC

You might like to check out The GATE Project at the University of Sheffield's natural language processing group.

(GATE = General Architecture for Text Engineering)

There are also resource lists from Statistical NLP at Stanford U., Tokushima U., and the NL Software Registry. You will find lots of links if you search for the quoted phrase "Natural Language Processing", or maybe "Information Extraction". Just searching for NLP or IE will not be so useful.

Incidentally, I don't know if this will help you, but if you read the GATE Guide (i.e. the Tao of Gate book), you may find the chapters on the ANNIE information extraction engine and JAPE interesting ("JAPE allows you to recognise regular expressions in annotations on documents"). It likes Java, though. If anyone knows about using GATE with Perl, I'm interested in hearing about it.

How about reporting back on how your work goes?
