
NLP - natural language regex-collections?

by erix (Prior)
on Oct 16, 2004 at 22:03 UTC ( #399831=perlquestion )

erix has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I am going to make a regex collection for capturing specific (english) language constructs. These can then be used to parse/index/search texts. If such a regex-collection is large and general enough, it should be possible to collect and organise them without knowing the precise form of the text beforehand. My experience with science-like articles (which are the target) is that the text and style are often repetitive, almost monotonous (not meant negatively here).

My question is: would something like a Natural Language regex collection already be in existence? I know Regexp::Common &c, but they all seem to be very much more specialized than what I was hoping to find.

I'd be thankful for pointers or further ideas.
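A minimal sketch of what such a collection might look like, in the spirit of Regexp::Common: a hash of named patterns for constructs common in scientific prose. The pattern names and the regexes themselves are purely illustrative assumptions, not an existing module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical collection of named patterns for science-prose
# constructs. Names and regexes are illustrative only.
my %NL = (
    # "Smith et al. (1999)" style citations
    citation   => qr/\b[A-Z][a-z]+ et al\.? \(\d{4}\)/,
    # hedging phrases common in scientific writing
    hedge      => qr/\b(?:suggests? that|appears? to|may indicate)\b/i,
    # simple definitions: "X is defined as ..."
    definition => qr/\b(\w+) is defined as\b/i,
);

my $text = "Smith et al. (1999) suggests that entropy is defined as disorder.";
for my $name (sort keys %NL) {
    print "$name matched\n" if $text =~ $NL{$name};
}
```

Queries against real text would then consist of looking up patterns by name, which keeps the collection organised independently of any particular document.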


Replies are listed 'Best First'.
Re: NLP - natural language regex-collections?
by Zaxo (Archbishop) on Oct 16, 2004 at 22:10 UTC

    Take a look at the Lingua namespace on CPAN.

    After Compline,

Re: NLP - natural language regex-collections?
by perlcapt (Pilgrim) on Oct 17, 2004 at 00:54 UTC
    I have played with something similar while developing a ship command and navigation simulator. (Eventually to have voice recognition and generation I/O, but currently just text based.) The experience that I'm drawing on is a CAI (Computer Aided Instruction) system that was the rage in the 70's: Plato V.

    The problem which they solved was interpretation of free form text into logical relationships of key words. Essentially a thesaurus that worked from many to one. The variety of logical statements that might be recognized were written with the key words. The free text was parsed into key words.

    This was amazingly effective. Uncanny for the users. The implementation is simple in Perl, using its text-parsing power and hashes. I'll dig around and see what Perl I have for this.


    I just started looking at the Lingua:: modules. There is a lot there. It certainly is a good place to start. Anyone have any experience with these modules?
      Thesaurus mapping many to one. That is indeed where I expect the best possibilities. A thesaurus that includes multiword phrases, up to sentences. I was thinking of a database that just stores all sentences it encounters, minus some pre-storage streamlining via stemming and problem domain jargon identification.

      But I know from experience that it is easier to talk about it than to implement useful code :)

      I must take a closer look at the Lingua:: stuff; it will take some time. It seems that most of it is word-, not phrase- or sentence-based (as I was hoping).

      Thanks. I will let you know what I find.
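The pre-storage streamlining mentioned above could start as simply as lowercasing, stripping punctuation, and crude suffix-stripping. The `normalize` routine below is a naive stand-in for a real stemmer (something like Lingua::Stem from CPAN would do this properly); it only illustrates the idea.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Naive sentence normalizer: lowercase, strip punctuation,
# crude suffix removal. Illustration only; a real system
# would use a proper stemming module.
sub normalize {
    my ($sentence) = @_;
    my @out;
    for my $w (split ' ', lc $sentence) {
        $w =~ s/[^a-z]//g;    # drop punctuation and digits
        next unless length $w;
        # strip a few common suffixes from longer words
        $w =~ s/(?:ingly|edly|ing|ed|ly|s)$// if length($w) > 4;
        push @out, $w;
    }
    return join ' ', @out;
}

print normalize("The results strongly suggested new bindings."), "\n";
```

Storing the normalized form alongside the original sentence would let near-duplicate sentences collapse onto the same key.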
Re: NLP - natural language regex-collections?
by kvale (Monsignor) on Oct 17, 2004 at 00:58 UTC
    English grammar is far more complex than the languages spanned by any small set of regexes, so I suspect that you will not find the precise set you are looking for. There are some modules that look at coarse text structure, such as Text::Sentence, which would be useful in preprocessing your data.

    Your best bet is probably to study some example scientific prose that you are interested in and identify a small set of patterns that work for you. Then distill regexes to fit those and only those.

    Most information retrieval algorithms focus on keywords and that may be good enough for your app; consider this option first. Keywords are much easier to parse than phrases or sentences. They are the simplest if you want to get something up and running quickly.
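As a baseline, the keyword approach described above can be a few lines of Perl: build a toy inverted index mapping each keyword to the documents containing it. The document IDs and text here are made up for illustration; a real system would add stemming and a stopword list.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy inverted index: keyword => { doc_id => count }.
# Sample documents are invented for the example.
my %docs = (
    d1 => 'the enzyme binds the substrate',
    d2 => 'the substrate concentration was measured',
);

my %index;
while (my ($id, $text) = each %docs) {
    # crude stopword filter: skip very short words
    $index{$_}{$id}++ for grep { length > 3 } split ' ', lc $text;
}

# Query: which documents mention "substrate"?
my @hits = sort keys %{ $index{substrate} || {} };
print "substrate: @hits\n";
```

Even this much is often enough to locate relevant passages quickly, before any phrase-level machinery is attempted.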

    There is a branch of computational linguistics called text summarization and there is quite a bit of work in the machine intelligence community devoted to extracting essential content automatically from text. These programs are big, expensive and many man-years of work in the making.


      I have implemented a simple keyword approach, and it works well. But that approach only goes so far.

      The more high-level/multiword/sentence-based my regexes become, the more they leave my specific problem domain and slide towards the general language domain. This effect must surely be encountered by everyone who writes programs to capture text. One would hope that, eventually, we could find one another's regexes useful, at least to some degree.

      Yes, there are no doubt big expensive programs, but I am doing this for myself. No budget, I am afraid. I have many etexts; I want them accessible on a conceptual level.


Re: NLP - natural language regex-collections?
by pmtolk (Acolyte) on Oct 17, 2004 at 10:15 UTC
    I think you might benefit from stemming
      Stemming certainly must be fitted in at some point. It would be part of above-mentioned streamlining while gathering regexes from real-life, published sentences.

      Thanks for those links. I still have to study on several fronts, I'm afraid...
Re: NLP - natural language regex-collections?
by dragonchild (Archbishop) on Oct 17, 2004 at 14:49 UTC
    I'd like to point you to the discussion at X-Prize: Natural Language Processing.

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Re: NLP - natural language regex-collections?
by hsmyers (Canon) on Oct 17, 2004 at 14:36 UTC
    You should get a copy of The General Inquirer: A Computer Approach to Content Analysis by Philip J. Stone, Dexter C. Dunphy, Marshall S. Smith, and Daniel M. Ogilvie. MIT Press, LCCN 66-22541 published in 1966. It describes software that pretty much does what you are talking about. You might also try googling with this as a starting point to find more current projects. In general a great deal of similar work has been done not in the computer department, but rather on the humanities side of campus. In fact a journal you might want to read is Computers and the Humanities (or similar, it has been a long time) since it dealt specifically with content analysis. Likewise googling with search terms of +COMPUTER and +CONCORDANCE should be interesting as well.


    "Never try to teach a pig to wastes your time and it annoys the pig."
Re: NLP - natural language regex-collections?
by perlcapt (Pilgrim) on Oct 17, 2004 at 17:34 UTC
    I couldn't find the stuff I had done before; it was probably pretty sad Perl anyway, since I wrote it back in '96. Here, however, is some stuff to start with that I hacked just now. (I'll continue with this, but probably in my own direction.) The thesaurus file is merely a list of keywords followed by synonyms. The newWords should probably be appended to the thesaurus file with an appropriate # comment for each word.

    A thought: it might be possible to get some first-take synonym solutions by parsing the returned text from a dictionary site.

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    my $thesaurusPath = "";
    my $thesaurusFile = "thesaurus.nav";
    my $ignore    = '//';   # key to ignore in the thesaurus file
    my $thesaurus = {};
    my $newWords  = {};

    # read in the thesaurus word list
    open CMDS, "<${thesaurusPath}${thesaurusFile}"
        or die "cannot open \"$thesaurusFile\" for reading";
    my $line;
    while ($line = <CMDS>) {
        chomp($line);
        # remove comments and leading and trailing space
        $line =~ s/\s*\#.*$//;
        $line =~ s/^\s+//;
        $line =~ s/\s+$//;
        next if not length($line);
        # uppercase only
        $line =~ tr/a-z/A-Z/;
        # break the list apart and stash it
        my @words = split(/\s/, $line);
        # key word to which the others resolve is $words[0], the first one
        for (@words) {
            $thesaurus->{$_} = $words[0];
        }
    }
    close CMDS;

    # now let's see what results with rewriting
    print "> ";
    while (<>) {
        my @result = ();
        my @words  = ();
        chomp;
        tr/a-z/A-Z/;        # uppercase only
        last if /^\s*$/;    # end if no input
        @words = split;
        for (@words) {
            if (not defined $thesaurus->{$_}) {
                # increment or create entry for new word
                if (defined $newWords->{$_}) {
                    ++$newWords->{$_};
                } else {
                    $newWords->{$_} = 1;
                }
                push(@result, "?$_?");    # flag it in output
            } else {
                next if $thesaurus->{$_} eq $ignore;
                push(@result, $thesaurus->{$_});
            }
        }
        print join(" ", @result), "\n";
        print "> ";
    }
    print "These words were not recognized:\n";
    for (keys %$newWords) {
        print "$_\t\t$newWords->{$_}\n";
    }
    exit;
Re: NLP - natural language regex-collections? - Lingua
by erix (Prior) on Oct 19, 2004 at 18:18 UTC
Re: NLP - natural language regex-collections?
by allolex (Curate) on Oct 19, 2004 at 22:56 UTC

    Hi Eric. You might consider looking into Andrei Mikheev's article on text segmentation in Handbook of Computational Linguistics and the chapter on parsing in the same book.

    If you can give me some concrete examples of what you are looking to do, I might be able to scare up some info for you. I have to say that regular expressions are often not the best way to deal with linguistic data. Perl is also a bit slow for heavy parsing and segmenting -- especially if you use Parse::RecDescent ;) -- but it's definitely a good place to start.

    @INBOOK{mikheev2002text,
      chapter   = {10},
      pages     = {201-218},
      title     = {Text Segmentation},
      publisher = {Oxford University Press},
      year      = {2002},
      editor    = {Ruslan Mitkov},
      author    = {Andrei Mikheev},
      address   = {Oxford},
    }

    @BOOK{mitkov2002handbook,
      title     = {Handbook of Computational Linguistics},
      publisher = {Oxford University Press},
      year      = {2002},
      editor    = {Ruslan Mitkov},
    }

    Damon Allen Davison

Re: NLP - natural language regex-collections?
by mattr (Curate) on Oct 20, 2004 at 15:58 UTC
    It sounds like you want to scan for some simple grammatical constructs, like maybe subject-verb-object, etc. Maybe the above links can help you. This is a field where you can get sucked deeper and deeper, which is great if you are interested in it. Though I am not a computational linguist by a very long shot, it sounds like you might want to start with a tagger, so you can tell what parts of speech you have; head-driven parsers are also gaining a lot of attention. There are now a lot more linguistic resources in CPAN than there were just months ago.

    You might like to check out The GATE Project at the University of Sheffield's natural language processing group.

    (GATE = General Architecture for Text Engineering)

    Also see the resource lists from Statistical NLP at Stanford U., Tokushima U., and the NL Software Registry. You will find lots of links if you spend time searching for the phrase in quotes, "Natural Language Processing", or maybe "Information Extraction". Just searching for NLP or IE will not be so useful.

    Incidentally, I don't know if this will help you, but if you read the GATE Guide (i.e. the Tao of Gate book), you may find the chapters on the ANNIE information extraction engine and JAPE interesting ("JAPE allows you to recognise regular expressions in annotations on documents"). It likes Java, though; if anyone knows about GATE usage with Perl, I'm interested in hearing about it.

    How about reporting back on how your work goes?

Node Type: perlquestion [id://399831]
Approved by Arunbear
Front-paged by Old_Gray_Bear