PerlMonks  

arrays of arrays

by monkantar (Initiate)
on Sep 07, 2007 at 13:12 UTC ( [id://637646]=perlquestion )

monkantar has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I am a beginner in Perl and new to this website. For linguistic research, I need to convert text corpora with data such as:

the/article book/noun
he/pronoun is/verb ill/adjective

into:

the book article noun
he is ill pronoun verb adjective

I thought writing a script for this function would be a piece of cake, but that wasn't the case...

I wrote a simple toy program with words and numbers to start :
the input is:
book 1 pencil 2 desk 3

the output is:
1 2 3 book pencil desk

input.txt consists of following textline:

book 1 pencil 2 desk 3

the perlscript:

open(MYFILE, ">output.txt") or die ("can not write to file\n");
open(INFILE, "input.txt") or die ("can not open file\n");
$w = "[a-z]";
while($line = <INFILE>) {
    while($line =~ /(\d+)/g) {
        push(@nums, $1);
    }
    while($line =~ /($w+)/g) {
        push(@words, $1);
    }
}
push(@nums, @words);
foreach $token(@nums) {
    print(MYFILE "$token ");
}

This program works only for one line of text. At the moment I am reading about arrays of arrays, because I want the program to handle each line of a text separately and push the selected items to the end of that line, but I don't know whether this approach will work. Does anyone know, or has anyone written similar programs and can give me some advice on writing such a Perl script?

Thanks a lot!

Best regards

Monkantar

Replies are listed 'Best First'.
Re: arrays of arrays
by dwm042 (Priest) on Sep 07, 2007 at 15:28 UTC
    I've been a little confused about all this too, so I googled textcorpora and came up with this Wikipedia link:

    http://en.wikipedia.org/wiki/Text_corpus

    in which case you can see that the data:

    the/article book/noun he/pronoun is/verb ill/adjective
    are words tagged with the parts of speech they represent. So what he's wanting to do is deconvolute the word/part-of-speech pairs back into sentences followed by the equivalent parts of speech in the same order.

    the book article noun he is ill pronoun verb adjective
    So what he's wanting is a program that would see (\w+)\/(\w+) pairs, split them, push each into an array and once the parse is complete, 'emit' the data in sequential order, first the array of words and second the array of parts of speech. This word-space-number example is just a step on the way to get his textcorpora stuff working.

    That's the explanation; I hope it helps

    This is my code example:

    #!/usr/bin/perl
    use warnings;
    use strict;

    my @words;
    my @parts_of_speech;
    while(my $sentence = <DATA>) {
        @words = ();
        @parts_of_speech = ();
        while ($sentence =~ /(\w+)\/(\w+)/g ) {
            push(@words, $1) if $1;
            push(@parts_of_speech, $2) if $2;
        }
        print $_, " " for @words;
        print $_, " " for @parts_of_speech;
        print "\n";
    }
    __DATA__
    the/article book/noun
    he/pronoun is/verb ill/adjective

    The output is:

    C:\Code>perl linguistic.pl
    the book article noun
    he is ill pronoun verb adjective

    Update: cleanup
      my @words;
      my @parts_of_speech;
      while(my $sentence = <DATA>) {
          @words = ();
          @parts_of_speech = ();
          while ($sentence =~ /(\w+)\/(\w+)/g ) {
              push(@words, $1) if $1;
              push(@parts_of_speech, $2) if $2;
          }
          print $_, " " for @words;
          print $_, " " for @parts_of_speech;
          print "\n";
      }

      The arrays @words and @parts_of_speech don't need to be in file scope. If you declare them inside the loop, you also no longer need to reset them on every iteration.

      The pattern \w+ always matches at least one character, so $1 and $2 can only be false when the captured text is exactly '0'. The tests on $1 and $2 are therefore superfluous, and they would silently drop a token of '0'.

      Your print statements are overly complicated. They could be simplified to:

      print "@words @parts_of_speech\n";
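
Putting the three suggestions above together, a cleaned-up version might look like the sketch below. The helper name reorder_line() and the in-script sample lines are assumptions added for illustration (the original reads from the DATA handle), so treat this as one possible arrangement rather than the poster's final code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# reorder_line() is a hypothetical helper name, not from the thread.
# It takes one line of word/tag pairs and returns the words followed
# by the tags, separated by single spaces.
sub reorder_line {
    my ($sentence) = @_;
    my (@words, @parts_of_speech);    # scoped per call, so no manual reset
    while ($sentence =~ m{(\w+)/(\w+)}g) {
        push @words,           $1;    # no truth test, so a '0' token is kept
        push @parts_of_speech, $2;
    }
    return "@words @parts_of_speech";
}

# Sample input, assumed to be one sentence per line.
my @lines = (
    'the/article book/noun',
    'he/pronoun is/verb ill/adjective',
);
print reorder_line($_), "\n" for @lines;
```

Because the arrays are lexical to the subroutine, each line starts with empty arrays without any explicit clearing.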
      Dear grep, toolic, Gangabass and dwm042, thanks a lot for your advice!! I realize I still have a lot to learn, and this first visit to this website was very interesting. dwm042, your script does exactly what I want, thx! monkantar
Re: arrays of arrays
by toolic (Bishop) on Sep 07, 2007 at 13:52 UTC
    Your question is a little ambiguous regarding the exact format of your input file, and your desired output, but I refactored your code as shown below:
    > cat input.txt
    book 1 pencil 2 desk 3
    foo 5 bar 6 baz 7
    >
    > cat 637646.pl
    #!/usr/bin/env perl
    use warnings;
    use strict;

    my (@n, @w);
    open INFILE, '<', 'input.txt' or die "can not open file $!\n";
    while (<INFILE>) {
        my (@numbers) = /(\d+)/g;
        my (@words) = /([a-z]+)/g;
        push @n, @numbers;
        push @w, @words;
    }
    close INFILE;

    open MYFILE, '>', 'output.txt' or die "can not write to file $!\n";
    print MYFILE "$_ " for @n;
    print MYFILE "\n";
    print MYFILE "$_ " for @w;
    print MYFILE "\n";
    close MYFILE;
    >
    > 637646.pl
    >
    > cat output.txt
    1 2 3 5 6 7
    book pencil desk foo bar baz
    >

    The input file has a couple of lines, each with a few word-number pairs. The output was formatted with all numbers on one line and all words on the next line.

    The code does not employ arrays of arrays because I do not think that is necessary. Hope this helps.
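
For comparison, if the tokens should stay grouped per input line instead of being pooled, an array of arrays is one way to hold them. The following is only a sketch under that assumption. The names split_tokens() and @rows are made up for illustration, and the input lines are passed in directly rather than read from input.txt.

```perl
use strict;
use warnings;

# split_tokens() and @rows are illustrative names, not from the thread.
# Each element of the returned list is a reference to an array holding
# the reordered tokens of one input line (numbers first, then words).
sub split_tokens {
    my @rows;
    for my $line (@_) {
        my @numbers = $line =~ /(\d+)/g;
        my @words   = $line =~ /([a-z]+)/g;
        push @rows, [ @numbers, @words ];    # one inner array per line
    }
    return @rows;
}

my @rows = split_tokens('book 1 pencil 2 desk 3', 'foo 5 bar 6 baz 7');
print "@$_\n" for @rows;    # one output line per input line
```

An individual token is then reachable as $rows[$line_index][$token_index], which is the main thing the nested structure adds over two flat arrays.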

Re: arrays of arrays
by grep (Monsignor) on Sep 07, 2007 at 13:48 UTC
    You want to split the string - so use split.

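    ##UNTESTED
    use strict;
    use warnings;

    my $string = 'the/article book/noun he/pronoun is/verb ill/adjective';
    # Split on whitespace first, then split each word/tag pair on '/'.
    # (Nesting split() inside split() would put the inner call in scalar
    # context, where it returns a count, so map is used instead.)
    my @array = map { split m{/}, $_ } split /\s+/, $string;
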
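
A split-only approach can also separate the words from the tags directly. This is a minimal sketch, assuming every whitespace-separated token has exactly the form word/tag. The helper name untangle() is hypothetical.

```perl
use strict;
use warnings;

# untangle() is a hypothetical helper name. It assumes every
# whitespace-separated token has exactly the form word/tag.
sub untangle {
    my ($string) = @_;
    my (@words, @tags);
    for my $pair (split ' ', $string) {           # ' ' skips leading whitespace
        my ($word, $tag) = split m{/}, $pair, 2;  # at most two fields
        push @words, $word;
        push @tags,  $tag;
    }
    return (@words, @tags);                       # all words, then all tags
}

my @array = untangle('the/article book/noun');
print "@array\n";
```

A token without a slash would leave $tag undefined here, so real corpus data would need a check at that point.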
Re: arrays of arrays
by throop (Chaplain) on Sep 07, 2007 at 17:00 UTC
    You've gotten several good answers on the problem you think you have. But you have a larger problem.

    If this Perl code is really meant for linguistic research (as opposed to being a toy program for a computational-linguistics 101), don't roll your own parser / tokenizer. Already your approach embodies design decisions that will bite you.

    • It assigns each token to a single linguistic category, ignoring polysemia. What will it do with 'back', which can be a noun, verb, adverb, adjective or preposition?
    • Sometimes you'll want to tokenize across whitespace. E.g., 'break up' is better classed as a single verb token than as a verb+preposition pair.
    I encourage you to check out IBM's (free, open-standard) UIMA.

    throop

Re: arrays of arrays
by Gangabass (Vicar) on Sep 07, 2007 at 13:58 UTC

    I don't fully understand what you need but you can try to push data like so:

    # $. is the current line number in the file
    push @{ $data{$.}{nums} }, $1;

    and

    push @{ $data{$.}{words} }, $1;

    After that you will have a hash keyed by line number, holding the numbers and the words found on each line.
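
To show how those two push statements might fit into a complete script, here is a minimal sketch. The subroutine name collect_by_line(), the sample text, and the use of an in-memory filehandle in place of input.txt are all assumptions made for illustration.

```perl
use strict;
use warnings;

# collect_by_line() and the sample text are assumptions for illustration.
# It reads from any filehandle and returns a reference to a hash keyed by
# line number; each value holds the 'nums' and 'words' found on that line.
sub collect_by_line {
    my ($fh) = @_;
    my %data;
    while (my $line = <$fh>) {
        # $. is the line number of the handle that was read most recently
        push @{ $data{$.}{nums} },  $1 while $line =~ /(\d+)/g;
        push @{ $data{$.}{words} }, $1 while $line =~ /([a-z]+)/g;
    }
    return \%data;
}

# An in-memory filehandle stands in for input.txt here.
my $text = "book 1 pencil 2 desk 3\nfoo 5 bar 6 baz 7\n";
open my $fh, '<', \$text or die "can not open in-memory file: $!\n";
my $data = collect_by_line($fh);
close $fh;

for my $n (sort { $a <=> $b } keys %$data) {
    # '|| []' guards against lines that had no numbers or no words
    print "$n: @{ $data->{$n}{nums} || [] } @{ $data->{$n}{words} || [] }\n";
}
```

The inner arrays spring into existence on the first push (autovivification), so no per-line initialisation is needed.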
