perl_lover_always has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl Monks, I'm looking for advice on a problem. I have two large text files with the same number of lines; in fact they are parallel texts in two different languages. I want the frequency of each term occurring in each file, plus the line numbers it occurs on, since I will later calculate which terms happen to appear on the same line. For example:
FILE1:
this is an example.
the example is just for display.
how to solve it in an efficient way?

FILE2:
este es un ejemplo.
el ejemplo es sólo para mostrar.
cómo resolverlo de una manera eficiente?
Later I want to extract, for example, how many times "un" and "and" happened to appear on the same line of the two files. Since the files are big and I have to reprocess them inside my code, I would prefer an efficient solution. I know some solutions, but they are neither efficient nor a good fit for this problem, and I'm not an expert in Perl, nor a complete beginner ;) Your advice is appreciated.

Replies are listed 'Best First'.
Re: term frequency and mutual info
by jethro (Monsignor) on Oct 21, 2010 at 15:42 UTC
    For each language file create a hash with the words as keys and a comma-separated list of line numbers as the data.

    Since that hash will be quite large, use a database to store it. A very popular solution for a disk-based hash is DBM::Deep: easy to use, fast, well tested.

    If the hash fits into memory, you could accumulate it in memory first and then store it to disk. If not, the initial creation of the hash will take somewhat longer, but not much, thanks to disk caches. Either way it is a price you have to pay only once.

    After that, finding out the lines where 'un' occurred is just a simple hash access and a split, practically instantaneous.
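
    A minimal sketch of that idea, assuming DBM::Deep and a simple \w+ word split; the file names and the split are only illustrative:

    use strict;
    use warnings;
    use DBM::Deep;

    # disk-based hash: word => comma-separated list of line numbers
    my $db = DBM::Deep->new('words_en.db');    # file name assumed

    open my $fh, '<', 'file1.txt' or die "file1.txt: $!";   # file name assumed
    while (my $line = <$fh>) {
        for my $word ($line =~ /\w+/g) {
            # append the current line number ($.) to the word's list
            $db->{$word} = defined $db->{$word} ? "$db->{$word},$." : $.;
        }
    }
    close $fh;

    # later: the lines on which 'un' occurred
    my @lines_with_un = split /,/, ($db->{un} // '');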

      Thanks, I'll look into the database, since I guess it would be useful to create it once and use it any time without wasting cache.
Re: term frequency and mutual info
by CountZero (Bishop) on Oct 21, 2010 at 16:07 UTC
    Perhaps something like this?
    use strict;
    use warnings;
    use 5.012;
    use Data::Dumper;

    my @file1 = (
        'this is an example.',
        'the example is just for display.',
        'how to solve it in an efficient way?',
    );
    my @file2 = (
        'este es un ejemplo.',
        'el ejemplo es sólo para mostrar.',
        'cómo resolverlo de una manera eficiente?',
    );

    say Dumper(parse(@file1));
    say Dumper(parse(@file2));

    sub parse {
        my @sentences = @_;
        my %words_catalogue;
        my $line = 1;
        for my $sentence (@sentences) {
            my @words = split ' ', $sentence;
            $words_catalogue{$_}{$line}++ for @words;
            $line++;
        }
        return \%words_catalogue;
    }
    The split into words is very naive and you will have to look into it, perhaps by first discarding punctuation and the like. Still, the idea should be clear.
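
    For instance, a slightly less naive split might pull out runs of letters and lowercase them (just a sketch; adjust to your own definition of a word):

    # keep only runs of Unicode letters, lowercased
    my @words = map { lc } $sentence =~ /\p{L}+/g;

    With that in place the catalogue keys become 'ejemplo' rather than 'ejemplo.'.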

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Thanks! I thought of this before; however, since my data is large and I can't afford to keep it cached for every run, I guess it's better to create a DB once and use it any time in the code!
Re: term frequency and mutual info
by raybies (Chaplain) on Oct 21, 2010 at 15:46 UTC

    Fwiw, my inclination would be to open both files and read/process them simultaneously, line by line. Dunno how efficient it is, but I'd create a hash of array refs of hash refs, where the basic layout was something like this...

    $hash->{word is key}->[each word gets an index]->{linenumber}, where the value is a list of word numbers (adding another ->[index for each wordnum]); or, if that's too complex, just keep a tally of how many times the word appears on the line, and then search the line when you need to.

    (You might also consider using objects to make the hash of array of hash of array of hash of ... etc. more readable, if that makes you squeamish.)

    That would make the second part of your problem not so difficult, because you could immediately access all words in a file, you'd know how many occurrences were in the full file by evaluating the array of hash refs in scalar context, and you'd have a line-number entry for each occurrence.
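
    A rough, self-contained sketch of that layout for one file (the file name and \w+ split are assumptions, and the innermost level is flattened into one hash per occurrence):

    use strict;
    use warnings;

    # %index: word => array of occurrences, each occurrence recording
    # the line number and the word's position on that line
    my %index;

    open my $fh, '<', 'file1.txt' or die "file1.txt: $!";   # file name assumed
    while (my $line = <$fh>) {
        my $pos = 0;
        for my $word ($line =~ /\w+/g) {
            push @{ $index{$word} }, { line => $., wordnum => $pos++ };
        }
    }
    close $fh;

    # occurrences of a word in the whole file = its array in scalar context
    my $count = $index{example} ? scalar @{ $index{example} } : 0;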

    I assume in your second example you meant "un" and "an", not "and"...

    Curious, but is this to draw a correlation algorithmically between the meanings of words, by how often they appear on the same lines? IOW, is the intent to look at all the words on a line and see whether they consistently show up on corresponding lines, and thus derive the meaning?

      Basically, mutual information would give a hint about how close the words are in context! So having some mutual information in the statistics helps to enrich the feature space.
Re: term frequency and mutual info
by salva (Canon) on Oct 21, 2010 at 16:37 UTC
    I guess you want to do something like this:
    # untested!
    open my $fh1, ...
    open my $fh2, ...
    my %pairs;
    while (1) {
        my $l1 = <$fh1>;
        my $l2 = <$fh2>;
        last unless (defined $l1 and defined $l2);
        my @l1 = $l1 =~ /\w+/g;
        my @l2 = $l2 =~ /\w+/g;
        for my $w1 (@l1) {
            for my $w2 (@l2) {
                $pairs{"$w1-$w2"}++;
            }
        }
    }
    my @sorted = sort { $pairs{$b} <=> $pairs{$a} } keys %pairs;
    for my $k (@sorted) {
        say "$pairs{$k} ==> $k";
    }
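
    From there, turning pair counts into pointwise mutual information only needs the per-word counts and the number of lines. A toy, self-contained sketch; the hash names and the numbers below are made up purely for illustration:

    use strict;
    use warnings;
    use feature 'say';

    # illustrative counts only: co-occurring pairs, per-word counts for
    # each file, and the total number of lines
    my %pairs  = ( 'an-un' => 1 );
    my %count1 = ( an => 1 );
    my %count2 = ( un => 1 );
    my $n      = 3;

    for my $k (keys %pairs) {
        my ($w1, $w2) = split /-/, $k, 2;
        my $pmi = log( ($pairs{$k} / $n)
                     / ( ($count1{$w1} / $n) * ($count2{$w2} / $n) ) );
        say "$k: PMI = $pmi";
    }
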
Re: term frequency and mutual info
by kcott (Archbishop) on Oct 21, 2010 at 15:42 UTC

    The answer will depend on whether "later" refers to later in the script or to some later time.

    If later in the script, then simply reading through the two files and storing the results in an array of hashes might be easiest:

    $fileA[$.]{$keyword}++

    If you want this for subsequent processing, then a database solution is possibly the way to go.
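
    A minimal sketch of the in-script version; the file name and the \w+ word split are assumptions:

    use strict;
    use warnings;

    my @fileA;
    open my $fh, '<', 'file1.txt' or die "file1.txt: $!";   # file name assumed
    while (my $line = <$fh>) {
        # $. is the current line number; one hash of word counts per line
        $fileA[$.]{$_}++ for $line =~ /\w+/g;
    }
    close $fh;

    # e.g. how often 'example' occurs on line 2
    my $hits = $fileA[2]{example} // 0;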

    -- Ken

      Thanks for the hint! I guess the database solution is the feasible one, since I don't need to create it over and over each time I run my code!
Re: term frequency and mutual info
by planetscape (Chancellor) on Oct 22, 2010 at 02:13 UTC
      Well, I'm very familiar with those; however, there are some limitations and restrictions! Since I have a parallel corpus, I need the line numbers to be indexed! Moreover, it is not efficient to completely change their code and package, although the work is clean and interesting!