perl_lover_always has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl Monks, I'm looking for advice on a problem. I have two large text files with the same number of lines; in fact they are parallel texts in two different languages. I want the frequency of each term occurring in each file, plus the line numbers it occurs on, since I will later calculate which terms happen to appear on the same line. For example:
FILE1:
this is an example.
the example is just for display.
how to solve it in an efficient way?

FILE2:
este es un ejemplo.
el ejemplo es sólo para mostrar.
cómo resolverlo de una manera eficiente?
Later I want to extract, for example, how many times "un" and "and" happened to appear on the same line of the two files. Since the files are big and I have to reprocess them inside my code, I would prefer an efficient solution. I know some solutions, but they are neither efficient nor a good fit for this problem, and I'm not an expert in Perl, nor a complete beginner ;) Your advice is appreciated.

Replies are listed 'Best First'.
Re: term frequency and mutual info
by jethro (Monsignor) on Oct 21, 2010 at 15:42 UTC
    For each language file create a hash with the words as keys and a comma-separated list of line numbers as the data.

    Since that hash will be quite large, use a database to store it. A very popular solution for a disk-based hash is DBM::Deep: easy to use, fast, well tested.

    If the hash fits into memory, you could accumulate it in memory first and then store it to disk. If not, the initial creation of the hash will take somewhat longer, but not much, thanks to disk caches. Either way it is a price you have to pay only once.

    After that, finding out the lines where 'un' occurred is just a simple hash access and a split, practically instantaneous.
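
    A minimal sketch of that idea, assuming DBM::Deep and a simple \w+ word split; the file names and the split are only illustrative:

    use strict;
    use warnings;
    use DBM::Deep;

    # disk-based hash: word => comma-separated list of line numbers
    my $db = DBM::Deep->new('words_en.db');    # file name assumed

    open my $fh, '<', 'file1.txt' or die "file1.txt: $!";   # file name assumed
    while (my $line = <$fh>) {
        for my $word ($line =~ /\w+/g) {
            # append the current line number ($.) to the word's list
            $db->{$word} = defined $db->{$word} ? "$db->{$word},$." : $.;
        }
    }
    close $fh;

    # later: the lines on which 'un' occurred
    my @lines_with_un = split /,/, ($db->{un} // '');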

      Thanks, I'll look into the database, since I guess it would be useful to create it once and use it any time without wasting cache.
Re: term frequency and mutual info
by CountZero (Bishop) on Oct 21, 2010 at 16:07 UTC
    Perhaps something like this?
    use strict;
    use warnings;
    use 5.012;
    use Data::Dumper;

    my @file1 = (
        'this is an example.',
        'the example is just for display.',
        'how to solve it in an efficient way?',
    );
    my @file2 = (
        'este es un ejemplo.',
        'el ejemplo es sólo para mostrar.',
        'cómo resolverlo de una manera eficiente?',
    );

    say Dumper(parse(@file1));
    say Dumper(parse(@file2));

    sub parse {
        my @sentences = @_;
        my %words_catalogue;
        my $line = 1;
        for my $sentence (@sentences) {
            my @words = split ' ', $sentence;
            $words_catalogue{$_}{$line}++ for @words;
            $line++;
        }
        return \%words_catalogue;
    }
    The split into words is very naive and you will have to look into it, perhaps by first discarding punctuation and the like. Still, the idea should be clear.
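
    For instance, a slightly less naive split might pull out runs of letters and lowercase them (just a sketch; adjust to your own definition of a word):

    # keep only runs of Unicode letters, lowercased
    my @words = map { lc } $sentence =~ /\p{L}+/g;

    With that in place the catalogue keys become 'ejemplo' rather than 'ejemplo.'.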

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Thanks! I thought of this before; however, since my data is large and I can't afford to keep it cached for every run, I guess it's better to create a DB once and use it any time in the code!
Re: term frequency and mutual info
by raybies (Chaplain) on Oct 21, 2010 at 15:46 UTC

    Fwiw, my inclination would be to open both files and read/process them simultaneously, line by line. Dunno how efficient it is, but I'd create a hash of array refs of hash refs, where the basic layout was something like this...

    $hash->{word is key}->[each word gets an index]->{linenumber}, where the value is a list of word numbers (adding another ->[index for each wordnum]); or, if that's too complex, just keep a tally of how many times the word appears on the line, and then search the line when you need to.

    (You might also consider using objects to make the hash of array of hash of array of hash of ... etc. more readable, if that makes you squeamish.)

    That would make the second part of your problem not so difficult, because you could immediately access all words in a file, you'd know how many occurrences were in the full file by evaluating the array of hash refs in scalar context, and you'd have a line-number entry for each occurrence.
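
    A rough, self-contained sketch of that layout for one file (the file name and \w+ split are assumptions, and the innermost level is flattened into one hash per occurrence):

    use strict;
    use warnings;

    # %index: word => array of occurrences, each occurrence recording
    # the line number and the word's position on that line
    my %index;

    open my $fh, '<', 'file1.txt' or die "file1.txt: $!";   # file name assumed
    while (my $line = <$fh>) {
        my $pos = 0;
        for my $word ($line =~ /\w+/g) {
            push @{ $index{$word} }, { line => $., wordnum => $pos++ };
        }
    }
    close $fh;

    # occurrences of a word in the whole file = its array in scalar context
    my $count = $index{example} ? scalar @{ $index{example} } : 0;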

    I assume in your second example you meant "un" and "an", not "and"...

    Curious, but is this to draw a correlation algorithmically between the meanings of words, by how often they appear on the same lines? IOW, is the intent to look at all the words on a line and see whether they consistently show up on corresponding lines, and thus derive the meaning?

      Basically, mutual information would give a hint about how close the words are in context! So having some mutual information in the statistics helps to enrich the feature space.
Re: term frequency and mutual info
by salva (Canon) on Oct 21, 2010 at 16:37 UTC
    I guess you want to do something like this:
    # untested!
    open my $fh1, ...
    open my $fh2, ...
    my %pairs;
    while (1) {
        my $l1 = <$fh1>;
        my $l2 = <$fh2>;
        last unless (defined $l1 and defined $l2);
        my @l1 = $l1 =~ /\w+/g;
        my @l2 = $l2 =~ /\w+/g;
        for my $w1 (@l1) {
            for my $w2 (@l2) {
                $pairs{"$w1-$w2"}++;
            }
        }
    }
    my @sorted = sort { $pairs{$b} <=> $pairs{$a} } keys %pairs;
    for my $k (@sorted) {
        say "$pairs{$k} ==> $k";
    }
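
    From there, turning pair counts into pointwise mutual information only needs the per-word counts and the number of lines. A toy, self-contained sketch; the hash names and the numbers below are made up purely for illustration:

    use strict;
    use warnings;
    use feature 'say';

    # illustrative counts only: co-occurring pairs, per-word counts for
    # each file, and the total number of lines
    my %pairs  = ( 'an-un' => 1 );
    my %count1 = ( an => 1 );
    my %count2 = ( un => 1 );
    my $n      = 3;

    for my $k (keys %pairs) {
        my ($w1, $w2) = split /-/, $k, 2;
        my $pmi = log( ($pairs{$k} / $n)
                     / ( ($count1{$w1} / $n) * ($count2{$w2} / $n) ) );
        say "$k: PMI = $pmi";
    }
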
Re: term frequency and mutual info
by kcott (Archbishop) on Oct 21, 2010 at 15:42 UTC

    The answer will depend on whether "later" refers to later in the script or to some later time.

    If later in the script, then simply reading through the two files and storing the results in an array of hashes might be easiest:

    $fileA[$.]{$keyword}++

    If you want this for subsequent processing, then a database solution is possibly the way to go.
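
    A minimal sketch of the in-script version; the file name and the \w+ word split are assumptions:

    use strict;
    use warnings;

    my @fileA;
    open my $fh, '<', 'file1.txt' or die "file1.txt: $!";   # file name assumed
    while (my $line = <$fh>) {
        # $. is the current line number; one hash of word counts per line
        $fileA[$.]{$_}++ for $line =~ /\w+/g;
    }
    close $fh;

    # e.g. how often 'example' occurs on line 2
    my $hits = $fileA[2]{example} // 0;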

    -- Ken

      Thanks for the hint! I guess the database solution is the feasible one, since I don't need to create it over and over each time I run my code!
Re: term frequency and mutual info
by planetscape (Chancellor) on Oct 22, 2010 at 02:13 UTC
      Well, I'm very familiar with those; however, there are some limitations and restrictions! Since I have a parallel corpus, I need the line numbers to be indexed! Moreover, it is not efficient to completely change their code and package, although the work is clean and interesting!