Cycle, iterations and statistical measure got completely wrong

remluvr has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks. I tried posting a similar question earlier, but since I made a little bit (or a lot of) mess while asking it, and since three days of working on provided not a lot of progress, I thought I'd try posting again, trying to make my problem more clear.
My input is like the following:

1    beast-n    into    transform-v    356.9551
2    beast-n    obj    kill-v    266.2511
3    beast-n    obj    see-v    252.3623
4    beast-n    prd    become-v    250.9534
5    beast-n    obj    turn-v    224.6948
6    beast-n    obj    call-v    171.4000
7    beast-n    sbj_intr    devour-v    165.3228
8    beast-n    obj    hunt-v    155.7637
9    beast-n    obj    fight-v    150.4370
10    beast-n    obj    slay-v    150.3982
1    frog-n    obj    find-v    322.5589
2    frog-n    into    turn-v    307.3012
3    frog-n    sbj_intr    jump-v    235.0503
4    frog-n    coord-1    toad-n    207.3611
5    frog-n    obj    see-v    207.2610
6    frog-n    obj    eat-v    204.1762
7    frog-n    obj    kill-v    64.6689
[download]

But please, take in account this is just a sample, since the actual output is more or less 4 Giga of text structured like the previous.
Also, I take in input something like this:

frog-n    amphibian_reptile    hyper    beast-n
[download]

Even in this case, consider this is just a sample of the actual file I have.
What I need to do is a little complicated, so I hope I can explain it properly. Please, if it is not clear, ask me and I'd be glad to provide further infos.
For every entry in the second file I provided as sample, I need to check the occurrence of the first field and last field of it in the first file.
I then have to scan the first file for every line in which occur the first field of the other file (in this case, every line that has frog-n as second field) and see what the fourth field of it contains (i.e. the first entry containing frog-n, has find-v). I now have to check if find-v occurs with any entry of the same file that has beast-n (the second term of the hyper relation in my other input) as first field. In this case, find-v does not occur with it, so I have to check the following line of frog-n. Its fourth field is turn-v. I check for occurrence of it with beast-n and see that it occur with it. So, I have to compute Precision for frog-n, that is the number of found feature that occurs both with frog-n and beast-n / the rank of frog, which in this case would be 1(found feat)/2(rank of frog-n at this point). Then I need to extract also the rank of beast-n in which I found turn-v (which in this case would be 5) and the total number of occurrences of beast-n in the file (10), in order to compute a measure of reduction to apply to the just computed precision.
The measure is 1-rankfoundfv/(rankfv+1) and here, it would be 1-5/10+1=0,545454545454. So, the association measure I need would be the precision previously computed * the reduction measure I got= 1/2*0,545454545454=0,272727272727.
I have to repeat this for every entry of frog and in the end (when I rach the last entry of frog) sum all the association measure I got and divide the result for the number of occurrences of frog.

Here's my code:

#!/usr/bin/perl -w


use strict;
use warnings;
use Getopt::Std;
use Data::Dumper ;


my $usage;
{
$usage = <<"_USAGE_";
_USAGE_
}


my %opts = ();

getopts('h',\%opts);

if ($opts{h}) {
    print $usage;
    exit;
}


my $prefix = shift;
my $input = shift;
my $input_bless=shift;


my $file = $prefix . ".txt";


if (-e $file) {
    print STDERR "$file already exists, deleting previous version\n";
    `rm -f $file`;
}

#my $debug=0; #Variabile di debug. Vale 1 in fase di debug, si usa per

open INPUT,$input;

open OUT,">$file";
my %matrice;
while (<INPUT>) {
    my ($rank, $nome, $relaz, $entry2, $score) = split();
    push @{ $matrice{$nome} }, "${relaz}_$entry2,$rank,$score";
}


#print OUT Dumper  \%matrice;


my %HOH;
while (my($name,$aref) = each %matrice ) {
    for my $item (@$aref) {
        my($prop,$rank,$score) = split(',',$item);
        #push @ {$HOH{$name}{$rank}{$prop}}, "$score";
        $HOH{$name}{$prop} = "${rank},$score"|| 0;
    }
}

#print Dumper  \%HOH;

open INPUTB,$input_bless;


#my %descriptions;

while (<INPUTB>)
{
   my ($u, $superclass,$rel,$v) = (split)[0,1,2,3];
    my $conteggio=&calcolo($u,$v);
   
   print OUT "$u"."\t".$rel."\t".$v."\t".$conteggio."\n";
}
close INPUTB;
close OUT;
   
   
   
   sub calcolo{
   
        my ($name1, $name2)=@_;
    my $first  = $HOH{$name1};
    my $second = $HOH{$name2};
    my ($rank_fv,$score_fv);
    my $rank_v;
    my $count_feat_fv;
   # my ($prop,$rank, $score);
    my $provaprec=0;
    my $proptoexamine;
    my $count_feat_rel;
    my $precision;
    my $rel_par;
    my $num=0;
    my $count_feat_fu;
    my $apinc;
    my $rank2=0;
    #my ($prop2,$rank2, $score2);
  my($prop,$score);
  my $rank=0;
  my $feature_found=0;
  my $feat_finale;
  my $rel_to_sum;
   
   while (my($name1,$aref) = each %matrice ) {
    $count_feat_fu++;
    
   $num=0;
    my $feat_rel=0;
    for my $item (@$aref) {
    ($prop,$rank, $score) = split(',',$item);
    
#    print "PROP: ".$prop."\n";
    
    # if (exists $second->{$prop}){
        #$count_feat_fv=0;
        #    $feat_rel++;
         #   print "FEAT REL: ".$feat_rel."\n";
          #  print "#####################################\n";
           # print "ENTRATO\n";
            #$proptoexamine=$prop;
            #print "PROP TO EXAMINE: ".$proptoexamine."\n";
            #$count_feat_rel++;
            #print "RANK1: ".$rank."\n";
            #$precision = $feat_rel/$rank;
            #print "PRECISION: ".$precision."\n";
             $feat_finale=&last_el_v($name2,$prop);
             
        while (my($prop1,$rankscore1) = each %$second ){
            ($rank_fv,$score_fv) = split(',',$rankscore1);
        
        if ($prop1 eq $prop){
            $feature_found++;
            print "PROP:".$prop."\n";
            print "TROVATO\n";
             $feat_rel++;
            $rank2=$rank_fv;
            print "RANK 2:".$rank2."\n";
            print "RANK 1: ".$rank."\n";
            print "COUNT FEAT FOUND: ".$feature_found."\n";
            $precision=$feature_found/$rank;
             print "PRECISION: ".$precision."\n";
              $rel_par=$rank2/($feat_finale+1);
              $rel_to_sum=1-($rel_par);
              print "FEAT FINALE: ".$feat_finale."\n";
              print "REL PAR: ".$rel_par."\n";
             my $tosum= $precision*$rel_to_sum;
              print "TO SUM: ".$tosum."\n";
              $num =$num+$tosum;
               print "NUM: ".$num."\n";
        }
        
        #}
     #   print "RANK 2:".$rank2."\n";
        #
       
        #print "*********************************\n";
       
       #print "RELPAR".$rel_par."\n";
      # my $rel_tot=1-$rel_par;
      #print "REL TOT".$rel_tot."\n";
      #  print "*********************************\n";
    #  my  $tosum=$precision*$rel_tot;
     #print "TO SUM".$tosum."\n";
     #print "NUM".$num."\n";
    #  $num=$num+$tosum;
      #print "####################################\n";
       #  print "NUM: ".$num."\n";
        # print "RANK: ".$rank."\n";
        
       #  print "APINC ".$apinc."\n";
         
       }
     #   $feat_finale=$count_feat_fv;
    #print "COUNT FEAT FV ".$count_feat_fv."\n";
     
       }
    print "RANK: ".$rank."\n";
     $apinc=$num/$rank;
     
       print "APINC: ".$apinc."\n";
     # print "COUNT FEAT FV".$count_feat_fv."\n";
    #print "APINC".$apinc."\n";
     
    # print "COUNT FEAT U: ".$count_feat_fu."\n";
     
   
    return $apinc;
    }
  

    # print $prop."\t".$rank."\t".$score."\n";
    
}
   
   
   
sub last_el_v{
     my ($name2,$prop1)=@_;
    #my $first  = $HOH{$name1};
    my $second = $HOH{$name2};
    my $count_feat_fv;
    
    while (my($prop1,$rankscore1) = each %$second ){
        $count_feat_fv++;
   }
    return $count_feat_fv;
   }
[download]

And the result I got when I run it on previous data is wrong, since it looks like it runs just over every entry of beast-n and repeats the calculations on it.
I've been spending three days trying to fix this but it looks like I can't succeed. Any ideas?
Thanks in advance
Giulia

Comment on Cycle, iterations and statistical measure got completely wrong Select or Download Code

Replies are listed 'Best First'.
Re: Cycle, iterations and statistical measure got completely wrong by roboticus (Chancellor) on Mar 14, 2012 at 11:02 UTC
remluvr: And the result I got when I run it on previous data is wrong, since it looks like it runs just over every entry of beast-n and repeats the calculations on it. Yes, that's clearly what you're doing: `while (<INPUTB>) { my ($u, $superclass,$rel,$v) = (split)[0,1,2,3]; my $conteggio=&calcolo($u,$v); print OUT "$u"."\t".$rel."\t".$v."\t".$conteggio."\n"; }` [download] Perhaps it would be better to read lines in groups, so you have similar records at the same time, then you can pass the smaller set of records into your calculation routine. Something like (untested): `my $cur_superclass; my @records; while (<INPUTB>) { my ($u, $superclass,$rel,$v) = (split)[0,1,2,3]; $cur_superclass = $superclass if !defined($cur_superclass); if ($cur_superlclass ne $superclass) { my $conteggio=calcolo(\@records); print OUT "$u"."\t".$rel."\t".$v."\t".$conteggio."\n"; @records = (); } push @records, [ $u, $superclass, $rel, $v ]; $cur_superclass = $superclass; }` [download] This will, of course, complicate `calcolo()` a bit, as you'll have to iterate over the records to find what you're looking for. But at least it will ensure that you're processing all the related records at once, rather than repeating the calculation over and over: `sub calcolo { my $rRecs = shift; # Number of records my $found_recs = @{$rRecs}; for my $rRec (@$rRecs) { # compute partial values my ($u, $superclass, $rel, $v) = @$rRec; .... } # combine partials return $partials / $found_recs; # or some such... }` [download] ...roboticus When your only tool is a hammer, all problems look like your thumb.	[reply] [d/l] [select]
Re: Cycle, iterations and statistical measure got completely wrong by Anonymous Monk on Mar 14, 2012 at 09:48 UTC
Hi :) Clobbering the file, overwriting the file, overwrites it, no need to remove it first :) so you can delete these lines if (-e $file) { print STDERR "$file already exists, deleting previous version\n"; `rm -f $file`; } [download] Also, you should check if open failed, you can let autodie do the checking for you	[reply] [d/l]
Re: Cycle, iterations and statistical measure got completely wrong by Khen1950fx (Canon) on Mar 14, 2012 at 11:53 UTC
I altered your script so that it would just work. The biggest problem that I encountered was getting around the illegal divsion by zero such as: `$precision = $feature_found / $rank;` [download] What I did: `$precision = eval { $feature_found ? $rank / $feature_found : 0; };` [download] Here's the script: #!/usr/bin/perl use strict; use autodie; use Devel::SimpleTrace; use Data::Dumper::Concise; my $input = shift @ARGV; my $input_bless = ''; my $log = '/root/Desktop/log.txt'; die $! unless open IN, '<', $input; die $! unless open OUT, '>', $log; my %matrice; while ( defined( $_ = <IN> ) ) { my ( $rank, $nome, $relaz, $entry2, $score ) = split( ' ', $_, 6 ) +; push @{ $matrice{$nome}; }, "${relaz}_$entry2,$rank,$score"; } my %HOH; while ( my ( $name, $aref ) = each %matrice ) { do { foreach my $item (@$aref) { my ( $prop, $rank, $score ) = split( /,/, $item, 4 ); $HOH{$name}{$prop} = "$rank,$score" \|\| 0; } }; } die unless open INB, '<', $input; while ( defined( $_ = <INB> ) ) { my ( $u, $superclass, $rel, $v ) = ( split( ' ', $_, 0 ) )[ 0, 1, 2, 3 ]; my $conteggio = &calcolo( $u, $v ); print OUT "$u" . "\t" . $rel . "\t" . $v . "\t" . $conteggio; } sub calcolo { my ( $name1, $name2 ) = @_; my $first = $HOH{$name1}; my $second = $HOH{$name2}; my ( $rank_fv, $score_fv ); my $rank_v; my $count_feat_fv; my $provaprec = 0; my $proptoexamine; my $count_feat_rel; my $precision; my $rel_par; my $num = 0; my $count_feat_fu; my $apinc; my $rank2 = 0; my ( $prop, $score ); my $rank = 0; my $feature_found = 0; my $feat_finale; my $rel_to_sum; while ( my ( $name1, $aref ) = each %matrice ) { ++$count_feat_fu; $num = 0; my $feat_rel = 0; foreach my $item (@$aref) { ( $prop, $rank, $score ) = split( /,/, $item, 4 ); $feat_finale = &last_el_v( $name2, $prop ); while ( my ( $prop1, $rankscore1 ) = each %$second ) { ($rank_fv, $score_fv ) = split( /,/, $rankscore1, 3); if ( $prop1 eq $prop ) { ++$feature_found; print 'PROP:' . $prop . "\n"; print "TROVATO" . "\n"; ++$feat_rel; $rank2 = $rank_fv; print 'RANK 2:' . $rank2 . "\n"; print 'RANK 1: ' . $rank . "\n"; print 'COUNT FEAT FOUND: ' . $feature_found . "\n" +; $precision = eval { $feature_found ? $rank / $feature_found : 0; }; print 'PRECISION: ' . $precision . "\n"; $rel_par = $rank2 / ( $feat_finale + 1 ); $rel_to_sum = 1 - $rel_par; print 'FEAT FINALE: ' . $feat_finale . "\n"; print 'REL PAR: ' . $rel_par . "\n"; my $tosum = $precision * $rel_to_sum; print 'TO SUM: ' . $tosum . "\n"; $num = $num + $tosum; print 'NUM: ' . $num . "\n"; } }; } print 'RANK: ' . $rank . "\n"; $apinc = eval { $num ? $rank / $num : 0; }; print 'APINC: ' . $apinc . "\n"; return $apinc; }; } sub last_el_v { my ( $name2, $prop1 ) = @_; my $second = $HOH{$name2}; my $count_feat_fv; while ( my ( $prop1, $rankscore1 ) = each %$second ) { ++$count_feat_fv; }; return $count_feat_fv; } close IN; close INB; close OUT; [download]	[reply] [d/l] [select]