findtheriver has asked for the wisdom of the Perl Monks concerning the following question:

Hi everybody!
I've got a text file like this one:
1) atomo/atomo/S * senza/senza/E * nucleo/nucleo/S
2) chitarra/chitarra/S * a/a/E * corde/corda/S
3) coltello/coltello/S * dalla/da/E * lama/lama/S
4) edificio/edificio/S * ad/ad/E * facciata/facciata/S
5) biciclette/bicicletta/S * a/a/E * ruote/ruota/S
6) computer/computer/S * con/con/E * processore/processore/S
7) chiesa/chiesa/S * con/con/E * absidi/abside/S
8) opera/opera/S * con/con/E * volumi/volume/S
9) strada/strada/S * a/a/E * carreggiate/carreggiata/S
10) chitarra/chitarra/S .* a/a/E .* corde/corda/S
11) edificio/edificio/S .* con/con/E .* facciata/facciata/S
12) Codice/codice/S .* scritto/scrivere/V sulle/su/E .* lettere/lettera/S
13) computer/computer/S .* basati/basare/V su/su/E .* processore/processore/S
14) chiesa/chiesa/S .* con/con/E .* absidi/abside/S
15) opera/opera/S .* con/con/E .* volumi/volume/S
16) strada/strada/S .* a/a/E .* carreggiate/carreggiata/S
17) atomo/atomo/S .* senza/senza/E .* nucleo/nucleo/S
18) coltello/coltello/S .* dalla/da/E .* lama/lama/S
19) biciclette/bicicletta/S .* a/a/E .* ruote/ruota/S
20) coltello/coltello/S .* a/a/E .* lama/lama/S
21) codice/codice/S .* di/di/E .* lettere/lettera/S
22) biciclette/bicicletta/S .* a/a/E .* ruote/ruota/S
23) testa/testa/S .* di/di/E .* fronte/fronte/S

The first and last "unit" (by unit I mean everything like this: word/word/TOW) are also in a text file, in which they're written down as a pair, like this:
[Nn]ucle[oi]:[Pp]roton[oi]
OCS:chip
[Ff]otosistema:LHC
N2:[aA]zoto
[Cc]enobio:[Cc]appell[ae]
[Ee]sercit[oi]:[Ll]egion[ie]
[Tt]erreno:sabbia
[Ll]attosio:[Gg]lucosio
[Cc]odic[ei]:[Ll]etter[ae]
[aA]ttinio:[Ii]sotop[oi]
[Cc]erio:[Ii]sotop[oi]

What I'd like to do is count every time a certain relation, let's say con/con/E, appears with each pair of words.
I mean, what I expect to obtain is a text file like this: [Nn]ucle[io]-[Pp]roton[ei]-->4 where 4 is the number of times the pair is seen with the given relation.
What I did is the following:
#!/usr/bin/perl
use strict;
use warnings;

open my $listaParole,"File_Input/Coppie_Parole.txt" or die;
my %hash;
while (my $line=<$listaParole>) {
    chomp $line;
    my ($word1, $word2) = split /:/, $line;
    $hash{$word1} = $word2;
}
open my $input, "<Wiki_Pulito/Prova/Pattern2.txt" or die; # Carico la parte di file di testo che va analizzata (load the part of the text file to analyse)
open my $conteggio, ">Wiki_Pulito/Prova/Conteggio.txt"; # Apro il file di output (open the output file)
my $conto=0;
my %arrayris;
while (my $text=<$input>){
    for my $key (keys %hash){
        my $value = $hash{$key};
        while ($text =~/(($key\/$key\/S)\s{0,2}(\.\*)\s{0,2}(con\/con\/E)\s{0,2}(\.\*)\s{0,2}($value\/$value\/S))/is){
            $conto++;
        }
        my $arrkey=$key."-".$value;
        $arrayris{$arrkey}=$conto;
    }
}
while ( my ($k,$v) = each %arrayris ) {
    print $conteggio "($k) => $v\n";
}
close $input;
close $conteggio;
but something is wrong, since all I get is a series of 0s.
I'm sorry if I haven't explained my problem too well, but I'm Italian.
Also, I've been into Perl for just a little while and I'm pretty new to programming in general.
Thanks everyone for your help.

Replies are listed 'Best First'.
Re: Problem in counting the occurrences of a string in a text file
by linuxer (Curate) on Dec 29, 2008 at 15:21 UTC
    • please use a consistent style of indentation; it makes the code more readable; I think perlstyle has something about that.
    • the mix of English and Italian variable names doesn't help to understand the code
    • the Italian comments don't help those who don't speak Italian
    • your regex cannot match, as you use (\.\*) in it, but where in your data source is the string .*?
    • Is it correct that in your data source some lines use '*' and others '.*' to separate the items per line?
    • you use \s{0,2} in your regex, which makes it something like "string(0-2 whitespaces)string"; but I saw something like "string(3 whitespaces)string" in your data source; maybe you should consider using \s*

    update

    1. question changed
    2. +: whitespace regex
    3. fixed typo
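
The separator and whitespace points above can be tried out in isolation. This is only a minimal sketch: the sample line (with three spaces between items) is a hypothetical stand-in for the real data, and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample line: three spaces before each literal ".*" separator.
my $line = 'chiesa/chiesa/S   .* con/con/E   .* absidi/abside/S';

# \.\* matches the two literal characters ".*";
# \s{0,2} accepts at most two whitespace characters.
my $strict = qr!chiesa/chiesa/S\s{0,2}\.\*\s{0,2}con/con/E!;

# \s* accepts any amount of whitespace, including none.
my $loose = qr!chiesa/chiesa/S\s*\.\*\s*con/con/E!;

my $strict_hit = $line =~ $strict ? 1 : 0;
my $loose_hit  = $line =~ $loose  ? 1 : 0;

print "strict: $strict_hit, loose: $loose_hit\n";
```

With this sample line the \s{0,2} version fails because of the third space, while the \s* version matches. A line that separates items with a bare * would match neither, since both patterns require a literal dot before the star.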
Re: Problem in counting the occurrences of a string in a text file
by u671296 (Sexton) on Dec 29, 2008 at 16:22 UTC
    Hi,
    It would help if you presented less data and your examples reflected the data you are using.

    Some initial thoughts:
    1) The while loop around the Regex will be infinite if a match is found. I think that should read
    $conto++ if $value =~/(($key\/$key\/S)\s{0,2}(\.\*)\s{0,2}(con\/con\/E)\s{0,2}(\.\*)\s{0,2}($value\/$value\/S))/is;
    2) The while, open & close statements can be improved; see below
    3) You are inconsistent with your die and close statements, perhaps because you haven't got round to tidying them up yet
    4) The con\/con\/E part of the Regex probably needs to be in a variable so you can loop through the other possibilities e.g. "dalla/da/E"
    5) As your Regex is ignoring case, the e.g. [Nn] option is redundant.
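
Points 1 and 5 can be illustrated with a small sketch. The input string below is a hypothetical one-line sample, not the poster's file, and the pattern is shortened for readability.

```perl
use strict;
use warnings;

# Hypothetical input containing the same pattern twice on one line.
my $text = 'opera/opera/S .* con/con/E .* volumi/volume/S '
         . 'opera/opera/S .* con/con/E .* volumi/volume/S';

# In scalar context, /g resumes after the previous match, so the loop ends
# once no further match is found. Without /g the first match would be found
# again on every iteration and the loop would never finish.
my $count = 0;
$count++ while $text =~ m!opera/opera/S\s*\.\*\s*con/con/E!g;

# With /i, the class [Oo] and a plain "o" accept the same input.
my $class_hit = 'Opera/opera/S' =~ m!^[Oo]pera/! ? 1 : 0;
my $plain_hit = 'Opera/opera/S' =~ m!^opera/!i   ? 1 : 0;

print "matches: $count, class: $class_hit, /i: $plain_hit\n";
```

Note that only the case alternatives such as [Nn] become redundant under /i; a class like [oi] in [Nn]ucle[oi] still does real work, because it selects between different letters.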

    The following code seems to work and incorporates some of the above points. I've also simplified the Regex as the example data works with this Regex.
    #!/usr/bin/perl
    use strict;
    use warnings;

    # INPUT holds the word pairs ("first:second" per line),
    # LISTAPAROLE holds the text to analyse
    open( INPUT, "<File_Input/Coppie_Parole.txt") or die "Can't open Coppie_Parole.txt";
    open( LISTAPAROLE, "<Wiki_Pulito/Prova/Pattern2.txt") or die "Can't open Pattern2.txt";
    my %hash;
    while (<INPUT>) {
        chomp;
        my ($word1, $word2) = split /:/, $_;
        $hash{$word1} = $word2;
    }
    close INPUT;

    # Carico la parte di file di testo che va analizzata (load the part of the text file to analyse)
    open( CONTEGGIO, ">Wiki_Pulito/Prova/Conteggio.txt") or die "Can't open Conteggio.txt"; # Apro il file di output (open the output file)
    my $conto=0;
    my %arrayris;
    while (my $text = <LISTAPAROLE>){
        for my $key (keys %hash){
            my $value = $hash{$key};
            if ($text =~/$key\/$key\/.*con\/con\/E.*$value\/$value\/S/is){
                $conto++;
            }
            my $arrkey=$key."-".$value;
            $arrayris{$arrkey}=$conto;
        }
    }
    while ( my ($k,$v) = each %arrayris ) {
        print CONTEGGIO "($k) => $v\n";
    }
    close LISTAPAROLE;
    close CONTEGGIO;
      OK, a typo in the previous reply.

      $conto++ if $value =~/(($key\/$key\/S)\s{0,2}(\.\*)\s{0,2}(con\/con\/E)\s{0,2}(\.\*)\s{0,2}($value\/$value\/S))/is;

      should read

      $conto++ if $text =~/(($key\/$key\/S)\s{0,2}(\.\*)\s{0,2}(con\/con\/E)\s{0,2}(\.\*)\s{0,2}($value\/$value\/S))/is;

      Thanks for the comment.
      But if I try using your code, I get the same result as with mine. Every pair has the same number of occurrences, and that just isn't possible. I'm sorry if I cannot make it clearer; it's just that I don't know how to explain it.

        I think your problem is that you have only one counter variable, which is increased on every match, whichever pair matched.

        Consider something like this:

        #my $conto = 0;  #### REMOVED; not needed
        my %arrayris;
        while (my $text = <LISTAPAROLE>){
            for my $key (keys %hash){
                my $value = $hash{$key};
                if ($text =~/$key\/$key\/.*con\/con\/E.*$value\/$value\/S/is){
                    ### increase for each key/value pair individually
                    $arrayris{ join '-', $key, $value }++;
                }
            }
        }
        while ( my ($k,$v) = each %arrayris ) {
            print CONTEGGIO "($k) => $v\n";
        }
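
The per-pair counting idea can also be tried without any files. The following is a self-contained sketch under stated assumptions, not the code from this thread: %pairs and @lines are hypothetical in-memory stand-ins for Coppie_Parole.txt and Pattern2.txt, $relation is an assumed variable for the relation of interest, and the pattern uses explicit ".*" separators and counts every occurrence on a line rather than at most one per line.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the pair file and the text file.
my %pairs = (
    'computer'  => 'processore',
    '[Cc]hiesa' => 'absid[ei]',
    'opera'     => 'volum[ei]',
);
my @lines = (
    'computer/computer/S .* con/con/E .* processore/processore/S',
    'chiesa/chiesa/S .* con/con/E .* absidi/abside/S',
    'computer/computer/S .* con/con/E .* processore/processore/S',
    'strada/strada/S .* a/a/E .* carreggiate/carreggiata/S',
);
my $relation = 'con/con/E';    # assumed relation of interest

my %count;
for my $text (@lines) {
    for my $first (keys %pairs) {
        my $second = $pairs{$first};
        # Word form and lemma may differ (absidi/abside), so the pair
        # pattern is applied to both slots of each unit.
        my $re = qr!\b$first/$first/S\s*\.\*\s*\Q$relation\E\s*\.\*\s*$second/$second/S\b!;
        # One counter per pair; list-context /g returns every match.
        my $hits = () = $text =~ /$re/g;
        $count{"$first-$second"} += $hits;
    }
}

printf "(%s) => %d\n", $_, $count{$_} for sort keys %count;
```

Because each pair has its own hash entry, a pair that never occurs with the relation stays at 0 instead of inheriting a running total. On the sample data the computer pair is counted twice, the chiesa pair once, and the opera pair not at all.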
Re: Problem in counting the occurrences of a string in a text file
by Anonymous Monk on Dec 29, 2008 at 21:42 UTC
    I tried following your suggestions and working on the code, but I can't make it work.
    What I'd like to do is count every occurrence of the given relation (con/con/E is a relation) with every pair of words I have. If in my file I have 3 occurrences of the string
    computer/computer/S  .* con/con/E   .* processore/processore/S
    I'd like for my output to be like this:
    (computer-processore)-->3
    The new code I wrote is this one (I tried to make it more readable and, by the way, I also fixed the fact that my input file had some lines with just * and others with .*):
    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $listaParole,"File_Input/Coppie_Parole.txt" or die;
    my %hash;
    while (my $line=<$listaParole>) {
        chomp $line;
        my ($word1, $word2) = split /:/, $line;
        $hash{$word1} = $word2;
    }
    close $listaParole;
    open my $testo, "<Wiki_Pulito/Prova/Pattern2.txt" or die;
    open my $conteggio, ">Wiki_Pulito/Prova/Conteggio1.txt" or die;
    my $count=0;
    my %arrayris;
    while (my $text=<$testo>){
        for my $key (keys %hash){
            my $value = $hash{$key};
            if ($text =~/($key\/$key\/S)\s{0,4}(\.\*)\s{0,4}(con\/con\/E)\s{0,4}(\.\*)\s{0,4}($value\/$value\/S)\b/g){
                $count++;
            }
            my $arrkey=$key."-".$value;
            $arrayris{$arrkey}=$count;
        }
    }
    while ( my ($k,$v) = each %arrayris ) {
        print $conteggio "($k) => $v\n";
    }
    close $testo;
    close $conteggio;

    The problem is that the output is a nice list and all, but it's not possible that every pair has exactly the same number of occurrences with the relation. And that is exactly what I get, like this:
    ([aA]mplificator[ei]-[Tt]ransistor) => 27
    ([cC]ervello-[Tt]alamo) => 27
    ([Ee]ucariot[ia]-[Mm]embran[ae]) => 27
    ([Cc]erio-[Ii]sotop[oi]) => 27
    ([Cc]ellul[ae]-[Nn]ucle[oi]) => 27
    ([Tt]ronco-[Tt]orace) => 27
    ([Bb]raccio-[Aa]vambraccio) => 27
    It says 27 even for pairs that never appear in the same string with the given relation.
    Thanks everyone for the help! Every suggestion is very well appreciated!

      Did you see and consider my reply (though it's based upon u671296's code)?

        No. I just saw and tried it. It works now, perfectly fine. Thank you so much! You have no idea how much I appreciate it!