Problem in counting the occurrences of a string in a text file

findtheriver has asked for the wisdom of the Perl Monks concerning the following question:

Hi everybody!
I've got a text file like this one:

1) atomo/atomo/S   * senza/senza/E   * nucleo/nucleo/S
2) chitarra/chitarra/S   * a/a/E   * corde/corda/S
3) coltello/coltello/S   * dalla/da/E   * lama/lama/S
4) edificio/edificio/S   * ad/ad/E   * facciata/facciata/S
5) biciclette/bicicletta/S   * a/a/E   * ruote/ruota/S
6) computer/computer/S   * con/con/E   * processore/processore/S
7) chiesa/chiesa/S   * con/con/E   * absidi/abside/S
8) opera/opera/S   * con/con/E   * volumi/volume/S
9) strada/strada/S   * a/a/E   * carreggiate/carreggiata/S
10) chitarra/chitarra/S  .* a/a/E  .* corde/corda/S
11) edificio/edificio/S  .* con/con/E  .* facciata/facciata/S
12) Codice/codice/S  .* scritto/scrivere/V   sulle/su/E  .* lettere/lettera/S
13) computer/computer/S  .* basati/basare/V   su/su/E  .* processore/processore/S
14) chiesa/chiesa/S  .* con/con/E  .* absidi/abside/S
15) opera/opera/S  .* con/con/E  .* volumi/volume/S
16) strada/strada/S  .* a/a/E  .* carreggiate/carreggiata/S
17) atomo/atomo/S  .* senza/senza/E  .* nucleo/nucleo/S
18) coltello/coltello/S  .* dalla/da/E  .* lama/lama/S
19) biciclette/bicicletta/S  .* a/a/E  .* ruote/ruota/S
20) coltello/coltello/S  .* a/a/E  .* lama/lama/S
21) codice/codice/S  .* di/di/E  .* lettere/lettera/S
22) biciclette/bicicletta/S  .* a/a/E  .* ruote/ruota/S
23) testa/testa/S  .* di/di/E  .* fronte/fronte/S

The first and last "unit" (by unit I mean anything of the form word/word/TOW) are also stored in a second text file, where they are written as pairs, like this:

[Nn]ucle[oi]:[Pp]roton[oi]
OCS:chip
[Ff]otosistema:LHC
N2:[aA]zoto
[Cc]enobio:[Cc]appell[ae]
[Ee]sercit[oi]:[Ll]egion[ie]
[Tt]erreno:sabbia
[Ll]attosio:[Gg]lucosio
[Cc]odic[ei]:[Ll]etter[ae]
[aA]ttinio:[Ii]sotop[oi]
[Cc]erio:[Ii]sotop[oi]

What I'd like to do is count every time a given relation, let's say con/con/E, appears with each pair of words.
In other words, what I expect to obtain is a text file like this: [Nn]ucle[io]-[Pp]roton[ei]-->4, where 4 is the number of times the pair is seen with the given relation.
This is what I did:

#!/usr/bin/perl
use strict;
use warnings;
open my $listaParole,"File_Input/Coppie_Parole.txt" or die;
 
my %hash;
while (my $line=<$listaParole>) {
chomp $line;
my ($word1, $word2) = split /:/, $line;

$hash{$word1} = $word2;
}

open my $input, "<Wiki_Pulito/Prova/Pattern2.txt" or die;  
# Carico la parte di file di testo che va analizzata (load the part of the text file to be analysed)
open my $conteggio, ">Wiki_Pulito/Prova/Conteggio.txt";
# Apro il file di output (open the output file)
my $conto=0;
my %arrayris;
    while (my $text=<$input>){
     for my $key (keys %hash){
     my $value = $hash{$key};
    while ($text =~/(($key\/$key\/S)\s{0,2}(\.\*)\s{0,2}(con\/con\/E)\s{0,2}(\.\*)\s{0,2}($value\/$value\/S))/is){
     $conto++;
}
     my $arrkey=$key."-".$value;
     $arrayris{$arrkey}=$conto;


      
}
}
while ( my ($k,$v) = each %arrayris ) {
    print $conteggio "($k) => $v\n";
  
}
close $input;
close $conteggio;

but something is wrong, since all I get is a series of 0s.
I'm sorry if I haven't explained my problem very well, but I'm Italian.
Also, I've only been using Perl for a little while, and I'm pretty new to programming in general.
Thanks, everyone, for your help.


Replies are listed 'Best First'.
Re: Problem in counting the occurrences of a string in a text file by linuxer (Curate) on Dec 29, 2008 at 15:21 UTC
- please use a consistent way of indentation; it makes the code more readable. I think perlstyle has something about that.
- the mix of English and Italian variable names doesn't help to understand the code
- the Italian comments don't help those who don't speak Italian
- ~~your regex cannot match, as you use `(\.\+)` in it, but where in your data source is the string `.`?~~
- Is it correct that, in your data source, some lines use '*' and others '.*' to separate the items on a line?
- you use `\s{0,2}` in your regex, making it something like "string(0-2 whitespaces)string", but I saw something like "string(3 whitespaces)string" in your data source; maybe you should consider using `\s*`

Update: changed the question; added the point about the whitespace regex; fixed a typo
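The last two points could be covered by a single separator pattern. What follows is only a sketch under assumptions: `line_has_relation` is a hypothetical helper, and the units are compared as literal strings (via `\Q...\E`), whereas the pair file in the question holds regex fragments such as `[Cc]odic[ei]`.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): true if $line links $first and
# $second through $relation, whether the separator is "*" or ".*" and however
# many whitespace characters surround it.
sub line_has_relation {
    my ($line, $first, $relation, $second) = @_;
    my $sep = qr{\s*\.?\*\s*};    # optional dot, literal star, any whitespace
    # /x lets the pattern be spaced out; \Q...\E keeps the units literal.
    return $line =~ m{\Q$first\E $sep \Q$relation\E $sep \Q$second\E}x ? 1 : 0;
}
```

Called with `'computer/computer/S'`, `'con/con/E'` and `'processore/processore/S'`, the helper accepts line 6 of the sample data (three spaces and `*`) as well as the `.*` form used from line 10 onwards.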
Re^2: Problem in counting the occurrences of a string in a text file by Anonymous Monk on Dec 29, 2008 at 15:34 UTC
perltidy helps with the formatting
Re: Problem in counting the occurrences of a string in a text file by u671296 (Sexton) on Dec 29, 2008 at 16:22 UTC
Hi,

It would help if you presented less data and if your examples reflected the data you are actually using. Some initial thoughts:

1) The while loop around the regex will be infinite if a match is found. I think that should read
    $conto++ if $value =~/(($key\/$key\/S)\s{0,2}(\.\*)\s{0,2}(con\/con\/E)\s{0,2}(\.\*)\s{0,2}($value\/$value\/S))/is;
2) The while, open and close statements can be improved; see below.
3) You are inconsistent with your die and close statements, perhaps because you haven't got round to tidying them up yet.
4) The con\/con\/E part of the regex probably needs to be in a variable, so that you can loop through the other possibilities, e.g. "dalla/da/E".
5) As your regex ignores case, the [Nn] option (for example) is redundant.

The following code seems to work and incorporates some of the above points. I've also simplified the regex, as the example data works with this regex.

    #!/usr/bin/perl
    use strict;
    use warnings;

    open( INPUT, "<Wiki_Pulito/Prova/Pattern2.txt") or die "Can't open Pattern2.txt";
    open( LISTAPAROLE,"<File_Input/Coppie_Parole.txt") or die "Can't open Coppie_Parole.txt";

    my %hash;
    while (<INPUT>) {
        chomp;
        my ($word1, $word2) = split /:/, $_;
        $hash{$word1} = $word2;
    }
    close INPUT;

    # Carico la parte di file di testo che va analizzata (load the part of the text file to be analysed)
    open( CONTEGGIO, ">Wiki_Pulito/Prova/Conteggio.txt") or die "Can't open Conteggio.txt";
    # Apro il file di output (open the output file)
    my $conto=0;
    my %arrayris;
    while (my $text = <LISTAPAROLE>){
        for my $key (keys %hash){
            my $value = $hash{$key};
            if ($text =~/$key\/$key\/.*con\/con\/E.*$value\/$value\/S/is){
                $conto++;
            }
            my $arrkey=$key."-".$value;
            $arrayris{$arrkey}=$conto;
        }
    }
    while ( my ($k,$v) = each %arrayris ) {
        print CONTEGGIO "($k) => $v\n";
    }
    close LISTAPAROLE;
    close CONTEGGIO;
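Points 4 and 5 might look like this in practice. This is a sketch, not the code posted above: `count_by_relation` and its arguments are hypothetical names, and the lines are passed in as an array reference instead of being read from the files.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): for one word pair, count the
# lines that contain each relation. $first and $second are used as regex
# fragments; each relation is matched literally.
sub count_by_relation {
    my ($lines, $first, $second, $relations) = @_;
    my %count;
    for my $relation (@$relations) {
        # The relation lives in a variable, so the same pattern serves every
        # relation; /i makes alternatives such as [Nn] unnecessary.
        my $re = qr{$first/$first/S.*\Q$relation\E.*$second/$second/S}i;
        # grep in scalar context returns the number of matching lines.
        $count{$relation} = grep { $_ =~ $re } @$lines;
    }
    return \%count;
}
```

For instance, `count_by_relation(\@lines, 'coltello', 'lama', ['a/a/E', 'dalla/da/E'])` returns a hash reference keyed by relation, so each relation gets its own count.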
Re^2: Problem in counting the occurrences of a string in a text file by u671296 (Sexton) on Dec 29, 2008 at 16:42 UTC
OK, a typo in the previous reply.

    $conto++ if $value =~/(($key\/$key\/S)\s{0,2}(\.\*)\s{0,2}(con\/con\/E)\s{0,2}(\.\*)\s{0,2}($value\/$value\/S))/is;

should read

    $conto++ if $text =~/(($key\/$key\/S)\s{0,2}(\.\*)\s{0,2}(con\/con\/E)\s{0,2}(\.\*)\s{0,2}($value\/$value\/S))/is;
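For reference, the statement-modifier `if` counts at most one match per line. If a line could contain the pattern more than once, a `while` loop with the `/g` modifier counts every match and still terminates. A minimal sketch, where `count_matches` is a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count every non-overlapping
# match of $re in $text.
sub count_matches {
    my ($text, $re) = @_;
    my $n = 0;
    # With /g in scalar context each test resumes after the previous match
    # and eventually fails, so the loop ends. Without /g the first match
    # would be found again on every test and the loop would never finish.
    $n++ while $text =~ /$re/g;
    return $n;
}
```

`count_matches('con/con/E x con/con/E', qr{con/con/E})` finds both occurrences, whereas an `if` around the same match would add only one.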
Re^2: Problem in counting the occurrences of a string in a text file by findtheriver (Initiate) on Dec 29, 2008 at 21:20 UTC
Thanks for the comment. But if I try your code, I get the same result as with mine: every pair has the same number of occurrences, and that just isn't possible. I'm sorry I can't make it clearer; I just don't know how to explain it.
Re^3: Problem in counting the occurrences of a string in a text file by linuxer (Curate) on Dec 29, 2008 at 21:31 UTC
I think your problem is that you have only one counter variable, which is increased for every individual match. Consider something like this:

    #my $conto = 0;  #### REMOVED; not needed
    my %arrayris;
    while (my $text = <LISTAPAROLE>){
        for my $key (keys %hash){
            my $value = $hash{$key};
            if ($text =~/$key\/$key\/.*con\/con\/E.*$value\/$value\/S/is){
                ### increase for each key/value pair individually
                $arrayris{ join '-', $key, $value }++;
            }
        }
    }
    while ( my ($k,$v) = each %arrayris ) {
        print CONTEGGIO "($k) => $v\n";
    }
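The same idea as a self-contained sketch that can be run without the input files. The sample lines and pairs are taken from the question; `count_pairs` is a hypothetical name, and initialising every pair to 0 (so that pairs which never match still appear in the output) is an addition that the snippet above does not make.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count, for each pair, the lines
# in which $relation links the pair. $pairs maps a first-word pattern to a
# second-word pattern, as in Coppie_Parole.txt.
sub count_pairs {
    my ($lines, $pairs, $relation) = @_;
    my %count;
    for my $text (@$lines) {
        for my $key (sort keys %$pairs) {
            my $value = $pairs->{$key};
            $count{"$key-$value"} //= 0;    # list pairs that never match as 0
            # One hash slot per pair, so a match only bumps its own counter.
            $count{"$key-$value"}++
                if $text =~ m{$key/$key/S.*\Q$relation\E.*$value/$value/S}i;
        }
    }
    return \%count;
}

my @sample = (
    'computer/computer/S  .* con/con/E  .* processore/processore/S',
    'computer/computer/S  .* con/con/E  .* processore/processore/S',
    'chiesa/chiesa/S  .* con/con/E  .* absidi/abside/S',
    'coltello/coltello/S  .* a/a/E  .* lama/lama/S',
);
my %sample_pairs = (
    'computer'  => 'processore',
    '[Cc]hiesa' => 'absid[ie]',
    'coltello'  => 'lama',
);

my $result = count_pairs(\@sample, \%sample_pairs, 'con/con/E');
print "($_) => $result->{$_}\n" for sort keys %$result;
```

With this sample, computer-processore is counted twice, [Cc]hiesa-absid[ie] once, and coltello-lama stays at 0 because its only line uses a/a/E.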
Re: Problem in counting the occurrences of a string in a text file by Anonymous Monk on Dec 29, 2008 at 21:42 UTC
I tried following your suggestions and working on the code, but I can't make it work. What I'd like to do is count every occurrence of the given relation (con/con/E is a relation) with every pair of words I have. If my file contains 3 occurrences of the string

    computer/computer/S .* con/con/E .* processore/processore/S

I'd like my output to look like this:

    computer-processore)-->3

The new code I wrote is this (I tried to make it more readable and, by the way, I also fixed the fact that my input file had some lines with just * and others with .*):

    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $listaParole,"File_Input/Coppie_Parole.txt" or die;

    my %hash;
    while (my $line=<$listaParole>) {
        chomp $line;
        my ($word1, $word2) = split /:/, $line;
        $hash{$word1} = $word2;
    }
    close $listaParole;

    open my $testo, "<Wiki_Pulito/Prova/Pattern2.txt" or die;
    open my $conteggio, ">Wiki_Pulito/Prova/Conteggio1.txt" or die;
    my $count=0;
    my %arrayris;
    while (my $text=<$testo>){
        for my $key (keys %hash){
            my $value = $hash{$key};
            if ($text =~/($key\/$key\/S)\s{0,4}(\.\*)\s{0,4}(con\/con\/E)\s{0,4}(\.\*)\s{0,4}($value\/$value\/S)\b/g){
                $count++;
            }
            my $arrkey=$key."-".$value;
            $arrayris{$arrkey}=$count;
        }
    }
    while ( my ($k,$v) = each %arrayris ) {
        print $conteggio "($k) => $v\n";
    }
    close $testo;
    close $conteggio;

The problem is that the output is a nice list and all, but it's not possible that every pair and relation has exactly the same number of occurrences. And that is exactly what I get, like this:

    ([aA]mplificator[ei]-[Tt]ransistor) => 27
    ([cC]ervello-[Tt]alamo) => 27
    ([Ee]ucariot[ia]-[Mm]embran[ae]) => 27
    ([Cc]erio-[Ii]sotop[oi]) => 27
    ([Cc]ellul[ae]-[Nn]ucle[oi]) => 27
    ([Tt]ronco-[Tt]orace) => 27
    ([Bb]raccio-[Aa]vambraccio) => 27

It says 27 even for pairs that never appear in the same string with the given relation. Thanks, everyone, for the help! Every suggestion is much appreciated!
Re^2: Problem in counting the occurrences of a string in a text file by linuxer (Curate) on Dec 29, 2008 at 21:51 UTC
Did you see and consider my reply (though it's based upon u671296's code)?
Re^3: Problem in counting the occurrences of a string in a text file by Anonymous Monk on Dec 29, 2008 at 22:00 UTC
No. I just saw and tried it. It works now, perfectly fine. Thank you so much! You have no idea how much I appreciate it!