in reply to Re: Counting occurence of a list of word in a file
in thread Counting occurence of a list of word in a file

You're totally right!
I need to know how many times each word in the first column of the list occurs in my text file. Let's say I have these two pairs:
[Aa]tom[oi]:[Nn]ucle[oi]
[Nn]ucle[oi]:[Pp]roton[ei]
and this text here:
In particolare, l ' atomo è composto da un nucleo carico positivamente e da un certo numero di elettroni, carichi negativamente, che gli vibrano attorno senza un ' orbita precisa (l ' elettrone si dice infatti delocalizzato), nei cosiddetti gusci elettronici. Il nucleo è composto da protoni, che sono particelle cariche positivamente e da neutroni che sono particelle prive di carica: protoni e neutroni sono detti nucleoni. In proporzione se si considera il nucleo grande come una mela, gli elettroni gli ruotano attorno ad una distanza pari a circa un chilometro; viceversa un nucleone ha massa quasi 1800 volte superiore a quella di un elettrone.
I'd like it to count how many times [Aa]tom[oi] occurs in this text, and how many times [Nn]ucle[oi] does.
The [] stuff is because I have another script that searches for these words in a text, and I need it to retrieve every word that can be made with these characters: Nucleo, nucleo, Nuclei, nuclei.
I hope that is clearer now.

Re^3: Counting occurence of a list of word in a file
by gone2015 (Deacon) on Nov 11, 2008 at 17:43 UTC

    OK. So the core of what you have:

    while (my $text=<$testo>){
        for my $key (keys %hash){
            my $value = $hash{$key};
            my $arrkey=$key." ";
            my $count = 0;
            $count += () = /\b$key\b/ig while <>;
            print $conteggio "$arrkey) => $count\n";
        } ;
    } ;
    reads the file line by line into $text, and then tries each column 1 word (captured earlier as the keys of your %hash). The regex /\b$key\b/ig is plausible, and the [oi] stuff will do what you want -- the [Nn] will also work, but is redundant because of the i modifier on the regex.
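
    To illustrate the point about the i modifier, here is a minimal sketch. The sample sentence and the helper name count_ci are made up for the example; they are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: count case-insensitive whole-word matches of a pattern.
sub count_ci {
    my ($text, $pattern) = @_;
    my @matches = $text =~ /\b$pattern\b/ig;   # list context: all matches
    return scalar @matches;
}

# Assumed sample line, not taken from the real input file.
my $line = "Il nucleo, i Nuclei e il NUCLEO";

print count_ci($line, '[Nn]ucle[oi]'), "\n";   # prints 3
print count_ci($line, 'nucle[oi]'),    "\n";   # prints 3 as well
```

    Note that with the i modifier the all-capitals NUCLEO is counted too, which [Nn] on its own would not match. If that is not wanted, dropping the i and keeping the brackets is the other option.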

    The rest is, frankly, a dog's breakfast and can be thrown away.

    To count the number of times you get a match in each line,

    my $count = () = $text =~ /\b$key\b/ig ;
    is sufficient, but fairly deep magic. This:
    my $count = 0 ; while ($text =~ /\b$key\b/ig) { $count++ ; } ;
    may or may not seem clearer.
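
    For what it is worth, the two forms can be compared side by side. This is only a sketch with an assumed sample string; the "magic" in the first form is that a list assignment evaluated in scalar context returns the number of elements on its right-hand side.

```perl
use strict;
use warnings;

# Assumed sample data for the comparison.
my $text = "il nucleo e i nuclei: un nucleo";
my $key  = 'nucle[oi]';

# List assignment in scalar context yields the number of elements
# produced by the right-hand side, i.e. the number of matches.
my $idiom_count = () = $text =~ /\b$key\b/ig;

# Explicit loop: in scalar context, //g finds one match per iteration.
my $loop_count = 0;
$loop_count++ while $text =~ /\b$key\b/ig;

print "$idiom_count $loop_count\n";   # prints "3 3"
```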

    Now your problem is how to collect the count for each word across all the lines of your input. I suggest using the value part of your hash entries to hold the count for the word in the key part.

    When the while loop has finished, your hash should contain the count for each word, which you can then output to $conteggio.
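
    A rough sketch of that idea follows. The in-memory @lines array and the %count hash are stand-ins invented for the example; in the real script the lines would come from the input file handle and the keys from the word list.

```perl
use strict;
use warnings;

# Assumed sample data: keys are the column 1 patterns, values are the counts.
my %count = map { $_ => 0 } ('[Aa]tom[oi]', '[Nn]ucle[oi]');
my @lines = (
    "l'atomo ha un nucleo",
    "il nucleo contiene protoni",
);

for my $text (@lines) {
    for my $key (keys %count) {
        my $n = () = $text =~ /\b$key\b/ig;   # matches on this line
        $count{$key} += $n;                   # running total across lines
    }
}

# Output once, after all lines have been read.
print "$_ => $count{$_}\n" for sort keys %count;
# prints:
# [Aa]tom[oi] => 1
# [Nn]ucle[oi] => 2
```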

      Thanks! It's working fine now! I'll post the code I used, so you can all give me suggestions for improving my programming skills (I'm sorry if they're not good, but I've been using Perl for just a few weeks, for my thesis, and it's all really new), or maybe it could be useful for someone with the same problem.
      open my $testo, "<File_Input/Testo.txt";
      open my $conteggio, ">File_Output/Conteggio.txt";
      my %arrayris;
      while (my $text=<$testo>){
          for my $key (keys %hash){
              my $value = $hash{$key};
              my $count = 0 ;
              while ($text =~ /\b$key\b/ig) { $count++ ; } ;
              $arrayris{$key}=$count;
          }
      }
      while ( my ($k,$v) = each %arrayris ) {
          print $conteggio "($k) => $v\n";
      }
      close $testo;
      close $conteggio;

      Thanks again, to everybody!

        It's an improvement !

        If your File_Input/Testo.txt file contains more than one line, then I suggest

        $arrayris{$key} += $count ;
        will produce a more complete result. (Perl will happily create a hash entry with (effectively) a zero value when required.)
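
        The difference between = and += only shows up with more than one line of input. A small sketch, using two made-up lines and two hypothetical hashes to hold the results of each variant:

```perl
use strict;
use warnings;

# Assumed two-line input: all matches are on the first line.
my @lines = ("un nucleo e un altro nucleo", "nessuna corrispondenza qui");
my $key   = 'nucle[oi]';
my (%overwritten, %accumulated);

for my $text (@lines) {
    my $count = () = $text =~ /\b$key\b/ig;
    $overwritten{$key}  = $count;    # keeps only the last line's count
    $accumulated{$key} += $count;    # a missing entry is treated as 0
}

print "overwritten: $overwritten{$key}, accumulated: $accumulated{$key}\n";
# prints "overwritten: 0, accumulated: 2"
```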

        You could also consider counting directly in your %arrayris:

        while ($text =~ /\b$key\b/ig) { $arrayris{$key}++ ; } ;

        Other things you might consider:

        • why you read the words into a hash (your %hash) when you only really use the keys... it may be preparatory to some future extension, I cannot tell.

        • similarly what is:

          my $value = $hash{$key};
          doing to justify its existence.

        • I recommend use strict; and use warnings; -- they will help you keep out of trouble!

        • the code is definitely "quick and dirty". If either your wordlist or your input is very long, you may want to speed things up... But the first rule of optimisation is: don't do it (unless you really have to).
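
        If a speed-up ever does become necessary, one possible first step (an assumption on my part, not something measured on your data) is to compile each pattern once with qr// before the line loop, instead of interpolating the string into a regex for every line. The names %regex_for and count_line below are invented for this sketch.

```perl
use strict;
use warnings;

# Assumed word list; in the real script it would be read from the list file.
my @words = ('[Aa]tom[oi]', '[Nn]ucle[oi]');

# Compile each pattern once, outside the per-line loop.
my %regex_for = map { $_ => qr/\b$_\b/i } @words;
my %count     = map { $_ => 0 } @words;

# Hypothetical helper: add the matches found in one line to %count.
sub count_line {
    my ($text) = @_;
    for my $word (@words) {
        my $re = $regex_for{$word};
        $count{$word}++ while $text =~ /$re/g;
    }
    return;
}

count_line("l'atomo ha un nucleo; i nuclei contengono protoni");
print "$_ => $count{$_}\n" for @words;
# prints:
# [Aa]tom[oi] => 1
# [Nn]ucle[oi] => 2
```

        Whether this makes a noticeable difference would need to be checked with a benchmark on the real files.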

Re^3: Counting occurence of a list of word in a file
by toolic (Bishop) on Nov 11, 2008 at 18:13 UTC
    If you cannot figure out what some code is doing, you could break it up into smaller, isolated chunks (i.e., subs):
    use strict;
    use warnings;

    my %patts;
    my %counts;
    my $file = 'patterns.txt';
    my $fh;

    open $fh, '<', $file or die "can not open file $file: $!";
    while (<$fh>) {
        chomp;
        my ($word1, $word2) = split /:/;
        $patts{$word1} = $word2;
    }
    close $fh;

    $file = 'text.txt';
    open $fh, '<', $file or die "can not open file $file: $!";
    while (<$fh>) {
        chomp;
        for my $patt (keys %patts) {
            $counts{$patt} += count_match($_, $patt);
        }
    }
    close $fh;

    for my $patt (keys %counts) {
        print "pattern $patt occurs $counts{$patt} times\n";
    }

    sub count_match {
        my ($str, $regex) = @_;
        my @words = split /\s+/, $str;
        my $count = 0;
        for (@words) { $count++ if /\b$regex\b/ }
        return $count;
    }

    __END__
    pattern [Aa]tom[oi] occurs 1 times
    pattern [Nn]ucle[oi] occurs 3 times