in reply to a question about making a word frequency matrix

Are you looking for something like this?

    use warnings;
    use strict;

    # A word is a run of alphanumerics not adjacent to other alphanumerics.
    my $word = qr/(?<!\p{Alnum})\p{Alnum}+(?!\p{Alnum})/;
    my %count;
    my $counter;

    # Count each word; the optional ('$word) keeps forms like "don't" whole.
    while (my $line = <DATA>) {
        while ($line =~ /($word('$word)?)/g) {
            $count{$1}++;
        }
    }

    # Most frequent first; ties are sorted alphabetically, ignoring case.
    for (sort { $count{$b} <=> $count{$a} || lc $a cmp lc $b } keys %count) {
        printf "%15s %5d\n", $_, $count{$_};
        last if ++$counter >= 100;    # stop after the top 100 words
    }

    __DATA__
    "Hello World!"
    "Oh poor Yorick, his world I knew well yes I did"
    "don't won't, can't shouldn't, you'll, it's, etc."
    "Señor Montóya's resüme isn't ápropos."
    the, the, the, the, the, the, the, the, the, the

It isn't very clear what you mean by "words-by-words matrix".

Re^2: a question about making a word frequency matrix
by thundergnat (Deacon) on Dec 07, 2005 at 20:17 UTC

    Ah. You've clarified what you mean a bit.

    OK, here's a simple version that is limited to the top five words (so the table fits across one standard terminal screen). Adjust $limit and redirect the output to a file for larger numbers.

    Not necessarily the best way, but not too bad:

        use warnings;
        use strict;

        $/ = '';    # paragraph mode: each read returns one blank-line-separated block

        my $word = qr/(?<!\p{Alnum})\p{Alnum}+(?!\p{Alnum})/;
        my %count;
        my $paragraphs;
        my $counter;
        my @results;
        my $limit = 5;

        # Count each word overall and per paragraph ($. is the paragraph number).
        while ( my $line = <DATA> ) {
            while ( $line =~ /($word('$word)?)/g ) {
                $count{$1}{count}++;
                $count{$1}{$.}++;
                $paragraphs = $.;
            }
        }

        # Keep the $limit most frequent words; ties are sorted alphabetically.
        for ( sort { $count{$b}{count} <=> $count{$a}{count} || lc $a cmp lc $b }
            keys %count )
        {
            last if ++$counter > $limit;
            push @results, $_;
        }

        # Header row, totals row, then one row per paragraph.
        print ' ' x 12;
        printf "|%12s", $_ for @results;
        print "\n";
        print 'Total count:';
        printf "|%12s", $count{$_}{count} for @results;
        print "\n";
        print '-' x ( 13 * ( $limit + 1 ) ), "\n";
        for my $line ( 1 .. $paragraphs ) {
            printf "Prgrph %4s:", $line;
            printf "|%12s", $count{$_}{$line} || '0' for @results;
            print "\n";
        }

        __DATA__
        "Hello World!"
        "Oh poor Yorick, his world I knew well yes I did"
        "don't won't, can't shouldn't, you'll, it's, etc."
        "Señor Montóya's resüme isn't ápropos."
        the, the, the, the, the, the, the, the, the, the

        "Hello World!"
        "Oh poor Yorick, his world I knew well yes I did"
        "don't won't, can't shouldn't, you'll, it's, etc."
        "Señor Montóya's resüme isn't ápropos."
        the, the, the, the, the, the, the, the, the, the

        "Hello World!"
        "Oh poor Yorick, his world knew well yes did"
        "don't won't, can't shouldn't, you'll, it's, etc."
        "Señor Montóya's resüme isn't ápropos."
        the, the, the, the, the, the, the, the, the, the
Re^2: a question about making a word frequency matrix
by peacekorea (Novice) on Dec 07, 2005 at 19:45 UTC
    Sorry for my ambiguous question. I have just corrected it. Thank you.
Re^2: a question about making a word frequency matrix
by ambrus (Abbot) on Dec 07, 2005 at 23:07 UTC

    Just let me note that if you don't define the encoding of the filehandle you're reading from (DATA here), the strings you read in will be byte strings, and matching a Unicode class such as /\p{Alnum}/ won't make much sense on them. In this case, perl will act as if the string were iso_8859_1-encoded. (You can call this a bug or a feature.) This might not work with text in a different encoding, such as iso_8859_2. It will accidentally work with Hungarian text encoded as iso_8859_2, as the only Hungarian letters not in 8859_1 are \x{151}, \x{150}, \x{171}, \x{170}, which sit at positions \xf5, \xd5, \xfb, \xdb, and those are letters (although different letters) in 8859_1. However, other languages use letters such as \x{15b}, which is encoded in 8859_2 as \xb6, and that's a non-alnum symbol in 8859_1. With other encodings, such as utf-8, you'll probably have even more serious failures.
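
    A minimal sketch of that effect, assuming a reasonably modern perl. The helper name is_alnum is made up for illustration, and the bytes are written as escapes so the encoding of the script file itself doesn't matter:

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the single character matches \p{Alnum}, else 0.
sub is_alnum {
    my ($char) = @_;
    return $char =~ /\A\p{Alnum}\z/ ? 1 : 0;
}

# Undecoded ISO-8859-2 bytes, with the letter each one was meant to be.
my @samples = (
    [ "\xF5", 'U+0151, Hungarian o with double acute' ],
    [ "\xFB", 'U+0171, Hungarian u with double acute' ],
    [ "\xB6", 'U+015B, s with acute' ],
);

# Without decoding, perl sees code points U+00F5, U+00FB and U+00B6.
# The first two happen to be (different) letters; the last is a pilcrow sign.
for my $sample (@samples) {
    my ( $byte, $meaning ) = @$sample;
    printf "byte 0x%02X (%s): %s\n", ord $byte, $meaning,
        is_alnum($byte) ? 'matches' : 'does not match';
}
```

    The two Hungarian bytes match only by accident; the \xb6 byte is silently treated as a word separator.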

    If you want to match letters in non-ASCII texts, you have two options. One is to set the encoding of the filehandle with binmode, 3-arg open, the open pragma, the encoding pragma, the -C command line option, the PERLIO env-var, or some other way; or to decode the string with the Encode module after reading. The other is to stay with byte strings, set the correct locale with the environment variables (the locale has information about the character set, such as which chars are alphabetic), add use locale; to make the matching locale-aware, and match for /\w/ or /[[:alnum:]]/.
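
    Here is a rough sketch of the first option, done both ways: an :encoding layer given to 3-arg open, and Encode::decode after reading. The sub names, the simplified word pattern, and the sample bytes (a Polish word in iso_8859_2) are made up for illustration, and an in-memory filehandle stands in for a real file. The locale route is left out because it depends on which locales are installed on your system.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $word = qr/\p{Alnum}+/;    # simplified word pattern for this sketch

# Hypothetical helper: read octets through an :encoding layer, as 3-arg
# open would for a real file, and return the words found.
sub words_via_layer {
    my ( $octets, $encoding ) = @_;
    open my $fh, "<:encoding($encoding)", \$octets or die "open: $!";
    my @words;
    while ( my $line = <$fh> ) {
        push @words, $line =~ /($word)/g;
    }
    close $fh;
    return @words;
}

# Hypothetical helper: decode the octets after reading them, then match.
sub words_via_decode {
    my ( $octets, $encoding ) = @_;
    my $text = decode( $encoding, $octets );
    return $text =~ /($word)/g;
}

# Five letters in iso_8859_2: s-acute (byte 0xB6) followed by "wiat".
my $octets = "\xB6wiat\n";

my @raw     = $octets =~ /($word)/g;
my @layered = words_via_layer( $octets, 'iso-8859-2' );
my @decoded = words_via_decode( $octets, 'iso-8859-2' );

# Print lengths rather than the words to avoid "Wide character" warnings.
printf "raw bytes: first word has %d chars\n", length $raw[0];
printf "layer:     first word has %d chars\n", length $layered[0];
printf "decode:    first word has %d chars\n", length $decoded[0];
```

    On the undecoded bytes the first letter is dropped from the word, while both decoding routes keep all five characters.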

    Update: for peacekorea: please don't let this discussion confuse or frighten you; it's not really important for the original goal. I'd just like to spread information about internationalization for the American monk who naïvely thinks other languages all use 8859_1 with just a handful of accented letters.

      Quote:
      I'd just like to spread information about internationalization for the American monk who naïvely thinks other languages all use 8859_1 with just a handful of accented letters.

      Well, that was a pretty far leap. It's true that if you use this script, as written, to read files that aren't in the encoding it expects, you will almost certainly end up with wrong results. Perhaps I should have mentioned that. But in looking through my post, I can't find the spot where I say "This is the best and only way to do this, and it will deal with all possible data sets without modification."

      It was a 5 minute throw away script I just tossed off to give an idea of how the problem could be approached. Sorry I can't live up to the high standards of ambrus who naively believes that every quick and dirty one-off script should be perfect in every way and cover every eventuality.

        Quote:
        I'd just like to spread information about internationalization for the American monk who naïvely thinks other languages all use 8859_1 with just a handful of accented letters.

        I apologize. This was indeed a bit rude of me.

        I admit this encoding problem is just a minor nit, and that it's not central to the OP's problem. There's just one reason I mentioned it: you included accented characters in your examples.

        Quote:
        It was a 5 minute throw away script I just tossed off to give an idea of how the problem could be approached. Sorry I can't live up to the high standards of ambrus who naively believes that every quick and dirty one-off script should be perfect in every way and cover every eventuality.

        Very true. I often fall into this mistake.