peacekorea has asked for the wisdom of the Perl Monks concerning the following question:

* Sorry, this question has just been revised. Thank you very much! Hi, thank you for your interest in my question. The data is a text file (*.txt), and the result should look like this:
             I  you  find  WMD  iraq  Holy war ....
paragraph 1      3     7    10
paragraph 2            10        5     7
paragraph 3            5               
paragraph 4  7
paragraph 5                                 10
paragraph 6
The first row is the 100 most frequent words in the text file. The number in each cell is the frequency of that word in that paragraph. The data is a single text file (.txt) with more than 100000 words. Each "paragraph" is delimited by "-100". I need a matrix which shows the frequency of each word in each paragraph. I thought it was a words-by-words matrix, but that was wrong; it's a words-by-paragraph matrix. Thank you very much! Peace of the world.

Replies are listed 'Best First'.
Re: a question about making a word frequency matrix
by Old_Gray_Bear (Bishop) on Dec 07, 2005 at 19:19 UTC
    Based on your statement in the CB that this is *not* homework -- a little more information is needed:

    • A simple tabulation to produce the top 100 is a different beast from a 'word by word matrix' (correlation frequencies?).
    • How are you getting these words -- from a document (in which case the correlation-matrix idea makes sense), or from a file (and if it's one word per line, sort() and uniq() are your friends)?

    ----
    I Go Back to Sleep, Now.

    OGB

      This message has been deleted to prevent confusion. Thank you.
Re: a question about making a word frequency matrix
by ikegami (Patriarch) on Dec 07, 2005 at 19:37 UTC

    A solution on which you can build:

    sub min2 { $_[0] < $_[1] ? $_[0] : $_[1] }

    my $filename = ...;

    my %ignore = map { lc($_) => 1 } qw(
        a an are i if in is it m on re s to the ...
    );

    open(my $fh_in, '<', $filename)
        or die("Unable to open input file: $!\n");

    my %counts;
    while (<$fh_in>) {
        $_ = lc($_);
        while (/([a-z]+)/g) {
            next if $ignore{$1};
            ++$counts{$1};
        }
    }

    my @words_ordered = sort { $counts{$b} <=> $counts{$a} } keys %counts;

    foreach my $i (0 .. min2(99, $#words_ordered)) {
        my $word = $words_ordered[$i];
        print("Word $word was found $counts{$word} times.\n");
    }

    It's simplistic! For example, "It's Jeff" is broken down into "it", "s" and "jeff".

    Update: Added %ignore. It needs to be better; maybe we could automatically ignore words of length less than 4 unless they were originally uppercase.

Re: a question about making a word frequency matrix
by thundergnat (Deacon) on Dec 07, 2005 at 19:38 UTC

    Are you looking for something like this?

    use warnings;
    use strict;

    my $word = qr/(?<!\p{Alnum})\p{Alnum}+(?!\p{Alnum})/;
    my %count;
    my $counter;

    while (my $line = <DATA>) {
        while ($line =~ /($word('$word)?)/g) {
            $count{$1}++;
        }
    }

    for (sort { $count{$b} <=> $count{$a} || lc $a cmp lc $b } keys %count) {
        printf "%15s %5d\n", $_, $count{$_};
        last if ++$counter > 100;
    }

    __DATA__
    "Hello World!"
    "Oh poor Yorick, his world I knew well yes I did"
    "don't won't, can't shouldn't, you'll, it's, etc."
    "Señor Montóya's resüme isn't ápropos."
    the, the, the, the, the, the, the, the, the, the

    It isn't very clear what you mean by "words-by-words matrix".

      Ah. You've clarified what you mean a bit.

      OK, here's a simple version that is limited to finding the top five (so it will fit across one standard terminal screen). Adjust $limit and redirect to a file for larger numbers.

      Not necessarily the best way, but not too bad:

      use warnings;
      use strict;

      $/ = '';

      my $word = qr/(?<!\p{Alnum})\p{Alnum}+(?!\p{Alnum})/;
      my %count;
      my $paragraphs;
      my $counter;
      my @results;
      my $limit = 5;

      while ( my $line = <DATA> ) {
          while ( $line =~ /($word('$word)?)/g ) {
              $count{$1}{count}++;
              $count{$1}{$.}++;
              $paragraphs = $.;
          }
      }

      for ( sort { $count{$b}{count} <=> $count{$a}{count} || lc $a cmp lc $b } keys %count ) {
          last if ++$counter > $limit;
          push @results, $_;
      }

      print ' ' x 12;
      printf "|%12s", $_ for @results;
      print "\n";
      print 'Total count:';
      printf "|%12s", $count{$_}{count} for @results;
      print "\n";
      print '-' x ( 13 * ( $limit + 1 ) ), "\n";

      for my $line ( 1 .. $paragraphs ) {
          printf "Prgrph %4s:", $line;
          printf "|%12s", $count{$_}{$line} || '0' for @results;
          print "\n";
      }

      __DATA__
      "Hello World!"
      "Oh poor Yorick, his world I knew well yes I did"
      "don't won't, can't shouldn't, you'll, it's, etc."
      "Señor Montóya's resüme isn't ápropos."
      the, the, the, the, the, the, the, the, the, the

      "Hello World!"
      "Oh poor Yorick, his world I knew well yes I did"
      "don't won't, can't shouldn't, you'll, it's, etc."
      "Señor Montóya's resüme isn't ápropos."
      the, the, the, the, the, the, the, the, the, the

      "Hello World!"
      "Oh poor Yorick, his world knew well yes did"
      "don't won't, can't shouldn't, you'll, it's, etc."
      "Señor Montóya's resüme isn't ápropos."
      the, the, the, the, the, the, the, the, the, the
      Sorry for my ambiguous question; I have just corrected it. Thank you.

      Just let me note that if you don't define the encoding of the filehandle you're reading from (DATA here), the strings you read in will be byte strings, and matching a Unicode class such as /\p{Alnum}/ won't make much sense on them. In this case, perl will act as if the string were iso_8859_1-encoded. (You can call this a bug or a feature.) This might not work with text in a different encoding, such as iso_8859_2. It will accidentally work with Hungarian text encoded as iso_8859_2, as the only Hungarian letters not in 8859_1 are \x{151}, \x{150}, \x{171}, \x{170}, which sit at positions \xf5, \xd5, \xfb, \xdb -- letters (although different letters) in 8859_1. However, other languages use letters such as \x{15b}, which is encoded in 8859_2 as \xb6, and that's a non-alnum symbol in 8859_1. With other encodings, such as utf-8, you'll probably have even more serious failures.

      If you want to match letters in non-ASCII text, you have two options. One is to set the encoding of the filehandle -- with binmode, a 3-arg open, the encoding pragma, the -C command-line option, the PERLIO env-var, or some other way -- or to decode the string with the Encode module after reading. The other is to stay with byte strings: set the correct locale with the environment variables (the locale carries character-set information, such as which characters are alphabetic), add use locale; to make the matching locale-aware, and match with /\w/ or /[[:alnum:]]/.
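      A minimal sketch of the first option (the sample string is made up; an in-memory handle stands in for a real file): reading through an :encoding layer makes \p{Alnum} see decoded characters instead of raw bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The bytes are deliberately UTF-8-encoded so the decoding layer
# has real work to do.
my $bytes = encode('UTF-8', "Se\x{f1}or resume\n");    # "Señor resume"

# The :encoding layer decodes bytes into characters as we read.
open my $fh, '<:encoding(UTF-8)', \$bytes or die $!;

my @words;
while (my $line = <$fh>) {
    push @words, $line =~ /(\p{Alnum}+)/g;
}
# The n-tilde is matched as part of the word, so @words holds
# two entries: "Señor" and "resume".
```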

      Update: for peacekorea: please don't let this discussion confuse or frighten you; it's not very important for the original goal. I'd just like to spread information about internationalization to the American monk who naïvely thinks other languages are all just 8859_1 with a handful of accented letters.

        Quote:
        I'd just like to spread information about internationalization to the American monk who naïvely thinks other languages are all just 8859_1 with a handful of accented letters.

        Well, that was a pretty far leap. It's true that if you try to read files with this script that aren't in the encoding it expects, then, as it is written, you will almost certainly end up with wrong results. Perhaps I should have mentioned that. But looking through my post, I can't find the spot where I say "This is the best and only way to do this, and it will deal with all possible data sets without modification."

        It was a 5 minute throw away script I just tossed off to give an idea of how the problem could be approached. Sorry I can't live up to the high standards of ambrus who naively believes that every quick and dirty one-off script should be perfect in every way and cover every eventuality.

Re: a question about making a word frequency matrix
by GrandFather (Saint) on Dec 07, 2005 at 20:54 UTC

    This may get you started. Note that it does not find the top 100 words; it counts all words at least four characters long.

    use strict;
    use warnings;
    use Text::ExtractWords qw(words_count);

    my %words;
    my @paraStats;

    local $/ = '-100';

    while (<DATA>) {
        chomp;
        s/[\n\r]/ /g;
        my %paraWordFreq;
        words_count(\%paraWordFreq, $_, minwordlen => 4);
        push @paraStats, {%paraWordFreq};
        map {$words{$_} = 0} keys %paraWordFreq;
    }

    my $col = 0;
    print ' ' x 17;

    for (sort keys %words) {
        next if 4 > length $_;
        $words{$_} = ($col += 1 + length $_);
        print "$_ ";
    }

    my $paraNum = 1;

    for (@paraStats) {
        printf "\nParagraph %5d: ", $paraNum++;
        my $col = 0;
        my %paraWords = %$_;
        for (sort keys %paraWords) {
            next if 4 > length $_;
            printf "%*d ", $words{$_} - $col - 1, $paraWords{$_};
            $col = $words{$_};
        }
    }

    __DATA__
    The first row is top 100 frequent words (in the text file). Numbers in
    each cell shows the frequency of the word in the data. The data is one
    text file (.txt) with more than 100000 words.
    -100
    Each "paragraph" is delimited by "-100". I need a matrix which shows the
    frequency of each word in each paragraph. I thought it is words-by-words
    matrix, but it's wrong. It's words-by-paragraph matrix. Thank you very
    much! Peace of the world.

    Prints (note that the lines are long):

                     100000 cell data delimited each file first frequency frequent it's matrix more much need numbers paragraph peace shows text than thank thought very which with word words words-by-paragraph words-by-words world wrong
    Paragraph     1:      1    1    2              1    2     1         1        1                1                 1                     1    2    1                             1    1     2
    Paragraph     2:                          1    1                                                                          1
    Paragraph     3:                               2                    1             2      3         1    1                 1     1     1               1       1    1     1         1                        1              1     1     1

    DWIM is Perl's answer to Gödel
Re: a question about making a word frequency matrix
by samizdat (Vicar) on Dec 07, 2005 at 20:10 UTC
    Since you are writing your output line by line, you don't need to store a matrix. You need a hash which maintains the number of times the word is found in a given paragraph. Therefore,
    • open the output file
    • write the key line as an ordered set (array) with tab (\t) characters in appropriate places
    • make a hash of all 100 keys with zero as the value
    • then, for each paragraph,
      1. read from input file into a string until you find '-100'
      2. see if each character group (until space, newline, or punctuation) is a key
      3. if so, increment the hash value of that key. If not, ignore
      4. write out the values to the output file as an ordered set
      5. reset the hash values to zero
    • close the files
    You'll want to deal with capitalization somehow, as well. :D
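    The steps above can be sketched roughly like this (the keyword list and the helper count_paragraphs are made up for illustration; a real run would use the top-100 list and read/write actual files):

```perl
use strict;
use warnings;

# Stand-in for the 100 keys; in practice this is the top-100 word list.
my @keywords = qw(find wmd iraq);

# Follow the steps above: split on the "-100" delimiter, zero the hash,
# count only character groups that are keys, and collect one row of
# counts per paragraph.
sub count_paragraphs {
    my ($text) = @_;
    my @rows;
    for my $chunk (split /-100/, $text) {
        my %freq = map { $_ => 0 } @keywords;     # step 5: reset to zero
        for my $token ($chunk =~ /([A-Za-z']+)/g) {
            my $w = lc $token;                    # fold capitalization
            $freq{$w}++ if exists $freq{$w};      # steps 2-3
        }
        push @rows, \%freq;
    }
    return @rows;
}

# Write the key line, then one tab-separated value row per paragraph.
my @rows = count_paragraphs("Find WMD find -100 iraq iraq");
print join("\t", '', @keywords), "\n";
my $n = 0;
for my $row (@rows) {
    print join("\t", 'paragraph ' . ++$n, @{$row}{@keywords}), "\n";
}
```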

    Don Wilde
    "There's more than one level to any answer."
Re: a question about making a word frequency matrix
by Roy Johnson (Monsignor) on Dec 07, 2005 at 19:45 UTC
    The language is Perl, and the interpreter program is perl. PERL is someone speaking too loudly because they don't know the language. You are now initiated into the inner circle. :-)

    As a very general rule, post some code so that we can see what part of the program you are really having trouble with. If you've never written anything at all in Perl, you should probably start with a Tutorial introduction. There are scads of them available.

    You should know how to read a file line by line, which is the usual approach in Perl, unless there's some reason to do something else. You'll then want to extract every word from a line, which is a basic regular expression task, depending on how you define a word. perldoc perlretut will give you ideas there.
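    A minimal sketch of that basic shape (reading from an in-memory string here, and using just one possible definition of a "word"):

```perl
use strict;
use warnings;

my $text = "the quick brown fox\nThe lazy dog\n";
open my $fh, '<', \$text or die $!;    # stands in for a real file

my %count;
while (my $line = <$fh>) {             # read line by line
    while ($line =~ /([A-Za-z]+)/g) {  # one crude definition of a word
        $count{ lc $1 }++;             # fold case while counting
    }
}

print "$_: $count{$_}\n" for sort keys %count;
```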

    For more specific advice on this task, you might find How do I count the frequency of words in a file and save them for later? in the Categorized Questions and Answers section to be helpful.


    Caution: Contents may have been coded under pressure.
      Thank you for your comment. I'm actually studying hashes of arrays after my friend's comment... However, I felt I was still too far from achieving my goal on my own, which is why I asked the monks for a more concrete approach. I really appreciate all the help from you.
Re: a question about making a word frequency matrix
by ambrus (Abbot) on Dec 07, 2005 at 19:53 UTC

    I think you need three separate things for this.

    Firstly, you have to parse words out of the text. This depends a bit on what you count as a word (which depends both on the language and your intent). Some months ago I posted an example at Re: stripped punctuation.

    For the words-by-paragraph matrix, I think you'd need a hash-of-arrays structure. You can find some examples in perldoc perldsc.

    You also need to find the 100 most frequent words. After you've calculated the frequency of every word, you have multiple ways to do this. With a longer text, there is a fast solution at Re: Puzzle: The Ham Cheese Sandwich cut.. However, it's simpler and almost as fast to use a heap: insert the frequencies of the first 100 words into a heap, then alternately insert the frequency of each remaining word and pop the smallest number from the heap. You could use one of the CPAN modules Heap and Heap::Simple (but remember, just because the name of a module is Simple or Light or Lite, it isn't necessarily simpler to use than other modules). Or you can adapt my script at Re: Re: Re: Re: Sorting values of nested hash refs, which is simpler than a generic heap module, as it doesn't include the algorithm to pop a value from a heap. Since then, I've written a better heap implementation which I'll have to post sometime. (Update: posted it, see Binary heap.) Using a heap has the advantage that it can give you the 100 most frequent words sorted by frequency.

    However, the simplest solution is to sort all the distinct words by frequency (as I guess others will recommend). This might be slower than the other solutions but still not very slow, especially because sort is a built-in perl function.
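    With the counts in hand, the sort-based version is only a couple of lines; a sketch with a made-up %count hash:

```perl
use strict;
use warnings;

# Toy frequency table; in a real script this comes from the word count.
my %count = (war => 10, find => 7, peace => 12, you => 3);

# Sort every distinct word by descending frequency and keep at most 100.
my @top = (sort { $count{$b} <=> $count{$a} } keys %count)[0 .. 99];
@top = grep { defined } @top;   # shorter lists leave undef slots behind

print "$_\t$count{$_}\n" for @top;
```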

Re: a question about making a word frequency matrix
by l3v3l (Monk) on Dec 08, 2005 at 00:16 UTC
    I know this is not a working solution to your query, but it is a one-liner that I have used often to get this type of information quickly for any block of text I am dealing with:
    perl -nle '$c{$_}++ for split/\s/;}print map {"$_:$c{$_}\n"} sort{$c{$b}<=>$c{$a}}keys %c;{' file_of_WHATEVER_to_count.txt
    For example, to count the number of instances of each word in a file, you could use split /\W/ in place of split /\s/, or just split if you wanted to keep punctuation etc. intact. Output is ordered by most frequent occurrence (the top of the "item:count\n" listing).
Re: a question about making a word frequency matrix
by TedPride (Priest) on Dec 08, 2005 at 11:50 UTC
    The following does what you want. Note that I've excluded all one- and two-letter words, plus "and" and "the". You may wish to add to the exclusion list. Also note that I have not provided for the situation where a paragraph word count goes over 999 for a three-letter word; I didn't feel this was necessary.
    use strict;
    use warnings;

    my (%count, @pcount, @words, @values, $format, $i);
    my @remove = qw/and the/; # Words to exclude
    my $width = 10;           # Number of words to display
    $/ = '-100';              # Line delimiter

    while (<DATA>) {
        my %pcount;
        $_ = lc $_;
        while (m/[a-z]+('[a-z]+)?/g) {
            next if length($&) < 3;
            $pcount{$&}++;
            $count{$&}++;
        }
        push @pcount, \%pcount;
    }

    for (@remove) {
        delete $count{$_} if exists $count{$_};
    }

    @words = sort {$count{$b} <=> $count{$a} || $a cmp $b} keys %count;
    $#words = $width - 1 if $#words > $width - 1;

    $format = '%' . length($#pcount+1) . 's';
    $format .= '%' . (length($_) + 2) . 's' for @words;

    print ' 'x10, sprintf($format, '', @words), "\n";

    for ($i = 0; $i <= $#pcount; $i++) {
        my @values = $i + 1;
        push @values, $pcount[$i]{$_} for @words;
        no warnings;
        print 'paragraph ', sprintf($format, @values), "\n";
    }

    push @values, $count{$_} for @words;
    print "\n", ' 'x10, sprintf($format, '', @words), "\n";
    print ' 'x10, sprintf($format, '', @values), "\n";

    __DATA__
    So it was into a neighborhood bursting with rumors and resentment that
    Gopi, the ten year old cousin who had been brought down from the village
    to be the household help, stepped out to do his daily chores. His
    responsibilities included:-100
    1. Carrying the copper tray for the old lady and trotting behind her at
    the proper pace when she went out to do her morning prayers at five am
    in the morning.-100
    2. Bringing the wood, the coal, and the kindling so that the
    daughter-in-law could light the fire.-100
    3. Bringing water from the well to the fifth floor, where the kitchen
    was located.-100
    4. Cutting the vegetables, cleaning the rice, soaking the lentils,
    shelling the peas and any other sundry time-consuming tasks that arose
    in a kitchen with a mortar and pestle and precious little else.-100
    Output:
                 that  bringing  from  her  his  kitchen  morning  old
    paragraph 1     1               1         2                      1
    paragraph 2                          2                      2    1
    paragraph 3     1         1
    paragraph 4               1     1                  1
    paragraph 5     1                                  1

                 that  bringing  from  her  his  kitchen  morning  old
                    3         2     2    2    2        2        2    2
Re: a question about making a word frequency matrix
by planetscape (Chancellor) on Dec 08, 2005 at 08:06 UTC

    Not precisely what you asked for, but since you appear to be in danger of reinventing wheels, I thought I'd point out that there are many mature packages available for handling NLP-related tasks. You may see planetscape's scratchpad for some; don't neglect Super Search either, which you can use to turn up things like Ted Pedersen's Ngram Statistics Package by searching for "ngram" or "word frequency"...

    HTH,

    planetscape