lechateau has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone! Just to make everything clear. I need this for a project were I am being graded. However, the project is not about Perl. It is about doing statistics with unstructured data, ie text. I can do this on excel manually, but I think it would be nice to have a code that will generalyze my analysis and algorithm to any collection of texts. The final goal is to create a process to classify text and information, with no human interaction. That is what I being graded on and I do not need help on that. Just perl... counting words in text and stuff like that. Here is my question Given this code
#! perl -w $filename = "tryit.txt"; open(IN, $filename) || die; my %freq; my @title; # array of titles my $story; # number of current story while(<IN>) { if(/^\<(.*)\>\s*$/) { # It's a title push @title, $1; $story = $#title; } elsif (defined $story) { # It's plain text s/[\.,:;\?"!\(\)\[\]\{\}(--)_]//g; foreach my $word (/\w+/g) { $freq{lc $word}[$story]++; } } } # print "\n\nOutput tab delimited text file:\n\n"; { local($\, $,) = ("\n", "\t"); print '', @title; foreach my $row (sort keys %freq) { print $row, map $_ || '', @{$freq{$row}}[0 .. @title-1] } } close IN
this code takes a multiple pieces of text and creates a table with the words that appear on each story as rows and story titles as columns. Then each "cell" counts the number of times each word appears on each story So for example Story One: Perl is great Story Two: Perl is free perl Story three: Will I learn perl? will return:
story1 Story2 Story3 Perl 1 2 1 is 1 1 great1 free 1 will 1 i 1 learn 1
NOw in order to do what I need to accomplish my final task i need to sum rows, that is for example: how many times does the word perl appears on the stories? then I need to sum colums, how may words does story one have?, And finally I need to find out how many words do stories 1 and 2 or 3 have in common. I know I could take the output and do this on excel, however i need to hand in perl code.... Thank you!!

Replies are listed 'Best First'.
Re: Add colums and rows
by apl (Monsignor) on Apr 22, 2008 at 01:10 UTC
      I guess Trivial is a relative word! Thanks for your help...
Re: Add colums and rows
by tachyon-II (Chaplain) on Apr 22, 2008 at 03:51 UTC

    It would be better to use a better data structure, updating the word counts and totals as you go. Using your existing data structure you have to iterate through everything to get the answers. Seeing you effectively iterated through everything in the first place you could have generated this data at the same time.

    Although it may surprise you a hash of hashes might well have been more space efficient. Say you have 100 stories and in the last story you have an unusual word. Perl has to allocate 100 array slots to put the word count for this word at the correct offset. All the preceding elements spring into existence with value undef.

    # $freq{lc $word}[$story]++; my @word_count; my $common_to_all; for my $word( keys %freq ) { my $total = 0; my $common = 0; for my $story( 0..$#title ) { next unless $freq{$word}[$story]; $common++; $total += $freq{$word}[$story]; $word_count[$story] += $freq{$word}[$story]; } printf "\n%-10s %d", $word, $total; if ($common == @title) { $common_to_all++; print " (common to all)" ; } } print "\nStory $_ has $word_count[$_] words." for 0..$#title; print "\nThe stories have $common_to_all words in common.\n";
Re: Add colums and rows
by GrandFather (Saint) on Apr 22, 2008 at 02:40 UTC

    It may help if you clarify what it is that you actually want. Is it the summary data only, the current table with summary data, or something else you haven't mentioned.

    It's also appreciated if you indicate up front that you are seeking help with homework ("however i need to hand in perl code"), although CB comments indicate that this may not be exactly homework ("need to do this for a thesis ... no perl, perl is just a way"). Any any case a little background on what you are trying to achieve can save a lot of time all round.

    Update: The following may be what you are after:

    use strict; use warnings; my %freqs; my $story; # Current story name while (<DATA>) { if (/^\[(.*)\]\s*$/) { $story = ucfirst $1; # Force title caps die "Duplicate story title: $story" if exists $freqs{$story}; next; } next unless defined $story; # wait until we have a story title for my $word (/\w+/g) { # Current story counts $word = ucfirst $word; $freqs{$word}{$story}++; $freqs{all}{$story}++; # Total counts $freqs{$word}{total}++; $freqs{all}{total}++; } } # Print title line print "\t", (join "\t", sort keys %{$freqs{all}}), "\n"; # Print table for my $word (sort keys %freqs) { $freqs{$word}{$_} ||= 0 for keys %{$freqs{all}}; printf "$word\t"; print join "\t", join "\t", map $freqs{$word}{$_}, sort keys %{$fr +eqs{$word}}; print "\n"; } __DATA__ [para one] Hello everyone! Just to make everything clear. I need this for a proje +ct were I am being graded. However, the project is not about Perl. It is about d +oing statistics with unstructured data, ie text. I can do this on excel man +ually, but I think it would be nice to have a code that will generalyze my analys +is and algorithm to any collection of texts. The final goal is to create a pr +ocess to classify text and information, with no human interaction. That is what + I being graded on and I do not need help on that. Just perl... counting words +in text and stuff like that. Here is my question Given this code [para two] this code takes a multiple pieces of text and creates a table with the + words that appear on each story as rows and story titles as columns. Then ea +ch "cell" counts the number of times each word appears on each story So for exam +ple Story One: Perl is great Story Two: Perl is free perl Story three: Will I le +arn perl? will return:

    Prints:

    Para one Para two total A 3 2 5 About 2 0 2 Algorithm 1 0 1 Am 1 0 1 Analysis 1 0 1 And 4 2 6 Any 1 0 1 Appear 0 1 1 Appears 0 1 1 As 0 2 2 Be 1 0 1 Being 2 0 2 But 1 0 1 Can 1 0 1 Cell 0 1 1 Classify 1 0 1 Clear 1 0 1 Code 2 1 3 Collection 1 0 1 Columns 0 1 1 Counting 1 0 1 Counts 0 1 1 Create 1 0 1 Creates 0 1 1 Data 1 0 1 Do 2 0 2 Doing 1 0 1 Each 0 4 4 Everyone 1 0 1 Everything 1 0 1 Example 0 1 1 Excel 1 0 1 Final 1 0 1 For 1 1 2 Free 0 1 1 Generalyze 1 0 1 Given 1 0 1 Goal 1 0 1 Graded 2 0 2 Great 0 1 1 Have 1 0 1 Hello 1 0 1 Help 1 0 1 Here 1 0 1 However 1 0 1 Human 1 0 1 I 6 1 7 Ie 1 0 1 In 1 0 1 Information 1 0 1 Interaction 1 0 1 Is 5 2 7 It 2 0 2 Just 2 0 2 Learn 0 1 1 Like 1 0 1 Make 1 0 1 Manually 1 0 1 Multiple 0 1 1 My 2 0 2 Need 2 0 2 Nice 1 0 1 No 1 0 1 Not 2 0 2 Number 0 1 1 Of 1 2 3 On 3 2 5 One 0 1 1 Perl 2 4 6 Pieces 0 1 1 Process 1 0 1 Project 2 0 2 Question 1 0 1 Return 0 1 1 Rows 0 1 1 So 0 1 1 Statistics 1 0 1 Story 0 6 6 Stuff 1 0 1 Table 0 1 1 Takes 0 1 1 Text 3 1 4 Texts 1 0 1 That 4 1 5 The 2 2 4 Then 0 1 1 Think 1 0 1 This 3 1 4 Three 0 1 1 Times 0 1 1 Titles 0 1 1 To 5 0 5 Two 0 1 1 Unstructured 1 0 1 Were 1 0 1 What 1 0 1 Will 1 2 3 With 2 1 3 Word 0 1 1 Words 1 1 2 Would 1 0 1 all 114 63 177

    Perl is environmentally friendly - it saves trees
      Thank you, this gets me closer to what I need!! I guess I will need two more things.... How do I print the table to a file. I can open a file handle but I dont know how to print the entire table to a file, this far i only figured out how to print lines... (perl baby steps!) The next thing, can I modify this to count the shared words between stories, too? Thanks!
        i can open a file handle but I dont know how to print the entire table to a file

        Let's assume you opened filehandle REPORT. The following code from Grandfather

        for my $word (sort keys %freqs) { $freqs{$word}{$_} ||= 0 for keys %{$freqs{all}}; printf "$word\t"; print join "\t", join "\t", map $freqs{$word}{$_}, sort keys % +{$freqs{$word}}; print "\n"; }

        could be changed to:

        for my $word (sort keys %freqs) { $freqs{$word}{$_} ||= 0 for keys %{$freqs{all}}; printf REPORT "$word\t"; print REPORT join "\t", join "\t", map $freqs{$word}{$_}, sort + keys %{$freqs{$word}}; print REPORT "\n"; }
        The next thing, can I modify this to count the shared words between stories, too?
        Certainly. Will you be reading several stories from the same DATA block? Then you don't need to change a thing.

        Will you change the program to read from several different input files? How do you want to give those filenames to your program?

        Thanks everyone that helped me to get this done. I was finally able to do the shared word count on my own! Thanks again for all your time and help! And apl Trivial is a very relative term!!! I spent 20 hours to get the counting piece done. :)
        $filename = "tryit.txt"; open(IN, $filename) || die; open(OUT, ">test1.csv") || die; open(OUT1, ">test2.csv") || die; my %freqs; my $story; # Current story name while (<IN>) { if (/^\<(.*)\>\s*$/) { $story = ucfirst $1; die "Duplicate story title: $story" if exists $freqs{$story}; next; } next unless defined $story; # wait until we have a story title s/[\.,:;\?"!\(\)\[\]\{\}(--)_]//g; for my $word (/\w+/g) { # Current story counts $word = ucfirst $word; $freqs{$word}{$story}++; $freqs{all}{$story}++; # Total counts $freqs{$word}{total}++; $freqs{all}{total}++; } } # Print title line print OUT "\t", (join "\t", sort keys %{$freqs{all}}), "\n"; # Print table for my $word (sort keys %freqs) { $freqs{$word}{$_} ||= 0 for keys %{$freqs{all}}; printf OUT "$word\t"; print OUT join "\t", join "\t", map $freqs{$word}{$_}, sort keys %{ +$freqs{$word}}; print OUT "\n"; } @info=sort keys %{$freqs{all}}; my @countop; for($i=0;$i<scalar(@info)-1;$i=$i+1) { $story1=$info[$i]; for($j=0;$j<scalar(@info)-1;$j=$j+1) { $story2=$info[$j]; for my $word (sort keys %freqs) { $freqs{$word}{$_} ||= 0 for keys %{$freqs{all}}; if($freqs{$word}{$story1} > 0){ if($freqs{$word}{$story2} > 0){ $countop[$i][$j]++;}} } } } my $m; my $n; for($m=0;$m<scalar(@info)-1;$m=$m+1) { for($n=0;$n<scalar(@info)-1;$n=$n+1) { printf OUT1 "$info[$m],$info[$n],$countop[$m][$n] \n"; } } close IN; close OUT; close OUT1;