in reply to Re: Word Frequency in Particular Sentences
in thread Word Frequency in Particular Sentences

Let me 'fess up. I was hoping that someone had already solved a similar problem and that I could simply adapt the solution. As a well-aged academic economist, I am way past trying to master Perl at any deep level. Still, I would be grateful if you (or someone) could assure me that this is a doable problem in Perl and maybe point to a few functions or regular expressions that could be used in this case. Thanks.

Re^3: Word Frequency in Particular Sentences
by roboticus (Chancellor) on Mar 28, 2008 at 03:51 UTC
    OK ... here's a small code example to get you started. (You'll still want to hit CPAN for a PDF parsing module, though.)

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    # Tell perl to split records on periods.
    $/ = '.';

    my %words;

    # Read successive lines from our __DATA__ section
    while (<DATA>) {
        # Skip the sentence unless it contains the text "asia"
        next unless m/asia/i;

        # Remove extraneous characters
        tr/a-zA-Z/ /cs;

        # Show each sentence we keep
        print "<$_>\n";

        # Increment the counter for each word found
        map { $words{$_}++ } split;
    }

    print "\n\n"
        . "Count Word\n"
        . "----- -------------\n";

    # Print all words in the sentences that appear more than once.
    for (sort keys %words) {
        next unless $words{$_} > 1;
        print "$words{$_}\t$_\n";
    }

    __DATA__
    Now is the time for all good. Men to come to the Asia of their party.
    The quick red fox jumped over the calico cat. One fish two fish asiatic
    fish blue fish. Zoom. When must we come to asia to see the fox? Dolum
    ipsum dolor est. Canem homo mordet. I would guess that few people speak
    latin in Asia. Perhaps many more asians speak greek. But how would I know?

    When run on my machine, it gives us:

    roboticus~ $ ./re_test.pl
    < Men to come to the Asia of their party >
    < One fish two fish asiatic fish blue fish >
    < When must we come to asia to see the fox Dolum ipsum dolor est >
    < I would guess that few people speak latin in Asia >
    < Perhaps many more asians speak greek >


    Count Word
    ----- -------------
    2       Asia
    2       come
    4       fish
    2       speak
    2       the
    4       to
    roboticus~ $

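    Two things show up in that output: the question mark after "fox" does not end a record, so two sentences run together, and "Asia" and "asia" are counted as different words. Here is a minimal sketch of one way around both. It assumes it is acceptable to treat '.', '?' or '!' followed by whitespace as the end of a sentence (abbreviations would still be split in the wrong place) and to fold every word to lower case. The variable names and the shortened sample text are only for illustration.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Slurp all of the text at once instead of reading period-delimited records.
    my $text = do { local $/; <DATA> };

    my %count;

    # Split wherever sentence-ending punctuation is followed by whitespace.
    for my $sentence (split /(?<=[.?!])\s+/, $text) {
        # Keep only sentences that mention "asia" (also matches "asians", "asiatic").
        next unless $sentence =~ /asia/i;

        # Count each word, lower-cased so "Asia" and "asia" share one entry.
        $count{ lc $_ }++ for $sentence =~ /[A-Za-z]+/g;
    }

    # Print the words that appear more than once, most frequent first.
    for my $word (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
        next unless $count{$word} > 1;
        print "$count{$word}\t$word\n";
    }

    __DATA__
    When must we come to asia to see the fox? Dolum ipsum dolor est.
    I would guess that few people speak latin in Asia.
    Perhaps many more asians speak greek.
    ```

    On this sample it should report asia, speak and to, each with a count of 2. In a real script the __DATA__ line has to start in the first column.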
    ...roboticus
Re^3: Word Frequency in Particular Sentences
by nefigah (Monk) on Mar 28, 2008 at 06:25 UTC

    And here is some code for getting the text out of a PDF, using an excellent little CPAN module called CAM::PDF. (If you don't know how to install CPAN modules, just ask).

    This goes through a PDF page-by-page, grabbing the text, and then saves it all to a text file. Note that if your PDF is huge you may want to modify this to do it in chunks (the 367 page PDF I tested it on only took a few seconds, though).

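    #!/usr/bin/perl
    use warnings;
    use strict;
    use CAM::PDF;

    my $pdf_path = $ARGV[0] or die "No pdf specified";

    # new() returns undef on failure and leaves the reason in $CAM::PDF::errstr
    my $pdf = CAM::PDF->new($pdf_path)
        or die "Cannot open $pdf_path: $CAM::PDF::errstr";

    # Collect the text of every page, in order
    my $text = '';
    for my $page (1..$pdf->numPages) {
        $text .= $pdf->getPageText($page);
    }

    # Save everything to a text file
    open my $file, '>', 'pdftext.txt' or die "Cannot write pdftext.txt: $!";
    print $file $text;
    close $file;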
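    For a huge PDF, one way to "do it in chunks" as mentioned above is to write each page to the output file as soon as it is extracted, rather than building one big string first. This is only a sketch under that assumption. It uses the same CAM::PDF calls as the script above (new, numPages, getPageText), and the check for an undefined page is a precaution I added, not something the module's documentation is quoted on here.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use CAM::PDF;

    my $pdf_path = $ARGV[0] or die "No pdf specified\n";

    my $pdf = CAM::PDF->new($pdf_path)
        or die "Cannot open $pdf_path: $CAM::PDF::errstr\n";

    open my $out, '>', 'pdftext.txt'
        or die "Cannot write pdftext.txt: $!\n";

    for my $page (1 .. $pdf->numPages) {
        my $text = $pdf->getPageText($page);

        # Precaution (assumption): skip any page that yields no text.
        next unless defined $text;

        # Write this page immediately so only one page is held in memory.
        print {$out} $text;
    }

    close $out or die "Error closing pdftext.txt: $!\n";
    ```

    The resulting pdftext.txt is the same as before; only the peak memory use should differ.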


    I'm a peripheral visionary... I can see into the future, but just way off to the side.

      Thanks to all of you for your help. I appreciate it. Two quick comments: (i) Regarding the abbreviation problem, a quick manual scan indicates that all Asian sentences are (thankfully) bounded by periods. (ii) How do you, in fact, install CPAN modules? I tried once, but got some error message along the lines of "file not found". Thanks again.
        (ii) How do you, in fact, install CPAN modules? I tried once, but got some error message along the lines of "file not found".

        Please don't overlook our very fine Tutorials: Installing Modules should be very helpful to you.

        HTH,

        planetscape

        What operating system are you using? General instructions are here, but I think they're a bit old (the *nix ones look fine though at least). I believe on Windows there is a Perl package manager that you would use.
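
        Before installing anything, it can help to check whether Perl can already find the module. The following is a small sketch, not an official tool: it tries to load a module (CAM::PDF by default, or whatever name is given on the command line) and, if that fails, prints the usual cpan command as a suggestion. It assumes the cpan client that normally ships with Perl is available; on Windows the package manager mentioned above may be the better route.

        ```perl
        #!/usr/bin/perl
        use strict;
        use warnings;

        # Module to look for; CAM::PDF is the one used earlier in this thread.
        my $module = @ARGV ? $ARGV[0] : 'CAM::PDF';

        # Turn Some::Module into Some/Module.pm, the form require() wants for a file.
        (my $file = "$module.pm") =~ s{::}{/}g;

        if (eval { require $file; 1 }) {
            # %INC records the path each loaded module came from.
            print "$module is installed at $INC{$file}\n";
        }
        else {
            print "$module is not installed (or not on Perl's search path).\n",
                  "From a command prompt, try:  cpan $module\n";
        }
        ```

        Saved under a hypothetical name such as check_module.pl, it would be run as "perl check_module.pl CAM::PDF".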

        If you are having trouble, you should register an account here and send me a message and I'll try and walk you through it.

