svend_ok has asked for the wisdom of the Perl Monks concerning the following question:

I have a large mass of files (mostly PDF) that I want to reorganize based on keyword frequency. As a first step, I obviously need to analyze each file. Basically, I'd like to do something akin to what SAS calls a "PROC FREQ" on each file and extract a list of the top 10 keywords for each. Those stats would be plugged into a spreadsheet, which would eventually be parsed separately. Any suggestions or links on how to get started? Thanks in advance.

Replies are listed 'Best First'.
Re: extracting keyword frequency from files
by GrandFather (Saint) on Jun 04, 2010 at 01:12 UTC

    There are several issues you need to address:

    1. Extracting the text
    2. Detecting the words
    3. Eliminating noise words
    4. Finding word frequencies

    Of those steps the last is the easiest:

    use strict;
    use warnings;

    my $words = <<WORDS;
    i have a large mass of files mostly pdf that i want to reorganize based
    on keyword frequency as a first step i obviously need to analyze each file
    basically i'd like to so something akin to what in sas is called a proc
    freq on each file and extract a list of the top 10 keywords for each those
    stats would be plugged into a spreadsheet, which would be parsed separately
    eventually any suggestions on links on how to get started on this thanks
    in advance
    WORDS

    my %freq;

    ++$freq{$_} for split /\s+/, $words;
    print "$_: $freq{$_}\n" for sort {$freq{$b} <=> $freq{$a}} keys %freq;

    Prints (in part):

    on: 5
    to: 5
    a: 5
    each: 3
    i: 3
    file: 2
    of: 2
    would: 2
    in: 2
    be: 2
    want: 1
    which: 1
    files: 1
    thanks: 1
    that: 1
    keywords: 1
    i'd: 1
    eventually: 1
    mostly: 1
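
    The original question asked for only the top 10 keywords per file. What follows is a minimal sketch of trimming the sorted list, not part of the code above: top_words is a hypothetical helper name, and the input string is made-up sample data. Ties are broken alphabetically here, because sorting on the count alone leaves words with equal counts in whatever order the hash happens to return them.

```perl
use strict;
use warnings;

# Hypothetical helper: return up to $n [word, count] pairs, highest count first.
# Ties are broken alphabetically so the order is repeatable between runs.
sub top_words {
    my ($freq, $n) = @_;
    my @sorted = sort { $freq->{$b} <=> $freq->{$a} or $a cmp $b } keys %$freq;
    $#sorted = $n - 1 if @sorted > $n;    # keep only the first $n words
    return map { [ $_, $freq->{$_} ] } @sorted;
}

# Made-up sample data, counted the same way as in the code above.
my %freq;
++$freq{$_} for split ' ', 'on to a on to a on each i each file';

# Print the three most frequent words; pass 10 for a top-10 list.
printf "%s: %d\n", @$_ for top_words(\%freq, 3);
```

    Returning pairs rather than printing inside the helper should make it easier to write the same data to a spreadsheet-friendly format later.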

    For steps 2 and 3 you should take a look at the Lingua area of CPAN. For example Lingua::EN::Splitter may help extract words. Lingua::EN::StopWords may help remove noise words.
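
    To give a feel for what steps 2 and 3 involve before reaching for CPAN, here is a rough sketch using only core Perl. The keywords helper and the tiny %stop list are assumptions made up for illustration, not what the Lingua modules do internally. A real stop list would be much longer, and the word regex only handles plain ASCII letters.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny stop list for illustration only;
# a real list (e.g. from Lingua::EN::StopWords) would be far longer.
my %stop = map { $_ => 1 } qw(
    a an and as be each for i in into is of on so that the this to what which would
);

# Step 2: lowercase the text and pull out runs of letters, allowing an
# internal apostrophe so "i'd" stays one word. Digits and punctuation drop out.
# Step 3: discard anything found in the stop list.
sub keywords {
    my ($text) = @_;
    my @words = lc($text) =~ /[a-z]+(?:'[a-z]+)*/g;
    return grep { !$stop{$_} } @words;
}

my %freq;
++$freq{$_} for keywords(
    "Extract a list of the top 10 keywords, which would be plugged into a spreadsheet."
);

# Highest count first, alphabetical within equal counts.
print "$_: $freq{$_}\n" for sort { $freq{$b} <=> $freq{$a} or $a cmp $b } keys %freq;
```

    Note that this drops "10" and strips the comma from "keywords,", both of which a plain split on whitespace would have kept.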

    True laziness is hard work
Re: extracting keyword frequency from files
by Khen1950fx (Canon) on Jun 04, 2010 at 00:23 UTC
Re: extracting keyword frequency from files
by planetscape (Chancellor) on Jun 04, 2010 at 01:32 UTC