There are several issues you need to address:
- Extracting the text
- Detecting the words
- Eliminating noise words
- Finding word frequencies
Of those steps, the last is the easiest:
use strict;
use warnings;
my $words = <<WORDS;
i have a large mass of files mostly pdf that i want to reorganize based on
keyword frequency as a first step i obviously need to analyze each file
basically i'd like to so something akin to what in sas is called a proc freq
on each file and extract a list of the top 10 keywords for each those stats
would be plugged into a spreadsheet, which would be parsed separately
eventually any suggestions on links on how to get started on this thanks in
advance
WORDS
my %freq;
++$freq{$_} for split /\s+/, $words;
print "$_: $freq{$_}\n" for sort {$freq{$b} <=> $freq{$a}} keys %freq;
Prints (in part):
on: 5
to: 5
a: 5
each: 3
i: 3
file: 2
of: 2
would: 2
in: 2
be: 2
want: 1
which: 1
files: 1
thanks: 1
that: 1
keywords: 1
i'd: 1
eventually: 1
mostly: 1
For steps 2 and 3 you should take a look at the Lingua area of CPAN. For example, Lingua::EN::Splitter may help extract words, and Lingua::EN::StopWords may help remove noise words.
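If you want to see how those modules might slot into the frequency code above, here's a rough, untested sketch. It assumes the interfaces the modules document: words() from Lingua::EN::Splitter returning an array reference of tokens, and Lingua::EN::StopWords exporting a %StopWords hash keyed on lower-case stop words.
use strict;
use warnings;
use Lingua::EN::Splitter qw(words);
use Lingua::EN::StopWords qw(%StopWords);

# Slurp the text to analyze - here from the DATA handle, but in your case
# it would be the text you've already extracted from one of the PDFs.
my $text = do {local $/; <DATA>};

# Split into words, throw away stop words, count the rest
my %freq;
for my $word (@{words(lc $text)}) {
    next if $StopWords{$word};
    ++$freq{$word};
}

# Report the (up to) ten most frequent keywords
my @top = (sort {$freq{$b} <=> $freq{$a}} keys %freq)[0 .. 9];
print "$_: $freq{$_}\n" for grep {defined} @top;

__DATA__
i have a large mass of files mostly pdf that i want to reorganize based on
keyword frequency as a first step i obviously need to analyze each file
That takes you from raw text to a top 10 list in one pass: the stop word check handles step 3, and the split plus count handle steps 2 and 4.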
True laziness is hard work