There are several issues you need to address:
- Extracting the text
- Detecting the words
- Eliminating noise words
- Finding word frequencies
Of those steps, the last is the easiest:
use strict;
use warnings;
my $words = <<WORDS;
i have a large mass of files mostly pdf that i want to reorganize based on
keyword frequency as a first step i obviously need to analyze each file
basically i'd like to so something akin to what in sas is called a proc freq
on each file and extract a list of the top 10 keywords for each those stats
would be plugged into a spreadsheet, which would be parsed separately
eventually any suggestions on links on how to get started on this thanks in
advance
WORDS
my %freq;
++$freq{$_} for split /\s+/, $words;
print "$_: $freq{$_}\n" for sort {$freq{$b} <=> $freq{$a}} keys %freq;
Prints (in part):
on: 5
to: 5
a: 5
each: 3
i: 3
file: 2
of: 2
would: 2
in: 2
be: 2
want: 1
which: 1
files: 1
thanks: 1
that: 1
keywords: 1
i'd: 1
eventually: 1
mostly: 1
For steps 2 and 3 you should take a look at the Lingua area of CPAN. For example, Lingua::EN::Splitter may help extract words, and Lingua::EN::StopWords may help remove noise words.
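If you want to see how those modules might slot into the frequency code above, here's a rough, untested sketch. It assumes the interfaces the modules document: words() from Lingua::EN::Splitter returning an array reference of tokens, and Lingua::EN::StopWords exporting a %StopWords hash keyed on lower-case stop words.
use strict;
use warnings;
use Lingua::EN::Splitter qw(words);
use Lingua::EN::StopWords qw(%StopWords);

# Slurp the text to analyze - here from the DATA handle, but in your case
# it would be the text you've already extracted from one of the PDFs.
my $text = do {local $/; <DATA>};

# Split into words, throw away stop words, count the rest
my %freq;
for my $word (@{words(lc $text)}) {
    next if $StopWords{$word};
    ++$freq{$word};
}

# Report the (up to) ten most frequent keywords
my @top = (sort {$freq{$b} <=> $freq{$a}} keys %freq)[0 .. 9];
print "$_: $freq{$_}\n" for grep {defined} @top;

__DATA__
i have a large mass of files mostly pdf that i want to reorganize based on
keyword frequency as a first step i obviously need to analyze each file
That takes you from raw text to a top 10 list in one pass: the stop word check handles step 3, and the split plus count handle steps 2 and 4.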
True laziness is hard work