AdamEdinburgh has asked for the wisdom of the Perl Monks concerning the following question:
Hello,
I am new to Perl. I am currently trying to write a very simple vector space algorithm. I have searched the site for similar questions to mine, but I have not found anything.
I have programmed this algorithm in another language in the past (LabVIEW), so I know everything that I want to do. Sadly, I do not know how to do these things in Perl!
The first thing that I need to do is read a large dataset of XML documents (which are all in different folders under one parent directory) and pick out all of the words in them. Then I need to refine that word list into a dictionary.
I have tried following the example vector space algorithm here: http://www.perl.com/pub/2003/02/19/engine.html
However, it reads the whole dataset into RAM. I have about 1 GB of RAM and the dataset is 2.5 GB, so I am guessing that this will not work.
Here's what I have started with:
    use File::Find;
    my $localdir = '[PARENT DIRECTORY]';
    find( sub { print $File::Find::name, "\n" if /\.xml$/ }, $localdir );
This outputs the names and locations of all the XML files. I guess that this is a good start for reading the files in one at a time so that I can create my dictionary.
Can anyone tell me how to then create the list of words as a separate file?
I am very grateful for your assistance.
Adam
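One way to approach the question above, as a minimal sketch rather than a definitive answer: extend the File::Find loop so that each XML file is read line by line and only a hash of word counts stays in memory. The names `words_in_line`, `build_dictionary`, `write_dictionary` and the output file `dictionary.txt` are hypothetical, chosen for illustration. The tag stripping is a crude regex that assumes tags do not span lines; a real XML parser would be more robust. The hash grows with the vocabulary, which is usually far smaller than the dataset itself, but that is an assumption about the data.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: strip XML tags from one line and return lowercase words.
# The regex is a rough stand-in for a real XML parser (assumes no multi-line tags).
sub words_in_line {
    my ($line) = @_;
    $line =~ s/<[^>]*>/ /g;    # replace anything that looks like a tag with a space
    return map { lc($_) } $line =~ /([A-Za-z']+)/g;
}

# Walk $dir, read each .xml file one line at a time, and count every word.
# Only the counts are kept in memory, never a whole document.
sub build_dictionary {
    my ($dir) = @_;
    my %count;
    find(
        sub {
            return unless -f $_ && /\.xml$/;
            open my $fh, '<', $_
                or do { warn "Cannot open $File::Find::name: $!\n"; return };
            while ( my $line = <$fh> ) {
                $count{$_}++ for words_in_line($line);
            }
            close $fh;
        },
        $dir
    );
    return \%count;
}

# Write one "word<TAB>count" line per dictionary entry, sorted alphabetically.
sub write_dictionary {
    my ( $count, $outfile ) = @_;
    open my $out, '>', $outfile or die "Cannot write $outfile: $!\n";
    print {$out} "$_\t$count->{$_}\n" for sort keys %$count;
    close $out or die "Cannot close $outfile: $!\n";
}

# Usage: perl dictionary.pl PARENT_DIRECTORY [OUTPUT_FILE]
my ( $dir, $outfile ) = @ARGV;
if ( defined $dir ) {
    write_dictionary( build_dictionary($dir), $outfile // 'dictionary.txt' );
}
```

Keeping the counts alongside the words makes the later "refine into a dictionary" step easier, for example by dropping words that occur only once or by removing a stop list.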
Re: Vector space algorithm
by snape (Pilgrim) on Jun 07, 2012 at 21:26 UTC

Re: Vector space algorithm
by locked_user sundialsvc4 (Abbot) on Jun 08, 2012 at 14:23 UTC