Hello,
I am new to Perl. I am currently trying to write a very simple vector space algorithm. I have searched the site for similar questions to mine, but I have not found anything.
I have programmed this algorithm in another language (LabVIEW) in the past, so I know everything that I want to do. Sadly, I do not know how to do these things in Perl!
The first thing that I need to do is read a large dataset of XML documents (which are spread across different folders under one parent directory) and pick out all of the words in them. Then I need to refine that word list into a dictionary.
I have tried following the example vector space algorithm here: http://www.perl.com/pub/2003/02/19/engine.html
However, it reads the whole dataset into RAM. I have about 1 GB of RAM and the dataset is 2.5 GB, so I am guessing that this will not work.
Here's what I have started with:
use File::Find;
my $localdir = '[PARENT DIRECTORY]';
find( sub { print $File::Find::name, "\n" if /\.xml$/ }, $localdir );
This outputs the names and locations of all the XML files. I guess that this is a good start for reading the files in one at a time so that I can create my dictionary.
Can anyone tell me how to then create the list of words as a separate file?
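To make the question concrete, here is a rough sketch of the direction I imagine, and I would welcome corrections. It reads each file one line at a time, so that only the word counts stay in memory, and then writes them out as a tab-separated file. The sub names (count_words_in_file, write_dictionary, build_dictionary) and the output layout are my own placeholders. Stripping tags with a regex is only an assumption for illustration: it misses tags that span several lines, and a proper XML parser would be more robust.

```perl
use strict;
use warnings;
use File::Find;

# Count the words in one file, reading a line at a time so the whole
# file is never held in memory. Tags are removed with a crude regex
# (an assumption; tags spanning several lines are not handled).
sub count_words_in_file {
    my ($path, $counts) = @_;
    open my $in, '<', $path or do { warn "Cannot open $path: $!\n"; return };
    while (my $line = <$in>) {
        $line =~ s/<[^>]*>/ /g;    # drop tags that open and close on this line
        $counts->{ lc $1 }++ while $line =~ /([A-Za-z]+)/g;
    }
    close $in;
    return;
}

# Write the dictionary as "word<TAB>count" lines, sorted by word.
sub write_dictionary {
    my ($counts, $out_path) = @_;
    open my $out, '>', $out_path or die "Cannot write $out_path: $!\n";
    print {$out} "$_\t$counts->{$_}\n" for sort keys %$counts;
    close $out or die "Cannot close $out_path: $!\n";
    return;
}

# Walk the directory tree, count words in every .xml file, then save.
sub build_dictionary {
    my ($dir, $out_path) = @_;
    my %counts;
    find(
        {
            # no_chdir keeps $_ equal to the full path, so open() works
            # even when $dir is a relative path.
            no_chdir => 1,
            wanted   => sub {
                count_words_in_file($_, \%counts) if /\.xml$/ && -f $_;
            },
        },
        $dir
    );
    write_dictionary(\%counts, $out_path);
    return \%counts;
}

# Usage: perl build_dict.pl PARENT_DIRECTORY dictionary.txt
build_dictionary(@ARGV) if @ARGV == 2;
```

My hope is that the hash of distinct words is far smaller than the 2.5 GB of raw text, so it should fit in memory even though the dataset does not, but I have not verified this on my data.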
I am very grateful for your assistance.
Adam
In reply to Vector space algorithm by AdamEdinburgh