AdamEdinburgh has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I am new to Perl. I am currently trying to write a very simple vector space algorithm. I have searched the site for similar questions to mine, but I have not found anything.

I have programmed this algorithm in another language (LabVIEW) in the past, so I know everything that I want to do. Sadly, I do not know how to do these things in Perl!

The first thing that I need to do is read a large dataset of XML documents (which are in separate folders under one parent directory) and pick out all of the words in them. Then I need to refine that word list into a dictionary.

I have tried following the example vector space algorithm here: http://www.perl.com/pub/2003/02/19/engine.html

However, it reads the whole dataset into RAM. I have about 1 GB of RAM and the dataset is 2.5 GB, so I am guessing that this will not work.

Here's what I have started with:

use File::Find;

my $localdir = '[PARENT DIRECTORY]';
find( sub { print $File::Find::name, "\n" if /\.xml$/ }, $localdir );

This outputs all of the file names and locations for the xml files. I guess that this is a good start for reading the files in one at a time so that I can create my dictionary.
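For the record, here is a small variation of the snippet above (untested against my real tree, same placeholder directory) that collects the names into an array instead of printing them, so the files can then be processed one at a time:

```perl
use strict;
use warnings;
use File::Find;

## Collect the .xml paths under a directory instead of printing them.
sub find_xml_files {
    my ($dir) = @_;
    my @files;
    find( sub { push @files, $File::Find::name if /\.xml$/ }, $dir );
    return @files;
}

## e.g. my @xmlfiles = find_xml_files('[PARENT DIRECTORY]');
```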

Can anyone tell me how to then create the list of words as a separate file?

I am very grateful for your assistance.

Adam

Replies are listed 'Best First'.
Re: Vector space algorithm
by snape (Pilgrim) on Jun 07, 2012 at 21:26 UTC

    Hi

    You can do the following steps:

    1. Glob the parent directory, or use File::Find, to find all the relevant files.
    2. Read the files you are interested in and parse out the words you want into an output file.
    3. You will then have the dictionary of words (and the locations of the files, if you are interested in those).

    Here is some starting code:

    ## Globbing
    my @xmlfiles = glob("$inPath/*.xml");
    foreach my $file (@xmlfiles) {
        print "$file\n";    ## lists all of the files
    }

    Reference: GlobFiles

    open my $OUT, '>', 'outfile.txt' or die $!;  ## open the output once, before the loop
    foreach my $file (@xmlfiles) {
        open my $IN, '<', $file or die $!;       ## open each file in turn
        while (<$IN>) {                          ## read the file line by line
            ## split out and parse the words you are interested in;
            ## suppose the scalar variable $word holds the word you
            ## want to output, then:
            print $OUT $word, "\n";
        }
        close($IN);
    }
    close($OUT);

    Reference: split, openreadfiles, whileloop
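    Putting the steps above together, here is how the whole thing might look as one small script. This is only a sketch: it treats any run of letters as a "word", so tag names such as doc get counted too, and it assumes the dictionary (but not the 2.5 GB dataset) fits in memory.

```perl
use strict;
use warnings;
use File::Find;

## Tally the letter-runs of one file into the hash ref $count,
## reading line by line so memory use stays small.
sub count_words_in_file {
    my ( $file, $count ) = @_;
    open my $in, '<', $file or die "$file: $!";
    while ( my $line = <$in> ) {
        $count->{ lc $_ }++ for grep { length } split /[^A-Za-z]+/, $line;
    }
    close $in;
}

## Walk the tree and build the dictionary one file at a time.
sub build_dictionary {
    my ($localdir) = @_;
    my @xmlfiles;
    find( sub { push @xmlfiles, $File::Find::name if /\.xml$/ }, $localdir );
    my %count;
    count_words_in_file( $_, \%count ) for @xmlfiles;
    return \%count;
}

if ( my $localdir = shift @ARGV ) {
    my $count = build_dictionary($localdir);
    open my $out, '>', 'dictionary.txt' or die $!;
    print $out "$_\t$count->{$_}\n"
        for sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count;
    close $out;
}
```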

Re: Vector space algorithm
by sundialsvc4 (Abbot) on Jun 08, 2012 at 14:23 UTC

    I feel that the above approach might be only a small part of the solution. A module such as File::Find (and its many brethren) can tackle the first task of locating the files, perhaps so that all of the names can be pushed onto a list. The second part is going to require the services of a Perl library, e.g. XML::LibXML, that is well known to be capable of handling arbitrarily large documents. I would also suggest investigating pure-XML technologies, such as XPath and XSLT, that might enable you to at least isolate the relevant strings within the XML structure without writing location-specific Perl logic to do so. You might even discover that a substantial and useful subset of the process can be expressed as an XSLT transformation.

    I generally don’t feel that the proper approach for dealing with what is known to be an XML file is to treat it simply line-by-line as a text file, even if you are “merely” looking for words. XML documents have a complex internal structure that must be respected, and there are many sophisticated, well-tested tools and libraries for dealing with them. (The Perl module cited above is, of course, a “wrapper” API for one of those libraries.)
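    To illustrate, a minimal sketch of the parser-based approach with XML::LibXML: the parser, not a regex, strips the markup, and you split only the document's text content into words. (The module is require'd at run time here so the sketch compiles even where it is not installed; it is a CPAN module wrapping libxml2.)

```perl
use strict;
use warnings;

## Extract the text content of an XML document and split it into
## lowercase words, letting the parser deal with the markup.
sub words_from_xml_string {
    my ($xml) = @_;
    require XML::LibXML;
    my $doc = XML::LibXML->load_xml( string => $xml );
    return grep { length }
           map  { lc }
           split /[^A-Za-z]+/, $doc->documentElement->textContent;
}

## For a file on disk, use load_xml( location => $path ) instead,
## or XML::LibXML::Reader for documents too large to hold in memory.
```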

    I now leave the “vector-space algorithm” part of the issue to the wisdom of other Monks, for about such things I know nothing at all.