Re: Vector space algorithm

I feel that the above approach might be only a small part of the solution. A module such as File::Find (and its many brethren) can tackle the first task of locating the files, perhaps by pushing all of the names onto a list. The second part is going to require the services of a Perl XML library, e.g. XML::LibXML. Note that its default DOM parser builds the whole document in memory, so for very large files you would want its streaming XML::LibXML::Reader interface instead. I would also suggest investigating pure-XML technologies, such as XPath and XSLT, that might enable you to at least isolate the relevant strings within the XML structure without writing location-specific Perl logic to do so. You might even discover that a substantial and useful subset of the process can be expressed as an XSLT transformation.
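To make the two steps concrete, here is a rough sketch rather than a finished solution. It assumes XML::LibXML is installed and that the text you care about lives in <body> elements; that layout, the sample file, and the helper names find_xml_files and count_words are all invented for illustration. It uses the in-memory DOM parser, so it suits modestly sized files only.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);
use XML::LibXML;

# Step one: collect every *.xml file below $dir into a list.
sub find_xml_files {
    my ($dir) = @_;
    my @files;
    find(
        {
            # With no_chdir, $_ holds the full path of the current entry.
            wanted   => sub { push @files, $File::Find::name if -f && /\.xml\z/i },
            no_chdir => 1,
        },
        $dir,
    );
    return sort @files;
}

# Step two: parse one file and tally the words found under $xpath.
# The XPath expression isolates the relevant strings, so no
# location-specific Perl logic is needed.
sub count_words {
    my ($file, $xpath, $counts) = @_;
    my $doc = XML::LibXML->load_xml(location => $file);
    for my $node ($doc->findnodes($xpath)) {
        $counts->{lc $_}++ for $node->textContent =~ /\w+/g;
    }
    return $counts;
}

# Demo on a throwaway directory; the <doc>/<title>/<body> layout is hypothetical.
my $dir = tempdir(CLEANUP => 1);
open my $fh, '>', "$dir/a.xml" or die "Cannot write sample file: $!";
print {$fh} '<doc><title>Ignored</title><body>Perl parses XML; perl wins</body></doc>';
close $fh or die "Cannot close sample file: $!";

my %counts;
count_words($_, '//body', \%counts) for find_xml_files($dir);
printf "%s %d\n", $_, $counts{$_} for sort keys %counts;
```

On the sample file, "perl" is counted twice and the word in <title> is skipped, because the XPath expression selects only <body>. The resulting hash of term counts per file is the sort of raw material a vector-space approach would start from, but I leave that part to others.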

I generally don’t feel that the proper approach for dealing with what is known to be an XML file is to treat it simply line-by-line as a file, even if you are “merely” looking for words. XML documents have a complex internal structure that must be respected (an element’s text can span several lines, and entities and CDATA sections have to be decoded), and there are many sophisticated, well-tested tools and libraries for dealing with them. (The Perl module cited above is, of course, a “wrapper” API for one of those libraries, namely libxml2.)

I now leave the “vector-space algorithm” part of the issue to the wisdom of other Monks, for about such things I know nothing at all.