Several years ago, when I was very new to Perl, I needed to do something a lot like this. I have a wonderful vet, but she thinks computers are some sort of Black Art, and wouldn't even get on the internet. I had just gotten two pet rabbits, and while my vet had treated rabbits before, we agreed that some up-to-date info on this "exotic" pet would be helpful. I mirrored several sites that had some really good info (using a PC Magazine utility called "Site Snagger", though today I'd likely use w3mir or something else), placed the files on CD-ROM, and then needed to build a useful index.

Not being a vet myself, I first needed to know which words were in all the files (I think I just split canonicalized text on spaces or word boundaries). I also needed to filter out "stop words" (see also: Lingua::StopWords) and HTML tags. Much of this I did manually, being, as I said, very new to Perl. ;-)
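For what it's worth, here is roughly how that word-gathering step could look today. This is a minimal sketch, not what I actually ran back then: the name terms_in and the tiny %stop list are made up for illustration (Lingua::StopWords would give you a proper list), and stripping tags with a regex is crude, so a real HTML parser is the better choice for messy pages.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny stop list, for illustration only.
my %stop = map { $_ => 1 } qw(a an and are in is of on or the to);

# Return the distinct non-stop words found in an HTML string, lowercased,
# in order of first appearance.
sub terms_in {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>/ /g;    # crude tag stripping (sketch only)
    my %seen;
    return grep { !$stop{$_} && !$seen{$_}++ }
           map  { lc $_ }
           $text =~ /\b([A-Za-z][A-Za-z'-]*)\b/g;
}

print join(', ', terms_in('<p>Rabbits are prone to <b>malocclusion</b> of the incisors.</p>')), "\n";
```

The output of something like this is only a candidate list; you would still prune it by hand down to the terms worth indexing.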

Essentially, my index-building process was to take the list of terms I did want to include and, for each term, iterate through my documents (recursing into directories) to find which ones contained it. I did this with nothing more complicated than nested loops and grep, as almut suggested.
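The nested-loops-and-grep idea can be sketched like this. Again, this is an illustration under assumptions rather than my old script: build_index is a made-up name, File::Find does the directory recursion, each file is read whole into memory (fine for a CD-ROM's worth of pages), and matching is whole-word and case-insensitive, which assumes the terms are plain words.

```perl
use strict;
use warnings;
use File::Find;

# Build { term => [ files containing it ] } for every file under $root.
sub build_index {
    my ($root, @terms) = @_;

    # Collect the files first, recursing into subdirectories.
    my @files;
    find({ no_chdir => 1, wanted => sub { push @files, $_ if -f } }, $root);

    # Read each file once so the term loop does not reopen it.
    my %content;
    for my $file (@files) {
        open my $fh, '<', $file or do { warn "skipping $file: $!"; next };
        local $/;                           # slurp mode: whole file at once
        my $text = <$fh>;
        $content{$file} = defined $text ? $text : '';
    }

    # For each term, grep out the files whose text contains it.
    my %index;
    for my $term (@terms) {
        my $re = qr/\b\Q$term\E\b/i;
        $index{$term} = [ sort grep { $content{$_} =~ $re } keys %content ];
    }
    return \%index;
}

# Usage: perl build_index.pl /path/to/mirror term1 term2 ...
if (@ARGV) {
    my ($root, @terms) = @ARGV;
    my $index = build_index($root, @terms);
    print "$_: @{ $index->{$_} }\n" for sort keys %$index;
}
```

From the resulting hash it is a short step to printing an HTML page with one entry per term and a link to each file.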

I won't show you my actual code from back then - as an indication of how "green" I was, my scripts used neither use strict; nor use warnings;, but I did use File::Slurp. ;-)

You might also find modules such as Ted Pedersen's Ngram Statistics Package useful for building lists of the words that are in your files now. Some of the advice in Creating Dictionaries may also be helpful.


Update: Oopsie... wrong Ted Pedersen link...

HTH,

planetscape

