Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

Please bear with me; I have not touched Perl in about six years, but I have a job that I think Perl is perfect for. I have a website that I'm trying to build a glossary of terms for. The idea is to have a glossary page with definitions and then, below each definition, show WHERE that term shows up throughout the site. I essentially need something that searches across all of the webpages and finds the instances of the word I'm looking for. I'd like something like this:

$> perl wordsearch.pl <searchword>
Search Results:
<searchword> found in hello.html
<searchword> found in index.html
<searchword> found in about.html
Does that make sense? I'm not certain which Perl version is installed on the server I'm working on, so a brute-force method might be best (I'm not certain I can add modules, etc.). Thank you all for your help!
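
Something along these lines is what I'm picturing (an untested sketch; ./htdocs is a placeholder for the real document root, and File::Find ships with Perl, so no extra modules would be needed):

#!/usr/bin/perl
# wordsearch.pl -- untested sketch; ./htdocs is a placeholder document root
use strict;
use warnings;
use File::Find;   # core module, no extra install needed

my $word    = shift or die "Usage: $0 <searchword>\n";
my $docroot = './htdocs';                     # change to the real document root

my @hits;
find(sub {
    return unless -f && /\.html?$/i;          # only regular .html/.htm files
    open my $fh, '<', $_ or return;
    local $/;                                 # slurp the whole file
    my $content = <$fh>;
    push @hits, $File::Find::name
        if $content =~ /\b\Q$word\E\b/i;      # case-insensitive whole-word match
}, $docroot);

print "Search Results:\n";
print "$word found in $_\n" for @hits;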

Replies are listed 'Best First'.
Re: search across website for particular terms
by almut (Canon) on Jul 30, 2008 at 21:58 UTC

    Not saying that Perl couldn't do it too... but maybe a simple recursive grep is sufficient. Something like

    grep -r searchword /path/to/htdocs/

    (Of course that would search in the HTML markup etc. as well, not just the rendered page content... — Update: if that's a problem, you'd have to parse the HTML, for example using modules such as HTML::TokeParser, HTML::TokeParser::Simple or HTML::Parser. To recurse through the document tree, you could use File::Find)
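
    A rough, untested sketch of that approach, using HTML::TokeParser (from the HTML-Parser distribution) together with File::Find; /path/to/htdocs is a placeholder:

    use strict;
    use warnings;
    use File::Find;
    use HTML::TokeParser;    # part of the HTML-Parser distribution

    my $word    = shift or die "Usage: $0 <searchword>\n";
    my $docroot = '/path/to/htdocs';    # placeholder

    find(sub {
        return unless -f && /\.html?$/i;
        my $p = HTML::TokeParser->new($_) or return;
        my $text = '';
        while (my $token = $p->get_token) {
            # collect only text tokens, skipping the markup itself
            # (text inside <script>/<style> slips through too -- fine for a sketch)
            $text .= " $token->[1]" if $token->[0] eq 'T';
        }
        print "$word found in $File::Find::name\n"
            if $text =~ /\b\Q$word\E\b/i;
    }, $docroot);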

    Update 2: Maybe you could save yourself a lot of work by using some existing software like ht://Dig to index all of your pages, and then issue a simple search whenever you want to know which documents contain a particular term...

Re: search across website for particular terms
by planetscape (Chancellor) on Jul 31, 2008 at 09:59 UTC

    Several years ago, when I was very new to Perl, I needed to do something a lot like this - I have a wonderful vet, but she thinks computers are some sort of Black Art, and wouldn't even get on the internet. I had just gotten two pet rabbits, and while my vet had treated rabbits before, we agreed that some up-to-date info on this "exotic" pet would be helpful. I mirrored several sites that had some really good info (using a PC Magazine utility called "Site Snagger", though today I'd likely use w3mir or something else), placed the files on CD-ROM, and needed to build a useful index.

    First I needed to know what was in all the files, not being a vet myself (I think I just split the canonicalized text on spaces or word boundaries). I also needed to filter out "stop words" (see also: Lingua::StopWords) - and HTML tags. Much of this I did manually, being, as I said, very new to Perl. ;-)
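
    Today, that stop-word step might look something like this (an untested sketch, assuming Lingua::StopWords is installed; $text stands in for the de-tagged page content):

    use strict;
    use warnings;
    use Lingua::StopWords qw(getStopWords);

    my $text  = 'The quick brown fox jumps over the lazy dog';   # placeholder content
    my $stop  = getStopWords('en');                  # hashref of common English stop words
    my @terms = grep { length && !$stop->{$_} }      # drop empties and stop words
                split /\W+/, lc $text;

    print "@terms\n";   # prints something like: quick brown fox jumps lazy dog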

    But essentially what my index-building process looked like was to take a list of the terms I did want to include, and for each term, iterate through my documents (recursing on directories) to find which ones contained which terms. I did this with nothing more complicated than nested loops and grep, as almut suggested.
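
    In today's terms, the core of it was something like this (untested sketch; the term list and document root below are placeholders):

    use strict;
    use warnings;
    use File::Find;

    my @terms   = qw(hay pellets molt);          # placeholder glossary terms
    my $docroot = '/path/to/mirrored/site';      # placeholder

    # Gather the documents once...
    my @docs;
    find(sub { push @docs, $File::Find::name if -f && /\.html?$/i }, $docroot);

    # ...then, for each term, record which documents mention it.
    my %index;
    for my $term (@terms) {
        for my $doc (@docs) {
            open my $fh, '<', $doc or next;
            local $/;
            my $content = <$fh>;
            push @{ $index{$term} }, $doc if $content =~ /\b\Q$term\E\b/i;
        }
    }

    # terms with no hits simply don't appear in the output
    print "$_: @{ $index{$_} }\n" for sort keys %index;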

    I won't show you any code - as an indication of how "green" I was, my scripts used neither use strict; nor use warnings;, but I did use File::Slurp. ;-)

    You might also find modules such as Ted Pedersen's Ngram Statistics Package useful for building lists of the words that are in your files now. Some of the advice in Creating Dictionaries may also be helpful.


    Update: Oopsie... wrong Ted Pedersen link...

    HTH,

    planetscape
Re: search across website for particular terms
by eosbuddy (Scribe) on Jul 31, 2008 at 03:10 UTC
    Don't know if this will work for you - found it using goo(d)o(ld)gle :-) http://www.linuxjournal.com/article/2200