Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Google has failed me, so I am looking to build a program that will scour a website for a certain keyword. The website consists of thousands of pages. These are simple HTML pages (no Flash, AJAX, etc.), just plain text.

I am thinking it would be best to use Mech. Does that sound reasonable? What would be the most efficient tool to use? And on average, how long would it take to go through a large website of several thousand pages (about 5,000)? Should I consider forking to run requests in parallel?

Thanks.

Replies are listed 'Best First'.
Re: Browsing website for keyword
by Your Mother (Archbishop) on May 24, 2010 at 01:30 UTC

    Your spec is vague. This is what I'd suggest given what you're asking: http://www.google.com/search?q=keyword+site%3Aperlmonks.org

    If you are spidering a site, you'd better have permission, or at least make sure it's allowed by the site's ToS.

    If you're doing something ethical but for some reason can't rely on Google: Mech or LWP::UserAgent would be fine. 5,000 pages would probably only take about 20-30 minutes to spider without parallel requests, but either way you might really be hammering the website. A dynamic site getting constant requests like that can be smothered, depending on its server/architecture.
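
    If you do roll your own spider, a minimal single-process sketch with WWW::Mechanize might look something like this (untested; it assumes the start URL and keyword come from the command line, follows same-host links only, and has no politeness delay or robots.txt handling -- things a real crawler should add, e.g. via LWP::RobotUA):

        # Crawl same-host links from a start URL and report pages whose
        # plain text contains the keyword.  Sketch only.
        use strict;
        use warnings;
        use WWW::Mechanize;
        use URI;

        die "usage: $0 <start_url> <keyword>\n" unless @ARGV == 2;
        my ( $start, $keyword ) = @ARGV;
        my $host = URI->new($start)->host;

        my $mech  = WWW::Mechanize->new( autocheck => 0 );
        my %seen;
        my @queue = ($start);

        while ( my $url = shift @queue ) {
            next if $seen{$url}++;
            $mech->get($url);
            next unless $mech->success and $mech->is_html;

            # Plain-text rendering of the page (needs HTML::TreeBuilder).
            my $text = $mech->content( format => 'text' );
            print "MATCH: $url\n" if $text =~ /\Q$keyword\E/i;

            # Queue links that stay on the same host.
            for my $link ( $mech->links ) {
                my $abs = $link->url_abs or next;
                next unless $abs->scheme =~ /\Ahttps?\z/ and $abs->host eq $host;
                $abs->fragment(undef);          # ignore #anchors
                push @queue, $abs->as_string;
            }
        }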

    KinoSearch 0.3 or better is great for search engines after the fact. With KSx::Simple + $mech->content(format => "text") you could have a lightning-fast basic search engine done in about 20 lines of code (update: search part only). I know that's a realistic estimate because I did one, against a backend DB, last night. :)
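
    A bare-bones sketch of that idea (the index path, field names, and URL list are invented for the example, and I'm going from memory on the KSx::Simple calls, so check its docs for the exact interface):

        # Index the pages once, then search the local index -- sketch only.
        use strict;
        use warnings;
        use KSx::Simple;
        use WWW::Mechanize;

        my $index = KSx::Simple->new(
            path     => '/tmp/site_index',    # invented location
            language => 'en',
        );

        my $mech      = WWW::Mechanize->new;
        my @page_urls = @ARGV;                # however you gathered them
        for my $url (@page_urls) {
            $mech->get($url);
            $index->add_doc({
                url     => $url,
                content => $mech->content( format => 'text' ),
            });
        }

        # The search half:
        my $hits = $index->search( query => 'keyword' );
        print "$hits hit(s)\n";
        while ( my $hit = $index->fetch_hit_hashref ) {
            print "$hit->{url}\n";
        }

    The win is that the crawl only happens once; every later keyword search hits the local index instead of the site.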

    Update: fixed mod://link (thanks rowdog)

Re: Browsing website for keyword
by Anonymous Monk on May 23, 2010 at 21:38 UTC
    Google has failed me, so I am looking to build a program that will scour a website for a certain keyword.

    No, you're looking for http://swish-e.org/