hodashirzad has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I am writing a crawler in Perl for price comparison, and I want to know if there is a way to build a crawler that, no matter what URL you pass to it, will find the products and prices by recognising the website's structure. If so, how? I just need a bit of an introduction and some guidance; I am completely new to this.

Thanks in advance.

Replies are listed 'Best First'.
Re: Crawler in perl
by naikonta (Curate) on Apr 22, 2007 at 11:29 UTC
    Hmm, let's see. How about you start with WWW::Mechanize? In my own experience, once I give it a URL, the crawler just walks through, no matter how many times it gets redirected. I'm not sure what you mean by 'recognizing', but I do need to teach it a little about how to find what I want.
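    A minimal sketch of that starting point (the start URL is a placeholder, and a real crawler would also want politeness delays and robots.txt handling):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use URI;
        use WWW::Mechanize;

        # Breadth-first crawl: fetch a page, queue every same-host link.
        my $mech  = WWW::Mechanize->new( autocheck => 0 );
        my @queue = ('http://www.example.com/');    # placeholder start URL
        my %seen;

        while ( my $url = shift @queue ) {
            next if $seen{$url}++;
            $mech->get($url);                       # redirects are followed for us
            next unless $mech->success && $mech->is_html;

            print "Fetched: ", $mech->uri, "\n";

            for my $link ( $mech->links ) {
                my $abs = $link->url_abs or next;
                next unless $abs->can('host')       # skip mailto: and friends
                         && $abs->host eq URI->new($url)->host;
                push @queue, $abs->as_string;
            }
        }

    WWW::Mechanize is a subclass of LWP::UserAgent, so anything LWP can do is available on top of this.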

    Open source software? Share and enjoy. Make profit from it if you can. Yet, share and enjoy!

      Well, I have already built a crawler that works with three websites, but that's only because I told it exactly where to look for the products on each site (so it won't work with any other website). What I want to know is: is there a way to discover a website's template by comparing the pages within a site (for example, finding repeated content inside <td> tags, classifying it as template, and disregarding it), so that no matter what website you give the crawler, it can find the products and prices? I honestly don't know if such a thing exists, but people have been asking me if I can build a crawler that works with most websites, and I wonder how big sites such as Kelkoo work.
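      To make the idea concrete, here is a rough sketch of the comparison I mean (the URLs are invented, and a real version would need a smarter chunking unit than <td>):

          #!/usr/bin/perl
          use strict;
          use warnings;
          use LWP::Simple qw(get);
          use HTML::TreeBuilder;

          # Invented URLs: two product pages from the same shop.
          my @urls = (
              'http://www.example-shop.com/product/1',
              'http://www.example-shop.com/product/2',
          );

          my %count;    # on how many pages each text chunk appears
          my @pages;    # per-page lists of chunks

          for my $url (@urls) {
              my $html = get($url) or die "Can't fetch $url";
              my $tree = HTML::TreeBuilder->new_from_content($html);

              # Treat the text of each <td> cell as one chunk.
              my @chunks = map { $_->as_trimmed_text }
                           $tree->look_down( _tag => 'td' );
              push @pages, \@chunks;

              my %seen;
              $count{$_}++ for grep { !$seen{$_}++ } @chunks;
              $tree->delete;
          }

          # Chunks present on every page are template; what is left is
          # page-specific content, which is where products and prices live.
          for my $i ( 0 .. $#pages ) {
              print "--- unique to page $i ---\n";
              print "$_\n" for grep { $count{$_} < @urls } @{ $pages[$i] };
          }

      Of course this breaks down on sites that don't lay their content out in tables, which I suspect is part of why a truly site-agnostic crawler is hard.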

        I hadn't heard of Kelkoo before, but three clicks through their "About Us" page to their FAQ bring up the answer:

        Q: Does Kelkoo search all shops on the web?
        A: No, efficiently comparing prices from all shops on the web would be extremely difficult because there are far too many of them. Instead, we select a wide group of shops including big high street names and specialist internet shops. We are constantly looking for shops to add to our affiliate programme, and if we find a shop that has better offers than our current set, we contact them and try to include them on Kelkoo. If you can find a better price elsewhere, we'd love to hear it!

        So they're more than likely writing scrapers for the sites they're specifically interested in, or they're probably big enough (as part of Yahoo) to have worked out some sort of arrangement with the source site to provide raw data.

        Now there are approaches such as this Ruby work, which provides a DSL (domain-specific language) for describing scrapers in DOM/CSS terms, making it easier to build up scrapers for new sites. I'm not aware of any Perl implementations of this idea, but it might steer you in the right direction.

        No ... not at this moment. Maybe if the Semantic Web ever takes off, but I'm not holding my breath on that happening.
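        In the meantime, a hand-rolled version of the per-site selector idea might look something like the sketch below, using HTML::TreeBuilder::XPath plus HTML::Selector::XPath to get CSS-style selectors. The site entries and selectors are invented for illustration:

            #!/usr/bin/perl
            use strict;
            use warnings;
            use LWP::Simple qw(get);
            use HTML::TreeBuilder::XPath;
            use HTML::Selector::XPath qw(selector_to_xpath);

            # One entry per supported shop; all selectors here are invented.
            my %sites = (
                'shop-a.example.com' => { name => 'td.product', price => 'td.price' },
                'shop-b.example.com' => { name => 'h2.item',    price => 'span.cost' },
            );

            sub scrape {
                my ( $host, $url ) = @_;
                my $rules = $sites{$host} or die "No scraper defined for $host";

                my $tree   = HTML::TreeBuilder::XPath->new_from_content( get($url) );
                my @names  = map { $_->as_trimmed_text }
                             $tree->findnodes( selector_to_xpath( $rules->{name} ) );
                my @prices = map { $_->as_trimmed_text }
                             $tree->findnodes( selector_to_xpath( $rules->{price} ) );
                $tree->delete;

                # Pair names with prices positionally -- crude, but it shows
                # the shape of the approach.
                return map { { name => $names[$_], price => $prices[$_] } }
                       0 .. $#names;
            }

            for my $p ( scrape( 'shop-a.example.com',
                                'http://shop-a.example.com/list' ) ) {
                print "$p->{name}: $p->{price}\n";
            }

        Supporting a new shop is then just a matter of adding an entry to %sites, which I'd guess is close to what the price-comparison sites do, only at scale.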

        -derby