in reply to How To Write A Scraper?

I don't think there *can* be a general scraping module. There probably aren't as many ways to lay out an HTML page as there are sites out there, but it's close. With our current tools, you simply need to know the HTML of a page before you can parse it. That's the only guaranteed way of getting what you want. That said, there are workarounds: see the last time this question came up.

--
jpg

Replies are listed 'Best First'.
Re^2: How To Write A Scraper?
by Cody Pendant (Prior) on Jul 04, 2005 at 00:09 UTC
    I don't disagree that you need to know the HTML before you start. My point was more about creating a scraper object, rather than procedural code, in which the part that "knows the HTML" is a neat sub-section, whether that's a regex or instructions for a parser.

    So rather than have script A which says

        # ... having got to a certain page
        $mech->content() =~ m/LONGFIDDLYCAPTURINGREGEXHERE/;
        my $place_where_the_links_are = $1;
        # get the links from that place and continue
    and another script B which does the same, and another script C with yet another, and so on, at least I could push the complexity down into a module, object, etc., and not have to see it, and not have to write multiple scripts for multiple publications.

    And the moment the NYT changes their HTML, someone could figure out the new regex and update Scraper::Newspapers::NYT or whatever it would be.
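
    For what it's worth, here's a minimal sketch of what one of those per-site modules might look like (the package name, method name and regex are all invented for illustration; only the sub that "knows the HTML" would ever need editing when the site changes its layout):

        package Scraper::Newspapers::NYT;
        use strict;
        use warnings;

        # The one sub that "knows the HTML" -- the only thing that
        # needs touching when the site changes its markup.
        sub article_links {
            my ($class, $html) = @_;
            my @links;
            while ($html =~ m{<a href="(/[^"]+\.html)"}g) {
                push @links, $1;
            }
            return @links;
        }

        1;

    Every calling script would then just do Scraper::Newspapers::NYT->article_links($mech->content) and never see the regex itself.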



    ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
    =~y~b-v~a-z~s; print
      Oh, I see....I think.
      At the moment, it sounds like you're describing a collection of site-specific parsers, like the Finance::Quote tree, perhaps?
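      From memory, the nice thing about that arrangement is that the calling code never mentions any site-specific details; the Finance::Quote synopsis is roughly (field names from memory, so check the docs):

          use Finance::Quote;

          my $q = Finance::Quote->new;
          # "yahoo" selects the Finance::Quote::Yahoo module behind the scenes
          my %quotes = $q->fetch( "yahoo", "IBM" );
          print "last price: ", $quotes{ "IBM", "last" }, "\n";

      When a site changes its HTML, only the corresponding sub-module needs a fix; scripts like the above stay untouched.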
      --
      jpg
        Aha, yes, that looks like the kind of thing.

        They have "Finance::Quote::Yahoo" and "Finance::Quote::Tdwaterhouse" and so on.

        Presumably there's some kind of updating mechanism which only updates the "Tdwaterhouse" part when they change their HTML?

        I will research further, thank you.



        ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
        =~y~b-v~a-z~s; print
        That was the kind of thing I was thinking of, yes.

        A top-level scraper which loads, when required, a sub-scraper for a specific area. Although finance quotes are of course very much more specific in form than "interesting articles from online papers"...
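
        Just to make that concrete, a rough sketch of how the top level might hand off to a sub-scraper loaded on demand (all the names here are made up; nothing like this exists yet as far as I know):

            package Scraper::Newspapers;
            use strict;
            use warnings;

            # Load the per-site module at run time and hand it the page.
            sub scrape {
                my ($class, $site, $html) = @_;
                my $module = "Scraper::Newspapers::$site";    # e.g. "NYT"
                eval "require $module" or die "No scraper for $site: $@";
                return $module->article_links($html);
            }

            1;

        Updating the scraper for one paper would then only mean touching that one sub-module, much as you'd hope only Finance::Quote::Tdwaterhouse needs a patch when that site's HTML changes.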



        ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
        =~y~b-v~a-z~s; print