in reply to How To Write A Scraper?

I don't think there *can* be a general scraping module. There probably aren't as many ways to lay out an HTML page as there are sites out there, but it's close. With our current tools, you simply need to know the HTML of a page before you can parse it. That's the only guaranteed way of getting what you want. That said, there are workarounds: see the last time this question came up.

--
jpg

Replies are listed 'Best First'.
Re^2: How To Write A Scraper?
by Cody Pendant (Prior) on Jul 04, 2005 at 00:09 UTC
    I don't disagree that you need to know the HTML before you start. My point was more about creating a scraper object, rather than procedural code, in which the part that "knows the HTML" is a neat sub-section, whether that's a regex or instructions for a parser.

    So rather than have script A which says

        # ... having got to a certain page
        $mech->content() =~ m/LONGFIDDLYCAPTURINGREGEXHERE/;
        my $place_where_the_links_are = $1;
        # get the links from that place and continue
    and another script B which does the same, and another script C with yet another, and so on, at least I could push the complexity down into a module, object, etc., and not have to see it, and not have to write multiple scripts for multiple publications.

    And the moment the NYT changes their HTML, someone could figure out the new regex and update Scraper::Newspapers::NYT or whatever it would be.
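
    For what it's worth, here's a minimal sketch of what one of those per-site modules might look like (the package name, method name and regex are all invented for illustration; only the sub that "knows the HTML" would ever need editing when the site changes its layout):

        package Scraper::Newspapers::NYT;
        use strict;
        use warnings;

        # The one sub that "knows the HTML" -- the only thing that
        # needs touching when the site changes its markup.
        sub article_links {
            my ($class, $html) = @_;
            my @links;
            while ($html =~ m{<a href="(/[^"]+\.html)"}g) {
                push @links, $1;
            }
            return @links;
        }

        1;

    Every calling script would then just do Scraper::Newspapers::NYT->article_links($mech->content) and never see the regex itself.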



    ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
    =~y~b-v~a-z~s; print
      Oh, I see....I think.
      At the moment, it sounds like you're describing a collection of site-specific parsers, like the Finance::Quote tree, perhaps?
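      From memory, the nice thing about that arrangement is that the calling code never mentions any site-specific details; the Finance::Quote synopsis is roughly (field names from memory, so check the docs):

          use Finance::Quote;

          my $q = Finance::Quote->new;
          # "yahoo" selects the Finance::Quote::Yahoo module behind the scenes
          my %quotes = $q->fetch( "yahoo", "IBM" );
          print "last price: ", $quotes{ "IBM", "last" }, "\n";

      When a site changes its HTML, only the corresponding sub-module needs a fix; scripts like the above stay untouched.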
      --
      jpg
        Aha, yes, that looks like the kind of thing.

        They have "Finance::Quote::Yahoo" and "Finance::Quote::Tdwaterhouse" and so on.

        Presumably there's some kind of updating mechanism which only updates the "Tdwaterhouse" part when they change their HTML?

        I will research further, thank you.



        ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
        =~y~b-v~a-z~s; print
        That was the kind of thing I was thinking of, yes.

        A top-level scraper which loads, when required, a sub-scraper for a specific area. Although finance quotes are of course very much more specific in form than "interesting articles from online papers"...
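
        Just to make that concrete, a rough sketch of how the top level might hand off to a sub-scraper loaded on demand (all the names here are made up; nothing like this exists yet as far as I know):

            package Scraper::Newspapers;
            use strict;
            use warnings;

            # Load the per-site module at run time and hand it the page.
            sub scrape {
                my ($class, $site, $html) = @_;
                my $module = "Scraper::Newspapers::$site";    # e.g. "NYT"
                eval "require $module" or die "No scraper for $site: $@";
                return $module->article_links($html);
            }

            1;

        Updating the scraper for one paper would then only mean touching that one sub-module, much as you'd hope only Finance::Quote::Tdwaterhouse needs a patch when that site's HTML changes.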



        ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
        =~y~b-v~a-z~s; print