# ... having got to a certain page
my ($place_where_the_links_are) =
    $mech->content() =~ m/LONGFIDDLYCAPTURINGREGEXHERE/
    or warn "no match -- has the page layout changed?";
# get the links from that place and continue
and another script B which does the same, and another script C with yet another regex, and so on. Then at least I could push the complexity down into a module or object, not have to see it, and not have to write multiple scripts for multiple publications.
And the moment the NYT changes their HTML, someone could figure out the new regex and update Scraper::Newspapers::NYT or whatever it would be.
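For illustration, here's a minimal sketch of such a module. Only the Scraper::Newspapers::NYT name comes from the post above; the method name and the regex are invented stand-ins for whatever the real markup needs.

package Scraper::Newspapers::NYT;
use strict;
use warnings;

# The one fiddly, site-specific pattern lives here and nowhere else.
# It is made up for this sketch; the real one must match the NYT's
# actual markup.
my $LINK_AREA_RE = qr{<div \s+ class="story-links"> (.*?) </div>}xs;

sub extract_link_area {
    my ($class, $html) = @_;
    my ($area) = $html =~ $LINK_AREA_RE
        or return;    # undef means the layout changed; fix it here only
    return $area;
}

1;

Every calling script can then stay identical:

my $area = Scraper::Newspapers::NYT->extract_link_area($mech->content);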
Oh, I see... I think.
At the moment, it sounds like you're describing a collection of site-specific parsers, like the Finance::Quote tree, perhaps?
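For anyone who hasn't used it, the calling side of Finance::Quote looks roughly like this (from memory, so treat the source name and field names as approximate):

use strict;
use warnings;
use Finance::Quote;

my $q = Finance::Quote->new;

# fetch() dispatches to the matching Finance::Quote::* sub-module
# for the named source; the caller never sees the per-site parsing.
my %info = $q->fetch('yahoo', 'IBM');

# Results come back in a flat hash keyed by symbol and attribute.
print $info{'IBM', 'last'}, "\n" if $info{'IBM', 'success'};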
Aha, yes, that looks like the kind of thing.
They have "Finance::Quote::Yahoo" and "Finance::Quote::Tdwaterhouse" and so on.
Presumably there's some kind of updating mechanism which only updates the "Tdwaterhouse" part when they change their HTML?
I will research further, thank you.
That was the kind of thing I was thinking of, yes.
A top-level scraper which loads, on demand, a sub-scraper for a specific site. Although finance quotes are of course much more regular in form than "interesting articles from online papers"...
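Something like this rough sketch, say, with invented package names; the only real requirement is that every sub-scraper honour the same small interface:

package Scraper::Newspapers;
use strict;
use warnings;

# Map a short name like 'NYT' to its module, loading it on demand.
sub scraper_for {
    my ($class, $paper) = @_;
    my $module = "Scraper::Newspapers::$paper";
    (my $file = "$module.pm") =~ s{::}{/}g;
    require $file;    # compiled only the first time it's asked for
    return $module;
}

1;

# A calling script then needs no site-specific knowledge at all:
# my $scraper = Scraper::Newspapers->scraper_for('NYT');
# my $area    = $scraper->extract_link_area($mech->content);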