
Re: Harvesting and Parsing HTML from other sites

by marius (Hermit)
on Mar 28, 2001 at 09:31 UTC ( #67753=note )

in reply to Harvesting and Parsing HTML from other sites

First, change your @pages array to a hash. Then you can step through it with:
foreach my $page (keys %pages) { }
rather than the cumbersome and obfuscated for(){} loop above.
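
To make that concrete, here is a minimal sketch of the hash version. I haven't seen what your @pages actually holds, so the keys and URLs below are placeholders I made up, and @visited exists only to show the order of iteration:

```perl
use strict;
use warnings;

# Hypothetical data: a label for each site mapped to its URL.
# These names and URLs are placeholders, not taken from the original script.
my %pages = (
    news    => 'http://www.example.com/news.html',
    weather => 'http://www.example.com/weather.html',
);

# keys() returns the keys in no particular order, so sort them
# if you want the pages processed in a predictable sequence.
my @visited;
foreach my $page (sort keys %pages) {
    my $url = $pages{$page};
    print "$page => $url\n";
    push @visited, $page;
}
```

The point of the hash is that each page carries its own label, so the loop body never has to juggle an index.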

Second, a lot of your regexes don't need the /s modifier. It only changes whether . matches a newline, so it has no effect on a pattern that contains no dot. See perldoc perlre for details.
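
If it helps, this small sketch shows where /s matters and where it doesn't. The $html string is a made-up sample, not something from your script:

```perl
use strict;
use warnings;

# Made-up sample with a newline inside the element
my $html = "<b>bold\ntext</b>";

# Without /s, "." will not cross the newline, so this fails to match
my $without = $html =~ m{<b>(.*)</b>}  ? 1 : 0;
# With /s, "." matches the newline too, so this succeeds
my $with    = $html =~ m{<b>(.*)</b>}s ? 1 : 0;

# No dot in the pattern: /s changes nothing
my $plain   = $html =~ m{<b>}  ? 1 : 0;
my $plain_s = $html =~ m{<b>}s ? 1 : 0;

print "dot pattern, without /s: $without, with /s: $with\n";
print "no dot, without /s: $plain, with /s: $plain_s\n";
```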

Third, use strict.
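
As a rough illustration of what strict buys you, the sketch below compiles a snippet containing a misspelled variable. The names $count and $cuont are invented for the example, and the string eval is only there so the failure can be captured and printed instead of killing the script:

```perl
use strict;
use warnings;

# Hypothetical snippet with a typo: $cuont was never declared.
my $code = q{
    my $count = 10;
    $cuont + 1;
};

# Under strict the snippet refuses to compile, so eval returns undef
# and the reason is left in $@.
my $result = eval $code;
my $error  = $@;
print "strict refused to compile it: $error" if !defined $result;
```

Without strict, the typo would silently become a new empty global and the bug would show up much later as a wrong result.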

And now for the actual code errors: I don't see where you set $keeperlength before using it in your nested for(){} loop. Incidentally, your conversion of <tag> to {{{tag}}} doesn't account for things like <br />. That's a minor nitpick, though. Other than that, I can't see why it would "revert" to the original $html variable.

Want to fix the things I've pointed out (or point out the flaws in my thinking, as the case may be =]) and try again? If it still doesn't work, point us to some pages that work and some that don't, and we'll continue hammering.
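
On the <br /> nitpick, here is one possible way to handle it. This is a guess at the intent, not a drop-in replacement for your substitution: it assumes you only want the tag name inside the braces and are happy to throw away attributes and the trailing slash. The sample $html is invented:

```perl
use strict;
use warnings;

# Invented sample containing both <br /> and <br>
my $html = 'one<br />two<br><b>x</b>';

# Capture an optional leading slash plus the tag name, then skip
# everything else up to the closing ">" (attributes, " /", and so on).
# Assumption: attributes are not needed in the {{{tag}}} form.
(my $marked = $html) =~ s!<(/?\w+)[^>]*>!{{{$1}}}!g;

print "$marked\n";
```

A regex like this is still fragile on real-world HTML (a ">" inside an attribute value will break it), so treat it as a stopgap.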

Good luck!

