PerlMonks  

Re: Most efficient way to parse web pages

by eduardo (Curate)
on Jun 19, 2000 at 07:17 UTC


in reply to Most efficient way to parse web pages

At work, we've written a distributed web spider. Basically it's a forking model that then gets thrown around on a MOSIX cluster... but anyway, I digress. What we've done is use the Parse::RecDescent module from CPAN to build up a grammar for parsing web pages. We then describe a website in the metalanguage that grammar defines, and it generates an automaton that goes out, grabs the web page, and extracts the important parts. Very flexible, very powerful, and we can parse millions of pages a day with it.
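To make the "describe a site, generate an extractor" idea concrete, here is a minimal sketch. It is not the grammar or spider described in this post. It is written in Python with a small hand-rolled recursive descent parser rather than Parse::RecDescent, and the one-rule metalanguage (`field NAME between "START" and "END"`) is purely hypothetical. Names such as `compile_extractor`, `DESCRIPTION`, and `PAGE` are made up for illustration, and fetching the page over the network is left out.

```python
import re

# Tokens of the hypothetical metalanguage: a quoted string or a bare word.
TOKEN = re.compile(r'\s*(?:"([^"]*)"|(\w+))')


def tokenize(text):
    """Split a description into ('str', value) and ('word', value) tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected input at offset {pos}")
        if m.group(1) is not None:
            tokens.append(("str", m.group(1)))
        else:
            tokens.append(("word", m.group(2)))
        pos = m.end()
    return tokens


class Parser:
    """Recursive descent parser: one method per grammar rule."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def expect(self, kind, value=None):
        # Consume one token, checking its kind and (optionally) its value.
        if self.pos >= len(self.tokens):
            raise SyntaxError("unexpected end of description")
        tok_kind, tok_value = self.tokens[self.pos]
        if tok_kind != kind or (value is not None and tok_value != value):
            raise SyntaxError(f"expected {value or kind}, got {tok_value!r}")
        self.pos += 1
        return tok_value

    def description(self):
        # description := rule+
        rules = [self.rule()]
        while self.pos < len(self.tokens):
            rules.append(self.rule())
        return rules

    def rule(self):
        # rule := "field" NAME "between" STRING "and" STRING
        self.expect("word", "field")
        name = self.expect("word")
        self.expect("word", "between")
        start = self.expect("str")
        self.expect("word", "and")
        end = self.expect("str")
        return name, start, end


def compile_extractor(description):
    """Parse a site description and return a function that extracts fields."""
    rules = Parser(tokenize(description)).description()

    def extract(html):
        found = {}
        for name, start, end in rules:
            begin = html.find(start)
            if begin == -1:
                continue  # start marker missing: skip this field
            begin += len(start)
            stop = html.find(end, begin)
            if stop != -1:
                found[name] = html[begin:stop].strip()
        return found

    return extract


# Hypothetical site description and page, for illustration only.
DESCRIPTION = '''
field title between "<title>" and "</title>"
field price between "<b>" and "</b>"
'''
PAGE = "<html><title> Widget </title><body>Cost: <b>9.99</b></body></html>"

extract = compile_extractor(DESCRIPTION)
print(extract(PAGE))  # {'title': 'Widget', 'price': '9.99'}
```

The point of the split is that the per-site knowledge lives in the description text, so supporting a new site means writing a new description, not new parsing code. In Perl, Parse::RecDescent would take over the tokenizer and `Parser` class, with the rules written as a grammar string instead.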
