At work, we've written a distributed web spider... basically
it's a forking model that then gets thrown around on a
MOSIX cluster... but anyways, I digress. What we've done
is use the Parse::RecDescent module from CPAN and build up
a grammar for parsing web pages. Then we describe
a website using the metalanguage described above, and it
generates an automaton that goes out, grabs the web page,
and extracts the important parts. Very flexible, very powerful,
and we can parse millions of pages a day with it.
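For a flavour of the approach, here's a minimal Parse::RecDescent sketch; the grammar, rule names, and URL below are just made up for illustration, not our actual metalanguage:

    use strict;
    use warnings;
    use Parse::RecDescent;
    use LWP::Simple qw(get);

    # Toy grammar: skip ahead to the <title> tag and capture its text.
    my $grammar = q{
        page    : prelude title            { $item{title} }
        prelude : /.*?(?=<title>)/s        # consume everything before <title>
        title   : '<title>' /[^<]+/ '</title>'  { $item[2] }
    };

    my $parser = Parse::RecDescent->new($grammar) or die "bad grammar\n";

    my $html  = get('http://www.example.com/') or die "fetch failed\n";
    my $title = $parser->page($html);          # start rule is a method call
    print defined $title ? "title: $title\n" : "no title found\n";

In the real thing the grammar is generated from the website description, but the mechanics are the same: one rule set per site, one automaton per page type.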
If you know exactly what you're looking for, and the pages come in no particular order (all pages equally important, not a tree; uncontrolled HTML you didn't write), *nothing* beats a forking LWP get, slurp, match, except multiple machines doing (fork, get, slurp, match) in parallel.
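Roughly, the (fork, get, slurp, match) pattern looks like this; the URL list, worker count, and pattern are invented for the sketch:

    use strict;
    use warnings;
    use LWP::Simple qw(get);

    my @urls = map { "http://www.example.com/page$_.html" } 1 .. 20;
    my $kids = 5;                             # number of parallel workers

    # Deal the URLs out round-robin, one bucket per worker.
    my @bucket;
    push @{ $bucket[ $_ % $kids ] }, $urls[$_] for 0 .. $#urls;

    for my $worker (0 .. $kids - 1) {
        my $pid = fork;
        die "fork failed: $!\n" unless defined $pid;
        next if $pid;                         # parent keeps forking
        for my $url (@{ $bucket[$worker] }) {
            my $page = get($url) or next;     # get + slurp
            print "$url\t$1\n"                # match
                if $page =~ m{<title>([^<]*)</title>}i;
        }
        exit 0;                               # child is done
    }
    1 while wait != -1;                       # parent reaps the workers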
A parser has to handle all sorts of fat, sloppy, mixed-content, rarely correct HTML written by fools with Word, FrontPage, or Dreamweaver so that it looks good. I take it you're looking for something specific.
I've never looked at the Parse::RecDescent module, though; it must be doing something less involved than HTML::Parser. I'll look into it, thanks, but you'll need to write your own anyway. So: "keep it short and simple", "spread the work", "tune the fork(s)".
"tips:"
Slurp with $/ keyed on what you're looking for if it's likely to be near the beginning of the page, or set $/ to undef (and use m//g) if it's not, or if there are several occurrences in unpredictable places.
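For instance (the filename, marker, and patterns here are just placeholders):

    use strict;
    use warnings;

    {
        local $/ = '</title>';          # records end right after the title
        open my $fh, '<', 'page.html' or die "page.html: $!\n";
        my $chunk = <$fh>;              # first record usually holds the title
        print "title: $1\n" if defined $chunk && $chunk =~ m{<title>([^<]*)}i;
        close $fh;
    }

    {
        local $/;                       # undef: slurp the whole page at once
        open my $fh, '<', 'page.html' or die "page.html: $!\n";
        my $page = <$fh>;
        print "link: $1\n" while $page =~ m{<a\s+href="([^"]+)"}gi;  # m//g over it all
        close $fh;
    }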
(m//g & pos) is real quick and nestable.
Because multiple forks and multiple machines are asynchronous by nature, they make a mighty engine over TCP/IP once you spread the work across them.
Vane