Parsing HTML

s_club_seven has asked for the wisdom of the Perl Monks concerning the following question:

Re: Parsing HTML by merlyn (Sage) on Mar 03, 2001 at 22:33 UTC
HTML::TableExtract was made precisely for that.
-- Randal L. Schwartz, Perl hacker
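A minimal sketch of how HTML::TableExtract might be applied here. It is not part of the original reply, and since the question's HTML was never shown, the sample table and the header names `Course` and `Time` are assumptions made only for illustration.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical sample input; the real page was not posted in the thread.
my $html = <<'HTML';
<table>
  <tr><th>Course</th><th>Time</th></tr>
  <tr><td>CS101</td><td>09:00</td></tr>
  <tr><td>MA201</td><td>13:30</td></tr>
</table>
HTML

# Pick out the table by its header text; only the columns under
# these (assumed) headers are returned.
my $te = HTML::TableExtract->new( headers => [ 'Course', 'Time' ] );
$te->parse($html);

my @schedule;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        my ( $course, $time ) = @$row;
        push @schedule, [ $course, $time ];
    }
}

printf "%s at %s\n", @$_ for @schedule;
```

Selecting by header text rather than by position means the code does not need to count TD cells, which is the part of the task the question seemed worried about.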
Re: Parsing HTML by TheoPetersen (Priest) on Mar 03, 2001 at 22:40 UTC
When faced with this kind of task, a lot of Perl coders:
1. see that HTML is not that hard, and figure on parsing it manually;
2. find out that HTML is deceptive (or the person or process that writes the file writes lousy HTML) and figure on using a tool;
3. discover HTML::Parser, read the doc and say "that's too hard!";
4. go back to parsing it manually and come up with something that works as long as nothing ever changes.

At least, that's how me and my co-workers did it once :)

So as a result, I'd suggest looking at HTML::Parser or one of its relatives. I used HTML::TreeBuilder to parse some quite large and unreliable HTML files and found that it worked great. The tricky bit is learning how to code in the callback style required, but you can get lots of help on that here once you've started.
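For a sense of what the HTML::TreeBuilder route can look like, here is a rough sketch that is not taken from the original reply. The input is a made-up, deliberately sloppy table (unclosed `<td>` and `<tr>` tags), and the sketch queries the finished tree with `look_down` rather than writing HTML::Parser callbacks directly.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input with missing closing tags, standing in for
# the "unreliable HTML" described above.
my $html = <<'HTML';
<table>
  <tr><td>CS101<td>09:00
  <tr><td>MA201<td>13:30
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @rows;
for my $tr ( $tree->look_down( _tag => 'tr' ) ) {
    # Collect the trimmed text of every cell in this row.
    my @cells = map { $_->as_trimmed_text } $tr->look_down( _tag => 'td' );
    push @rows, \@cells;
}

$tree->delete;    # free the tree explicitly once we are done with it

print join( ' | ', @$_ ), "\n" for @rows;
```

Note that this simple row loop assumes no nested tables; with nesting, `look_down` on a row would also find cells of inner tables.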
(arturo) Re: Parsing HTML -- as text? by arturo (Vicar) on Mar 03, 2001 at 22:55 UTC
Another way to do it, which might simplify the task:

`lynx -dump http://www.foo.edu/classlist.html > sched.txt`

It might be conceptually simpler to deal with straight text than with HTML.

HTH

Philosophy can be made out of anything. Or less -- Jerry A. Fodor
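Once the page is plain text, ordinary line-by-line matching may be enough. The sketch below is only an illustration: the layout of `sched.txt` is unknown, so the sample lines, the `HH:MM` time format, and the helper name `extract_times` are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: return [text_before_time, time] pairs for
# every line that contains an HH:MM style time.
sub extract_times {
    my (@lines) = @_;
    my @found;
    for my $line (@lines) {
        next unless $line =~ /^\s*(.*?)\s+(\d{1,2}:\d{2})\b/;
        push @found, [ $1, $2 ];
    }
    return @found;
}

# Made-up sample of what the dumped text might look like.
my @sample = (
    "   Class Schedule",
    "",
    "   CS101   Intro to Programming   09:00",
    "   MA201   Linear Algebra         13:30",
);

printf "%-30s %s\n", @$_ for extract_times(@sample);
```

In real use the lines would come from reading `sched.txt` rather than from a hard-coded list, and the pattern would need adjusting to whatever the dump actually contains.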
Re: Parsing HTML by strredwolf (Chaplain) on Mar 04, 2001 at 08:55 UTC
There are many ways of doing it (in order of ease):
1. Use merlyn's suggestion of an HTML module. (But then, merlyn mirrors all of CPAN onto his system...)
2. Parse the output of `lynx -dump file...`. I actually did this for a non-profit journal publisher for an alert system. Verified opt-in, too.
3. Treat the HTML as an XML file and program state engines. Boy, do I like programming state engines...

It seems like you're attacking it via the last method, which can be hell in a handbasket. Try them in order, though. First virtue of Perl.

-- $Stalag99{"URL"}="http://stalag99.keenspace.com";
Re: Parsing HTML by dvergin (Monsignor) on Mar 03, 2001 at 22:37 UTC
It sounds like perhaps you are asking for more than just how to grab data from an HTML table. "I'd need to count the number of TD's"? So the time is not in a TD cell? Hmmm... Perhaps if you showed us a sample of the HTML data.