s_club_seven has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to read in an HTML file that contains a table. It's a college timetable. I want to extract the information for a particular time, but the information is generated in the HTML file as just a normal table. So I'd need to count the number of TDs to get to a certain time and then read the contents (or something). I have _no_ idea how to do this. Any help?

Replies are listed 'Best First'.
Re: Parsing HTML
by merlyn (Sage) on Mar 03, 2001 at 22:33 UTC
Re: Parsing HTML
by TheoPetersen (Priest) on Mar 03, 2001 at 22:40 UTC
    When faced with this kind of task, a lot of Perl coders:
    • see that HTML is not that hard, and figure on parsing it manually;
    • find out that HTML is deceptive (or the person or process that writes the file writes lousy HTML) and figure on using a tool;
    • discover HTML::Parser, read the doc and say "that's too hard!"
    • go back to parsing it manually and come up with something that works as long as nothing ever changes.
    At least, that's how me and my co-workers did it once :)

    So as a result, I'd suggest looking at HTML::Parser or one of its relatives. I used HTML::TreeBuilder to parse some quite large and unreliable HTML files and found that it worked great. The tricky bit is learning how to code in the callback style required, but you can get lots of help on that here once you've started.
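For anyone unsure what "callback style" looks like, here is a minimal sketch. It is not Perl and not HTML::Parser: it uses Python's standard html.parser module, which works on the same idea (the parser calls your methods as it meets start tags, end tags and text). The class name TableCells, the helper parse_table and the SAMPLE timetable are all made up for illustration, and the sketch assumes a single, non-nested table.

```python
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row being built, or None when outside a row
        self._cell = None   # text pieces of the open cell, or None

    # --- callbacks invoked by the parser ---------------------------------
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()       # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th"):
            self._close_cell()      # tolerate a missing </td>
            if self._row is None:   # cell without an enclosing <tr>
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:  # ignore text outside cells
            self._cell.append(data)

    # --- bookkeeping -------------------------------------------------------
    def _close_cell(self):
        if self._cell is not None:
            # join the pieces and collapse runs of whitespace
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return the table as a list of rows (lists of cell strings)."""
    parser = TableCells()
    parser.feed(html)
    parser.close()
    parser._close_row()             # flush a row left open at end of input
    return parser.rows


# Hypothetical timetable; the last row deliberately omits closing tags.
SAMPLE = """
<table>
<tr><th>Time</th><th>Mon</th><th>Tue</th></tr>
<tr><td>09:00</td><td>Maths</td><td>Physics</td></tr>
<tr><td>10:00<td>Perl<td>
</table>
"""
```

Once the rows are collected, "counting TDs" becomes ordinary list indexing: parse_table(SAMPLE)[1][2] is the Tuesday cell of the 09:00 row. A dictionary keyed on the first column would let you look a row up by time instead of by position.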

(arturo) Re: Parsing HTML -- as text?
by arturo (Vicar) on Mar 03, 2001 at 22:55 UTC

    Another way to do it that might simplify the task:

    lynx -dump http://www.foo.edu/classlist.html > sched.txt
    It might be conceptually simpler to deal with straight text than with HTML.
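A rough sketch of what working on the text could look like, written in Python rather than Perl. It assumes a layout that may not match what lynx actually produces for your page: one table row per line, columns separated by two or more spaces, and the time in the first column. The function name row_for_time and the sample text are invented for illustration; check a real sched.txt before relying on any of this.

```python
import re


def row_for_time(dump_text, time_label):
    """Return the columns of the first line whose first column is time_label.

    Assumes columns are separated by runs of two or more whitespace
    characters, so a cell such as "Computer Science" stays in one piece.
    Returns None when no line matches.
    """
    for line in dump_text.splitlines():
        columns = re.split(r"\s{2,}", line.strip())
        if columns[0] == time_label:
            return columns
    return None


# Hypothetical contents of sched.txt
SAMPLE_DUMP = """
   Time    Mon                 Tue
   09:00   Maths               Physics
   10:00   Computer Science    Perl
"""
```

One caveat with this approach: an empty cell leaves no trace in the text, so the columns to its right shift left. If the timetable has gaps, slicing each line at fixed character offsets is likely to be more reliable than splitting on whitespace.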

    HTH

    Philosophy can be made out of anything. Or less -- Jerry A. Fodor

Re: Parsing HTML
by strredwolf (Chaplain) on Mar 04, 2001 at 08:55 UTC
    There are many ways of doing it (in order of ease):

    • Use merlyn's suggestion of an HTML module. (But then, merlyn mirrors all of CPAN onto his system...)
    • Parsing the output of lynx -dump file. I actually did this for a non-profit journal publisher for an alert system. Verified opt-in, too.
    • Treating the HTML as an XML file and programming state engines. Boy, do I like programming state engines...
    It seems like you're attacking it via the last method, which can be hell in a handbasket. Try them in order, though. First virtue of Perl.
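To make the state-engine option concrete, here is a deliberately small sketch in Python (not Perl). It splits the input into tags and text with a regex, then walks the tokens with two states, counting TR and TD/TH tags until it reaches the wanted cell. The name cell_at and its zero-based row/column arguments are made up for this example. It does not decode entities, and a comment or attribute value containing ">" will confuse it, which is a fair illustration of why this route comes last.

```python
import re

TOKEN = re.compile(r"<[^>]*>|[^<]+")          # a tag, or a run of text
TAG_NAME = re.compile(r"</?\s*([a-zA-Z0-9]+)")
CELL_ENDERS = ("td", "th", "tr", "table")     # any of these ends an open cell


def cell_at(html, want_row, want_col):
    """Return the text of the cell at zero-based (want_row, want_col), or None."""
    state = "OUTSIDE"      # either "OUTSIDE" or "IN_CELL"
    row, col = -1, -1      # indices of the most recently opened row / cell
    text = []              # text pieces of the open cell

    def finished():
        return " ".join("".join(text).split())

    for token in TOKEN.findall(html):
        if token.startswith("<"):
            match = TAG_NAME.match(token)
            name = match.group(1).lower() if match else ""
            closing = token.startswith("</")

            # Leaving a cell: triggered by its close tag or by the next
            # structural tag, so missing </td> tags are tolerated.
            if state == "IN_CELL" and name in CELL_ENDERS:
                if (row, col) == (want_row, want_col):
                    return finished()
                state = "OUTSIDE"

            if not closing:
                if name == "tr":
                    row += 1
                    col = -1
                elif name in ("td", "th"):
                    col += 1
                    text = []
                    state = "IN_CELL"
        elif state == "IN_CELL":
            text.append(token)

    # Input ended while the wanted cell was still open
    if state == "IN_CELL" and (row, col) == (want_row, want_col):
        return finished()
    return None
```

With a header row at index 0 and the times in column 0, cell_at(html, 1, 2) would fetch the third column of the first data row. Every new wrinkle in the real HTML (nested tables, colspan, comments) means another state or another special case.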

    --
    $Stalag99{"URL"}="http://stalag99.keenspace.com";

Re: Parsing HTML
by dvergin (Monsignor) on Mar 03, 2001 at 22:37 UTC
    It sounds like perhaps you are asking for more than just how to grab data from an HTML table.

    "I'd need to count the number of TD's"? So the time is not in a TD cell? Hmmm... Perhaps if you showed us a sample of the HTML data.