RatArsed has asked for the wisdom of the Perl Monks concerning the following question:

I have a behemoth of a task to do, i.e. taking a collection of HTML files, parsing specific information out of them, modifying them to add some of this new improved data and saving them back out again, before finally changing the last modified date.

Erk.

My first approach was to use a behemoth of a regular expression (so I can do things like getting the contents of the cell after the one that says "Date" in it). But I was sat there thinking, shouldn't HTML::Parser offer a nicer approach? But can I find any example code? Can I heck... Plenty of references saying stuff like "use HTML::Parser", but that's about as helpful as going to sit in the corner and making chicken noises for an hour...

--
RatArsed

Replies are listed 'Best First'.
Re: HTML::Parser example wanted...
by andreychek (Parson) on Jun 26, 2001 at 19:19 UTC
    Actually, there are a bunch of examples that come with the HTML::Parser module, found in the "eg" directory. Taking the code from there, here is an example of how to parse all the text from an HTML document:
    #!/usr/bin/perl -w

    # Extract all plain text from an HTML file

    use strict;
    use HTML::Parser 3.00 ();

    my %inside;

    sub tag {
        my ($tag, $num) = @_;
        $inside{$tag} += $num;
        print " ";    # not for all tags
    }

    sub text {
        return if $inside{script} || $inside{style};
        print $_[0];
    }

    HTML::Parser->new(
        api_version => 3,
        handlers    => [
            start => [\&tag,  "tagname, '+1'"],
            end   => [\&tag,  "tagname, '-1'"],
            text  => [\&text, "dtext"],
        ],
        marked_sections => 1,
    )->parse_file(shift) || die "Can't open file: $!\n";
    That code is located in eg/htext. As you can see, it is event driven. The "handlers" option to HTML::Parser->new tells the parser which function to call for each kind of event. Here, every start tag calls the function "tag" with two arguments: "tagname", which is the actual tag name, and the literal +1, which marks it as a start tag; every end tag calls it with -1 instead. The running total kept in %inside is what lets "text" skip anything that sits inside a script or style element.
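
    The same handler mechanism can be pointed at the "cell after the one that says Date" problem from the question. What follows is only a minimal sketch: the function name cell_after and the sample table are made up for illustration, and it assumes the label and its value sit in adjacent td cells.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser 3.00 ();

# Hypothetical helper: return the text of the <td> that follows the
# first <td> whose trimmed text equals $label, or undef if none.
sub cell_after {
    my ($html, $label) = @_;

    my $in_td = 0;     # true while the parser is inside a <td>
    my $cell  = '';    # text collected for the current cell
    my $want  = 0;     # true once the label cell has been seen
    my $found;         # the result, set at most once

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag) = @_;
            if ($tag eq 'td') { $in_td = 1; $cell = ''; }
        }, 'tagname' ],
        # Text may arrive in several chunks, so append rather than assign.
        text_h => [ sub { $cell .= $_[0] if $in_td }, 'dtext' ],
        end_h => [ sub {
            my ($tag) = @_;
            return unless $tag eq 'td' && $in_td;
            $in_td = 0;
            (my $text = $cell) =~ s/^\s+|\s+$//g;
            if ($want && !defined $found) {
                $found = $text;          # this is the cell after the label
            }
            elsif (!defined $found && $text eq $label) {
                $want = 1;               # the next cell is the one we want
            }
        }, 'tagname' ],
    );
    $p->parse($html);
    $p->eof;
    return $found;
}

# Made-up sample input.
my $html = '<table><tr><td>Name</td><td>Report</td></tr>'
         . '<tr><td>Date</td><td>26 June 2001</td></tr></table>';

my $date = cell_after($html, 'Date');
print "Date: $date\n" if defined $date;
```

    The state variables do the job the regular expression was doing, but they keep working when the markup gains attributes or extra whitespace.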

    Personally, I have had more luck with HTML::TokeParser, but that isn't the case for everyone I'm sure. I find that HTML::TokeParser is a bit more intuitive for this sort of job, but that is perhaps just the way I think.. or maybe I just wasn't using it right ;-) In any case, good luck.
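
    For comparison, here is roughly what the "cell after Date" lookup might look like with HTML::TokeParser. Treat it as a sketch under assumptions: cell_after_label and the sample HTML are invented for the example, and it assumes the value lives in the td immediately after the label.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: walk the <td> tags in order and return the text
# of the cell that follows the one whose text equals $label.
sub cell_after_label {
    my ($html, $label) = @_;

    # A scalar reference tells HTML::TokeParser to parse the string itself.
    my $p = HTML::TokeParser->new(\$html)
        or die "Can't create parser: $!";

    while ($p->get_tag('td')) {
        my $text = $p->get_trimmed_text('/td');
        next unless $text eq $label;

        # Found the label; the next <td>, if any, holds the value.
        $p->get_tag('td') or return;
        return $p->get_trimmed_text('/td');
    }
    return;    # label not present
}

# Made-up sample input.
my $html = '<table><tr><td>Name</td><td>Report</td></tr>'
         . '<tr><td>Date</td><td>26 June 2001</td></tr></table>';

my $date = cell_after_label($html, 'Date');
print "Date: $date\n" if defined $date;
```

    Because you pull tokens in your own loop instead of receiving callbacks, there is no state to carry between handlers, which is probably why it feels more intuitive.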
    -Eric
Re: HTML::Parser example wanted...
by LD2 (Curate) on Jun 26, 2001 at 19:17 UTC
Re: HTML::Parser example wanted...
by larsen (Parson) on Jun 26, 2001 at 20:03 UTC
Re: HTML::Parser example wanted...
by princepawn (Parson) on Jun 26, 2001 at 21:15 UTC
    I found HTML::TokeParser (part of the HTML::Parser distribution) to be easier to use but in this case I used HTML::TreeBuilder.

    This example reads a 2x2 table.

    #!/usr/local/bin/perl
    use Data::Dumper;
    use HTML::TreeBuilder;
    use strict;

    die "must input filename" unless @ARGV;

    foreach my $file_name (@ARGV) {
        my $tree = HTML::TreeBuilder->new;    # empty tree
        $tree->parse_file($file_name);
        print "Hey, here's a dump of the parse tree of $file_name:\n";
        # $tree->dump;    # a method we inherit from HTML::Element

        my %table;
        (   $table{root}, $table{cond}, $table{'cond-alternatives'},
            $table{action}, $table{'action-entries'}
        ) = $tree->find_by_tag_name('table');

        my %td;
        map { $td{$_} = [ $table{$_}->find_by_tag_name('td') ] } (keys %table);

        my %x;
        map {
            my $field = $_;
            map { push @{ $x{$field} }, $_->content_array_ref } @{ $td{$_} }
        } (keys %td);

        printf "cond-alt has %s", Dumper $x{'cond-alternatives'};

        # Now that we're done with it, we must destroy it.
        $tree = $tree->delete;
    }
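
    Since the original task also involves changing the files, saving them and resetting the last modified date, a tree is handy because it can be edited and then written back out. The following is a rough sketch only: annotate_cell_after, the sample HTML, the " (checked)" note and the timestamp are all invented for illustration, and a temporary file stands in for the real HTML file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::TreeBuilder;

# Hypothetical helper: append $note to the <td> that follows the <td>
# whose text equals $label, and return the modified document as HTML.
sub annotate_cell_after {
    my ($html, $label, $note) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my ($label_cell) = grep { $_->as_trimmed_text eq $label }
                       $tree->look_down(_tag => 'td');
    if ($label_cell) {
        # right() in list context returns all later siblings;
        # keep only element nodes, skipping any bare text.
        my ($value_cell) = grep { ref } $label_cell->right;
        $value_cell->push_content($note) if $value_cell;
    }

    # The empty hash asks for every end tag to be written out.
    my $out = $tree->as_HTML(undef, undef, {});
    $tree->delete;    # free the tree once we are done with it
    return $out;
}

# Made-up sample input.
my $html = '<html><body><table><tr><td>Date</td>'
         . '<td>26 June 2001</td></tr></table></body></html>';

my $new_html = annotate_cell_after($html, 'Date', ' (checked)');

# Save the result; a temp file stands in for the real file here.
my ($fh, $file) = tempfile(UNLINK => 1);
print {$fh} $new_html;
close $fh or die "Can't close $file: $!";

# Set access and modification times (arbitrary epoch seconds).
my $stamp = 993583140;
utime $stamp, $stamp, $file or die "Can't set times on $file: $!";
printf "%s now has mtime %d\n", $file, (stat $file)[9];
```

    Note that TreeBuilder normalises the markup as it parses, so the saved file will not be byte-for-byte identical to the input outside the edited cell.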
Re: HTML::Parser example wanted...
by Beatnik (Parson) on Jun 26, 2001 at 21:34 UTC
    If davorg permits me to quote Data Munging with Perl, chapter 9, page 165:
    #!/usr/bin/perl -w
    use strict;
    use HTML::Parser;
    use LWP::Simple;

    sub start {
        my ($tag, $attr, $attrseq) = @_;
        print "Found $tag\n";
        foreach (@$attrseq) {
            print " [$_ -> $attr->{$_}]\n";
        }
    }

    my $h = HTML::Parser->new(start_h => [\&start, 'tagname, attr, attrseq']);
    my $page = get(shift);
    $h->parse($page);
    Greetz
    Beatnik
    ... Quidquid perl dictum sit, altum viditur.
Re: HTML::Parser example wanted...
by Graham (Deacon) on Jun 26, 2001 at 19:13 UTC
      I'd be after an example because the documentation isn't really clear that it can do what I want...

      Unfortunately, the other node to which you refer in turn refers to version 2 of the parser (and I have 3, which, I believe, works differently) and to TPJ, which is, er, closed...

      --
      RatArsed