Parsing pseudo-HTML with HTML::TokeParser

mp has asked for the wisdom of the Perl Monks concerning the following question:

Is HTML::TokeParser reliable for parsing HTML that contains additional non-HTML tags (<column> </column> in the example input below)?

Example input:

  <column>Colum <b>One</b> Header</column>
  <column>Column <u>Two</u> Header</column>
  <column na="1">Etcetera</column>

The code below seems to work, but I want to make sure there are no gotchas in using tags that look like HTML but aren't valid HTML (things in angle brackets with optional attributes and an optional slash indicating a closing tag). I prefer HTML::TokeParser over XML::TokeParser because the text between the 'column' tags will in general not be well-formed XML.

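use strict;
use warnings;
use HTML::TokeParser;

# Returns an array ref of hashes, one per <column> element, each holding
# the element's attributes plus its raw inner markup under 'label'.
sub parse_column_list {
  my ($str) = @_;
  my $p = HTML::TokeParser->new(\$str);
  my (@cl, $label, %attr);
  my %attr_default = ( na => 0 );
  while (my $t = $p->get_token) {
    if ($t->[0] eq "S" and $t->[1] eq "column") {
      # Start of a column: reset the label, merge attributes over defaults.
      $label = '';
      %attr = (%attr_default, %{$t->[2]});
    } elsif ($t->[0] eq "E" and $t->[1] eq "column") {
      push @cl, { %attr, label => $label };
    } elsif ($t->[0] eq "T") {
      # Text token: element 1 is the text (the last element is a flag).
      $label .= $t->[1];
    } else {
      # Any other token: the last element is the original source text.
      $label .= $t->[-1];
    }
  }
  return \@cl;
}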


Replies are listed 'Best First'.
Re: Parsing pseudo-HTML with HTML::TokeParser by Ovid (Cardinal) on Sep 30, 2002 at 21:20 UTC
I'm not aware of any problems with parsing non-standard HTML. HTML is so mutable and browser-dependent that unless you are using a tool that requires a specific DTD, the code you use should be "fault tolerant", so to speak.

As a side note, I'd recommend HTML::TokeParser::Simple (full disclosure: I wrote it). It makes your code shorter and easier to read. Here's a small demo.

#!/usr/bin/perl -w
use strict;
use HTML::TokeParser::Simple;
use Data::Dumper;

my $pseudo_html;
{
  local $/;
  $pseudo_html = <DATA>;
}

print Dumper parse_column_list( $pseudo_html );

sub parse_column_list {
  my ($str) = @_;
  my $p = HTML::TokeParser::Simple->new(\$str);
  my (@cl, $label, %attr);
  my %attr_default = ( na => 0 );
  while(my $t = $p->get_token) {
    if ( $t->is_start_tag( 'column' ) ) {
      $label = '';
      %attr = (%attr_default, %{$t->return_attr});
    } elsif ( $t->is_end_tag( 'column' ) ) {
      push @cl, { %attr, label => $label };
    } else {
      $label .= $t->return_text;
    }
  }
  return \@cl;
}
__DATA__
<column>Colum <b>One</b> Header</column>
<column>Column <u>Two</u> Header</column>
<column na="1">Etcetera</column>

Cheers,
Ovid
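For anyone who wants to check the "no problems" claim directly, here is a minimal sketch, assuming HTML::TokeParser is installed. The input string and variable names are made up for illustration and are not from the posts above. It walks the token stream for an upper-case pseudo-tag and records what comes back. As far as I know, tag and attribute names are lowercased by default, and text tokens from get_token keep entities undecoded, so both are worth keeping in mind when matching on "column" or storing the label.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical input: an upper-case pseudo-tag, a real HTML tag, an entity.
my $input = '<COLUMN NA="1">Q &amp; <b>A</b></COLUMN>';

my $p = HTML::TokeParser->new(\$input);
my (@start_tags, %column_attr, $text);
$text = '';
while (my $t = $p->get_token) {
    if ($t->[0] eq 'S') {
        # Start token: [ "S", $tag, \%attr, \@attrseq, $original_text ]
        push @start_tags, $t->[1];
        %column_attr = %{ $t->[2] } if $t->[1] eq 'column';
    } elsif ($t->[0] eq 'T') {
        # Text token: [ "T", $text, $is_data ]; entities are left as written
        $text .= $t->[1];
    }
}

print "start tags: @start_tags\n";
print "column attributes: ", join(', ', map { "$_=$column_attr{$_}" } sort keys %column_attr), "\n";
print "text: $text\n";
```

If the label is meant for display rather than for re-emitting as markup, decoding it afterwards with HTML::Entities::decode_entities is one option.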
Re: Parsing pseudo-HTML with HTML::TokeParser by Helter (Chaplain) on Sep 30, 2002 at 17:54 UTC
Reading about HTML::Parser:

  As markup and text is recognized, handlers are invoked. The following method is used to set up handlers for different events:

So I would assume that as long as there are no handlers assigned to those tags, they would be ignored. On the other hand, you might want to define handlers for this code to make your processing life easier. I'm new to using these tools and have never used this one in particular, so I'm just stating what I read; someone else could probably provide tested code/answers. Hope this helps!
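A small sketch of the handler approach, assuming HTML::Parser with its version 3 API. The @events array and the input string are illustrative only. One detail it is meant to show: handlers are registered per event type (start, end, text and so on), not per tag name, so a made-up tag such as <column> reaches the same start and end handlers as <b>. Events with no handler registered, here the text, are simply skipped.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

my @events;

# Register handlers for start and end tags only; each receives the tag name.
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ sub { push @events, "start:$_[0]" }, 'tagname' ],
    end_h       => [ sub { push @events, "end:$_[0]" },   'tagname' ],
);

$parser->parse('<column na="1">Etcetera <b>x</b></column>');
$parser->eof;

# Prints: start:column start:b end:b end:column
print join(' ', @events), "\n";
```

To treat <column> specially, the handler itself would have to test the tag name, much as the get_token loop in the original question does.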
Re: Parsing pseudo-HTML with HTML::TokeParser by mp (Deacon) on Oct 02, 2002 at 15:56 UTC
Thank you for the replies, and thanks for the pointer to HTML::TokeParser::Simple. It does improve the code's readability.