Re: Parsing pseudo-HTML with HTML::TokeParser

I'm not aware of any problems parsing non-standard HTML with HTML::TokeParser. HTML is so mutable and browser-dependent that, unless you are using a tool that requires a specific DTD, the code you use should be "fault tolerant", so to speak. As a side note, I'd recommend HTML::TokeParser::Simple (full disclosure: I wrote it). It makes your code shorter and easier to read. Here's a small demo.

#!/usr/bin/perl -w
use strict;
use HTML::TokeParser::Simple;
use Data::Dumper;

my $pseudo_html;
{
    local $/;
    $pseudo_html = <DATA>;
}

print Dumper parse_column_list( $pseudo_html );

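# Returns an array ref with one hash per <column> element:
# its attributes (with defaults filled in) plus a 'label' key.
sub parse_column_list {
  my ($str) = @_;
  my $p = HTML::TokeParser::Simple->new(\$str);
  my (@cl, $label, %attr);
  my %attr_default = ( na => 0 );
  while (my $t = $p->get_token) {
    if ( $t->is_start_tag( 'column' ) ) {
      # New column: reset the label and merge its attributes over the defaults.
      $label = '';
      %attr = (%attr_default, %{$t->return_attr});
    }
    elsif ( $t->is_end_tag( 'column' ) ) {
      # Column finished: store a copy of what was collected.
      push @cl, { %attr, label => $label };
    }
    else {
      # Anything else is appended as it appeared in the source,
      # so inline tags such as <b> stay in the label.
      $label .= $t->return_text;
    }
  }
  return \@cl;
}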

__DATA__
  <column>Column <b>One</b> Header</column>
  <column>Column <u>Two</u> Header</column>
  <column na="1">Etcetera</column>
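For comparison, here is a rough sketch of what the same loop might look like against plain HTML::TokeParser, where each token is an array ref and you have to remember which slot holds what. The sub name parse_column_list_plain is just something I made up for illustration, and the slot positions are the ones given in the HTML::TokeParser documentation for get_token. Treat it as a sketch, not as a drop-in replacement.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical name, for illustration only: the same logic as the demo,
# but using the raw array-ref tokens from HTML::TokeParser.
sub parse_column_list_plain {
    my ($str) = @_;
    my $p = HTML::TokeParser->new(\$str);
    my (@cl, $label, %attr);
    my %attr_default = ( na => 0 );
    while (my $t = $p->get_token) {
        my $type = $t->[0];
        if ( $type eq 'S' and $t->[1] eq 'column' ) {
            # Start tag: ["S", $tag, $attr, $attrseq, $text]
            $label = '';
            %attr  = ( %attr_default, %{ $t->[2] } );
        }
        elsif ( $type eq 'E' and $t->[1] eq 'column' ) {
            # End tag: ["E", $tag, $text]
            push @cl, { %attr, label => $label };
        }
        else {
            # Text tokens are ["T", $text, $is_data]; for the other
            # token types the original source text is the last element.
            $label .= $type eq 'T' ? $t->[1] : $t->[-1];
        }
    }
    return \@cl;
}

# Small usage example with made-up input.
my $cols = parse_column_list_plain(
    '<column>A <b>B</b></column><column na="1">C</column>'
);
print "$_->{label} (na=$_->{na})\n" for @$cols;
```

The logic is identical; the difference is only that is_start_tag, is_end_tag and friends replace the index juggling.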

Cheers,
Ovid

Join the Perlmonks Setiathome Group or just click on the link and check out our stats.
