Re: Parsing Table

There are some parsing idiosyncrasies (for example, the parser treats all tag names as lowercase), but the HTML::TokeParser family is usually quite good for anything SGML-ish.
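
To make the lowercase point concrete, here is a minimal sketch using plain HTML::TokeParser. The sample string is made up for illustration and is not the poster's data. It collects the tag names the parser reports for mixed-case input:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical one-row sample with mixed-case tag names.
my $html = '<Tr><Tc>Gender</Tc></Tr>';
my $p = HTML::TokeParser->new( \$html ) or die "Cannot create parser";

my @names;
while ( my $tag = $p->get_tag )
{
    # Each tag is an array ref; element 0 is the tag name as the parser
    # reports it (lowercased, with a leading "/" for end tags).
    push @names, $tag->[0];
}
print join( ", ", @names ), "\n";
```

So any test against the parsed tag name should use lowercase, while a test against the raw source text has to match the original case.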

Try this. It should already be close to what you want, and it should be pretty obvious how to adapt it. See also: HTML::TokeParser::Simple. (Update: pulled YAML from the sample code; it wasn't there for any reason.)

use warnings;
use strict;
use HTML::TokeParser::Simple;

my $p = HTML::TokeParser::Simple->new(\*DATA);

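# get_tag() skips any text and returns the next tag token.
while ( my $token = $p->get_tag() )
{
    # Tag names are lowercased; (;+\d+)? is meant to accept rows like <Tr;;4>.
    if ( $token->get_tag =~ /\Atr(;+\d+)?\z/ )
    {
        while (1)
        {
            # peek() returns the upcoming source text as-is, so match "Tr" here.
            # Test for an empty string, not truth: a cell containing "0" is false.
            my $next = $p->peek;
            last if !defined $next or $next eq '';
            last if $next =~ /<(Tr\b|endtab)/;
            my $cell = $p->get_token or last;
            print $cell->as_is, " + ";
        }
        print "\n";
    }
}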

__DATA__
<Tr><Tc>PA Group (N <26> 23)<Tc>COM Group (N <26> 24)<Tc>
<Tr>Gender<Tc><Tc><Tc><124><sup>2<reset> test <26> 0.216, <mdit>df<med> <26> 1, <mdit>P<med> <26> 0.642
<Tr><ems>Male (%)<Tc>14<ths>(60.9)<Tc>13<ths>(54.2)<Tc>
<Tr><ems>Female (%)<Tc>9<ths>(39.1)<Tc>11<ths>(45.8)<Tc>
<Tr>Ethnicity<Tc><Tc><Tc><124><sup>2<reset> test <26> 24.99, <mdit>df<med> <26> 4, <mdit>P<med> <178> 0.001
<Tr><ems>African American (%)<Tc>5<ths>(21.7)<Tc>2<ths>(8.3)<Tc>
<Tr><ems>European American (%)<Tc>5<ths>(21.7)<Tc>17<ths>(70.8)<Tc>
<Tr><ems>Asian American (%)<Tc>0<Tc>4<ths>(16.7)<Tc>
<Tr><ems>Hispanic American (%)<Tc>0<Tc>1<ths>(4.2)<Tc>
<Tr><ems>Other (%)<Tc>9<ths>(39.1)<Tc>0<Tc>
<Tr>Age, yr (SD)<Tc>46.05<ths>(6.13)<Tc>30.35<ths>(10.85)<Tc><mdit>t<med> <26> 5.94, <mdit>P<med> <178> 0.001
<Tr>Education, yr (SD)<Tc>11.37<ths>(2.31)<Tc>15.85<ths>(1.75)<Tc><mdit>t<med> <26> 7.41, <mdit>P<med> <178> 0.001
<Tr>Pain Threshold, <28>C (SD)<Tc>48.75<ths>(2.44)<Tc>47.33<ths>(3.24)<Tc>U* <26> 158.0, <mdit>P<med> <26> 0.012
    <Tr;;4><ems>Males (SD) <26> 47.77 (3.35)&dagger;
<Tr;;4><ems>Females (SD) <26> 48.26 (2.36)&dagger;
<Tr;;4><ems>U&Dagger; <26> 244, <mdit>P<med> <26> 0.582<endtab>