in reply to Parsing Table

There are some parsing idiosyncrasies (for example, the parser treats all tag names as lowercase), but the HTML::TokeParser family is usually quite good for anything SGML-ish.
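To make the lowercase point concrete, here is a minimal sketch. The one-row `$markup` string and the `tag_names` helper are made up for illustration; the only assumption about the module is that `get_tag` on a token returns the normalised tag name while `as_is` returns the original text.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical one-row sample in the same style of markup.
my $markup = '<Tr><ems>Male (%)<Tc>14';

# Hypothetical helper: collect [normalised name, original text]
# for every start tag found in the given string.
sub tag_names {
    my ($text) = @_;
    my $p = HTML::TokeParser::Simple->new( \$text );
    my @seen;
    while ( my $token = $p->get_token ) {
        next unless $token->is_start_tag;
        push @seen, [ $token->get_tag, $token->as_is ];
    }
    return @seen;
}

# Print the normalised name next to the text as it appeared in the input.
printf "%-4s %s\n", @$_ for tag_names($markup);
```

If that holds, a pattern matched against `get_tag` should be written in lowercase, while a pattern matched against `as_is` or `peek` text has to use the case found in the document.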

Try this. It should be close to what you want already, and it should be pretty obvious how to adapt. See also: HTML::TokeParser::Simple. (Update: removed YAML from the sample code; it wasn't there for any reason.)

use warnings;
use strict;
use HTML::TokeParser::Simple;

my $p = HTML::TokeParser::Simple->new(\*DATA);

while ( my $token = $p->get_tag() ) {
    if ( $token->get_tag =~ /\Atr(;+\d+)?\z/ ) {
        while ( $p->peek and $p->peek !~ /<(Tr\b|endtab)/ ) {
            my $token = $p->get_token or next;
            print $token->as_is, " + ";
        }
        print "\n";
    }
}

__DATA__
<Tr><Tc>PA Group (N <26> 23)<Tc>COM Group (N <26> 24)<Tc>
<Tr>Gender<Tc><Tc><Tc><124><sup>2<reset> test <26> 0.216, <mdit>df<med> <26> 1, <mdit>P<med> <26> 0.642
<Tr><ems>Male (%)<Tc>14<ths>(60.9)<Tc>13<ths>(54.2)<Tc>
<Tr><ems>Female (%)<Tc>9<ths>(39.1)<Tc>11<ths>(45.8)<Tc>
<Tr>Ethnicity<Tc><Tc><Tc><124><sup>2<reset> test <26> 24.99, <mdit>df<med> <26> 4, <mdit>P<med> <178> 0.001
<Tr><ems>African American (%)<Tc>5<ths>(21.7)<Tc>2<ths>(8.3)<Tc>
<Tr><ems>European American (%)<Tc>5<ths>(21.7)<Tc>17<ths>(70.8)<Tc>
<Tr><ems>Asian American (%)<Tc>0<Tc>4<ths>(16.7)<Tc>
<Tr><ems>Hispanic American (%)<Tc>0<Tc>1<ths>(4.2)<Tc>
<Tr><ems>Other (%)<Tc>9<ths>(39.1)<Tc>0<Tc>
<Tr>Age, yr (SD)<Tc>46.05<ths>(6.13)<Tc>30.35<ths>(10.85)<Tc><mdit>t<med> <26> 5.94, <mdit>P<med> <178> 0.001
<Tr>Education, yr (SD)<Tc>11.37<ths>(2.31)<Tc>15.85<ths>(1.75)<Tc><mdit>t<med> <26> 7.41, <mdit>P<med> <178> 0.001
<Tr>Pain Threshold, <28>C (SD)<Tc>48.75<ths>(2.44)<Tc>47.33<ths>(3.24)<Tc>U* <26> 158.0, <mdit>P<med> <26> 0.012
<Tr;;4><ems>Males (SD) <26> 47.77 (3.35)&dagger;
<Tr;;4><ems>Females (SD) <26> 48.26 (2.36)&dagger;
<Tr;;4><ems>U&Dagger; <26> 244, <mdit>P<med> <26> 0.582<endtab>