Yes and yes. Like I said above, it was a simple oversight on my part. Here is a more "clear" example ;D The depth 1 nodes are described in __END__. Since my code is specific to the task of extracting depth 1 nodes (now that I have appropriately ensured that), I like it better than demerphq's. Don't get me wrong, I like his tree; it's more generic and probably more useful, but for this particular task, it's HTML::TokeParser to the rescue.

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TokeParser;

my $url ="http://perlmonks.org/index.pl?node_id=110166";

my $rawHTML = get($url); # attempt to d/l the page to mem

# get() returns undef on failure; LWP::Simple does not report the reason in $!
die "LWP::Simple couldn't fetch $url" unless defined $rawHTML;

my ($tp , %monks );
# $tp is undef if new() fails, so keep it out of the error message
$tp = HTML::TokeParser->new(\$rawHTML) or die "Couldn't create HTML::TokeParser: $!";

# And now -- a generic HTML::TokeParser loop

while (my $t = $tp->get_token)
{
  if(
     ($$t[0] eq "S") and
     ($$t[1] eq "tr") and
     (exists $$t[2]->{bgcolor} and $$t[2]->{bgcolor} eq "eeeeee")
    )
  {
    my @t = (
                $t,# 0 <TR BGCOLOR=eeeeee>
    $tp->get_token,# 1 <TD colspan=2>
    $tp->get_token,# 2 <font size=2>
    $tp->get_token,# 3 <A HREF="/index.pl?node_id=110171&lastnode_id=110166">
    $tp->get_token,# 4 Re: Name Space
    $tp->get_token,# 5 </A>
    $tp->get_token,# 6 <BR>
    $tp->get_token,# 7  by 
    $tp->get_token,# 8 <A HREF="/index.pl?node_id=1936&lastnode_id=110166">
    $tp->get_token,# 9 japhy
    $tp->get_token,#10 </A>
    $tp->get_token,#11  on Sep 04, 2001 at 13:42
    $tp->get_token,#12 </font>
    $tp->get_token,#13 </TD>
    $tp->get_token,#14 </tr>
    );


    if(
       ($t[0][0] eq "S" and $t[0][1] eq "tr"
              and $t[0][2]->{'bgcolor'} eq "eeeeee") and

       ($t[1][0] eq "S" and $t[1][1] eq "td") and
       ($t[2][0] eq "S" and $t[2][1] eq "font") and
       ($t[3][0] eq "S" and $t[3][1] eq "a") and # reply link
       ($t[4][0] eq "T") and # reply to original node
       ($t[5][0] eq "E" and $t[5][1] eq "a") and
       ($t[6][0] eq "S" and $t[6][1] eq "br") and
       ($t[7][0] eq "T" and $t[7][1] =~ /by/ ) and
       ($t[8][0] eq "S" and $t[8][1] eq "a") and # userlink
       ($t[9][0] eq "T" ) and # username
       ($t[10][0] eq "E" and $t[10][1] eq "a") and
       ($t[11][0] eq "T" and $t[11][1] =~ /on \w{3} \d{2}, \d{4} at/) and
       ($t[12][0] eq "E" and $t[12][1] eq "font") and
       ($t[13][0] eq "E" and $t[13][1] eq "td") and
       ($t[14][0] eq "E" and $t[14][1] eq "tr")
      )
    {
       print $t[3][4], # a href
             $t[9][1], # monk name
             "</A>|\n";

       $monks{$t[9][1]}= "$t[3][4]" . "$t[9][1]</A>";
    }
  }
} # end of while (my $t = $tp->get_token)

undef $rawHTML; # no more raw html
undef $tp;      # destroy the HTML::TokeParser object (don't need it no more)

print "<H1> or sorted </H1>\n";

for my $key (sort keys %monks)
{
    print $monks{$key},"|\n";
}


__END__
## one token per line
<TR BGCOLOR=eeeeee>
<TD colspan=2>
<font size=2>
<A HREF="/index.pl?node_id=110171&lastnode_id=110166">
Re: Name Space
</A>
<BR>
 by 
<A HREF="/index.pl?node_id=1936&lastnode_id=110166">
japhy
</A>
 on Sep 04, 2001 at 13:42
</font>
</TD>
</tr>

and the output

japhy| tilly| ichimunki| runrig| demerphq| shotgunefx| Masem| synapse0| agent00013| MrNobo1024| Corion| Zaxo| idnopheq| dragonchild| herveus| wine| TheoPetersen| toadi| dga| mexnix| cadfael| buckaduck| ybiC| {NULE}| theorbtwo| Jouke| gregor42| Guildenstern| sifukurt| CubicSpline| jackdied| suaveant| poqui| mikeB| davis| s173451000| PotPieMan| mr_mischief| earthboundmisfit| kwoff| Arguile| chaoticset| BrentDax| Aighearach| basicdez| brianarn| BooK| riffraff| seanbo| Maestro_007| stefan k| dthacker| Hero Zzyzzx| beretboy| Veachian64| giulienk| blakem| Chmrr|

or sorted

Aighearach| Arguile| BooK| BrentDax| Chmrr| Corion| CubicSpline| Guildenstern| Hero Zzyzzx| Jouke| Maestro_007| Masem| MrNobo1024| PotPieMan| TheoPetersen| Veachian64| Zaxo| agent00013| basicdez| beretboy| blakem| brianarn| buckaduck| cadfael| chaoticset| davis| demerphq| dga| dragonchild| dthacker| earthboundmisfit| giulienk| gregor42| herveus| ichimunki| idnopheq| jackdied| japhy| kwoff| mexnix| mikeB| mr_mischief| poqui| riffraff| runrig| s173451000| seanbo| shotgunefx| sifukurt| stefan k| suaveant| synapse0| theorbtwo| tilly| toadi| wine| ybiC| {NULE}|
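
The sorted list comes out uppercase-first because a plain sort compares strings by character code, so Zaxo lands before agent00013. If a case-insensitive order is wanted instead, a minimal sketch could look like the following. The %monks contents here are a made-up sample with placeholder values, not the real hash the script builds.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical sample standing in for the %monks hash built by the script;
# the values are shortened placeholders, not real links.
my %monks = map { $_ => "<A>$_</A>" } qw(japhy Zaxo tilly BooK agent00013);

# Default sort: compares by character code, so uppercase names come first.
my @plain = sort keys %monks;

# Case-insensitive sort: compare lowercased copies, and fall back to the
# plain comparison so names differing only in case keep a fixed order.
my @folded = sort { lc($a) cmp lc($b) or $a cmp $b } keys %monks;

print join("| ", @plain),  "|\n";
print join("| ", @folded), "|\n";
```

The hash keys themselves are left untouched, so the links stored as values still carry the names as the monks wrote them.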

update:

Right, but like I said, I'm deliberately matching only replies of depth 1, which all do conform (only 2nd level replies got the ul bug, and if I were parsing them, I'd just have the improper HTML in there regardless). I saw what you did ;D
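
The depth 1 test leans on the shape of the arrays that get_token returns, as documented for HTML::TokeParser: start tags come back as ["S", $tag, $attr, $attrseq, $text], end tags as ["E", $tag, $text], and text as ["T", $text, $is_data]. Tag and attribute names are lowercased, while $text keeps the markup as written, which is why the script can compare against "tr" and still print $t[3][4] as the original link tag. A minimal sketch, assuming a cut-down inline row (the $html string and the @seen array are made up for illustration, not taken from the real page):

```perl
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;

# Inline sample loosely modeled on the depth 1 row listed under __END__;
# it is shortened for illustration and not fetched from the site.
my $html = '<TR BGCOLOR=eeeeee><TD colspan=2>'
         . '<A HREF="/index.pl?node_id=1936">japhy</A></TD></tr>';

my $tp = HTML::TokeParser->new(\$html) or die "Couldn't create parser";

my @seen;
while (my $t = $tp->get_token) {
    my $type = $t->[0];
    if ($type eq "S") {
        # start tag: ["S", $tag, \%attr, \@attrseq, $original_text]
        push @seen, "S $t->[1] $t->[4]";
    }
    elsif ($type eq "E") {
        # end tag: ["E", $tag, $original_text]
        push @seen, "E $t->[1] $t->[2]";
    }
    elsif ($type eq "T") {
        # text: ["T", $text, $is_data]
        push @seen, "T $t->[1]";
    }
}

print "$_\n" for @seen;
```

Each line printed shows the token type, the lowercased tag name (or the text itself), and for tags the markup exactly as it appeared in the input.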

___crazyinsomniac_______________________________________
Disclaimer: Don't blame. It came from inside the void
perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"

In reply to Re: Re: (crazyinsomniac) Re: Extract info from HTML by crazyinsomniac
in thread Extract info from HTML by George_Sherston
