Re: Re: (crazyinsomniac) Re: Extract info from HTML by demerphq (Chancellor) on Nov 12, 2001 at 15:01 UTC
Well, even though this wasn't addressed to me: mine will extract all the above information, just change the following lines:

    print "($depth)$monkname posted '$title' on $date\n";
    $hashref->{$monkname}->{$node_id}={
                                       date=>$date,
                                       title=>$title,
                                       depth=>$depth
                                      };

Then you can extract whatever you want.

    $VAR1 = {
              'demerphq' => {
                              '110238' => {
                                            'depth' => '13',
                                            'title' => 'Corions Name Space',
                                            'date' => 'Sep 05, 2001 at 01:04'
                                          },
                              'Home' => '108447',
                              '110195' => {
                                            'depth' => '12',
                                            'title' => 'Re: Name Space',
                                            'date' => 'Sep 04, 2001 at 15:46'
                                          }
                            },
              'George_Sherston' => {
                                     'Home' => '103111',
                                     '124767' => {
                                                   'depth' => '13',
                                                   'title' => 'Re: Re: Name Space',
                                                   'date' => 'Nov 11, 2001 at 22:33'
                                                 },
                                     'Name Space' => {
                                                       'depth' => '9',
                                                       'title' => 'Name Space',
                                                       'date' => 'Sep 04, 2001 at 13:33'
                                                     },
                                     '121046' => {
                                                   'depth' => '14',
                                                   'title' => 'Re: Re: Re: Name Space',
                                                   'date' => 'Oct 24, 2001 at 01:21'
                                                 },
                                     '117665' => {
                                                   'depth' => '13',
                                                   'title' => 'Re: TheOrbTwo\'s Name Space',
                                                   'date' => 'Oct 09, 2001 at 00:05'
                                                 },
                                     '117303' => {
                                                   'depth' => '13',
                                                   'title' => 'Re: Re: Name Space',
                                                   'date' => 'Oct 07, 2001 at 03:57'
                                                 },
                                     '110244' => {
                                                   'depth' => '13',
                                                   'title' => 'Re: Re: Name Space',
                                                   'date' => 'Sep 05, 2001 at 01:58'
                                                 },
                                     '122854' => {
                                                   'depth' => '13',
                                                   'title' => 'Re: Re: Name Space',
                                                   'date' => 'Nov 02, 2001 at 08:07'
                                                 }
                                   },
            };

Note that the depths are as follows: 9 root node, 12 reply, 13 reply to a reply...

But a thought: you don't want the posts from just a fixed depth in the parse tree. That would, for instance, eliminate you from the list (you don't have a reply to yourself), as well as anyone who explained their name in a reply to another person's explanation. demerphq would be an example, though I believe there are more as well.

Actually, one of the more interesting issues with this thread was accurately picking up all names from all levels: there is an annoying habit of `<UL>` tags messing up the pattern, and the main post is marked up differently.

Anyway, I'll revisit this a bit later :-)

Yves / DeMerphq
--
Have you registered your Name Space?
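A minimal sketch of the "extract whatever you want" step, assuming a hash shaped like the dump above. The sub name `posts_by_monk`, the `min_depth` option and the trimmed sample data are illustrative only, not part of demerphq's script. It walks every monk, skips the scalar 'Home' entry, and by default keeps posts from all depths, so names explained in a reply to a reply aren't lost.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: flatten { monk => { node_id => {...}, Home => id } }
# into a list of post records, optionally keeping only deeper posts.
sub posts_by_monk {
    my ($monks, %opt) = @_;
    my @posts;
    for my $monk (sort keys %$monks) {
        for my $key (sort keys %{ $monks->{$monk} }) {
            my $post = $monks->{$monk}{$key};
            next unless ref($post) eq 'HASH';    # skips 'Home' => id
            next if defined $opt{min_depth}
                 && $post->{depth} < $opt{min_depth};
            push @posts, { monk => $monk, node => $key, %$post };
        }
    }
    return @posts;
}

# Sample data copied (and trimmed) from the dump above.
my $monks = {
    demerphq => {
        Home     => '108447',
        '110195' => { depth => 12, title => 'Re: Name Space',
                      date  => 'Sep 04, 2001 at 15:46' },
        '110238' => { depth => 13, title => 'Corions Name Space',
                      date  => 'Sep 05, 2001 at 01:04' },
    },
    George_Sherston => {
        Home         => '103111',
        'Name Space' => { depth => 9, title => 'Name Space',
                          date  => 'Sep 04, 2001 at 13:33' },
    },
};

for my $p (posts_by_monk($monks)) {
    print "($p->{depth}) $p->{monk} posted '$p->{title}' on $p->{date}\n";
}
```

Calling `posts_by_monk($monks, min_depth => 13)` would keep only replies to replies and deeper, which is the opposite filter from matching depth 1 only.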
Re: Re: Re: (crazyinsomniac) Re: Extract info from HTML by George_Sherston (Vicar) on Nov 12, 2001 at 15:09 UTC
Those whom the gods would destroy they first interest them in parsing natural language... really, in order to come up with a satisfactory solution to this, we're going to have to find a way to distinguish between the content of nodes... we need a script that can make an intelligent guess whether the node is a response or an etymology. This is a bit too rich for my blood, but I look forward to seeing it done :)

§ George Sherston
Re: Re: (crazyinsomniac) Re: Extract info from HTML by crazyinsomniac (Prior) on Nov 12, 2001 at 15:27 UTC
Yes and yes. Like I said above, it was a simple oversight on my part. Here is a more "clear" example ;D The depth 1 nodes are described in __END__.

Since my code is specific to the task of extracting depth 1 nodes (now that I have appropriately ensured that), I like it better than demerphq's. Don't get me wrong, I like his tree, it's more generic and probably more useful, but for this particular task, it's HTML::TokeParser to the rescue:

    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;
    use HTML::TokeParser;

    my $url = "http://perlmonks.org/index.pl?node_id=110166";

    my $rawHTML = get($url); # attempt to d/l the page to mem

    die "LWP::Simple messed up fetching $url" unless $rawHTML;

    my ($tp , %monks );
    $tp = HTML::TokeParser->new(\$rawHTML) or die "WTF \$tp gone bad: $!";

    # And now -- a generic HTML::TokeParser loop

    while (my $t = $tp->get_token)
    {
      if(
         ($$t[0] eq "S") and
         ($$t[1] eq "tr") and
         (exists $$t[2]->{bgcolor} and $$t[2]->{bgcolor} eq "eeeeee")
        )
      {
        # grab the next 14 tokens so the whole row can be checked at once
        my @t = (
                    $t,# 0 <TR BGCOLOR=eeeeee>
        $tp->get_token,# 1 <TD colspan=2>
        $tp->get_token,# 2 <font size=2>
        $tp->get_token,# 3 <A HREF="/index.pl?node_id=110171&lastnode_id=110166">
        $tp->get_token,# 4 Re: Name Space
        $tp->get_token,# 5 </A>
        $tp->get_token,# 6 <BR>
        $tp->get_token,# 7  by
        $tp->get_token,# 8 <A HREF="/index.pl?node_id=1936&lastnode_id=110166">
        $tp->get_token,# 9 japhy
        $tp->get_token,#10 </A>
        $tp->get_token,#11  on Sep 04, 2001 at 13:42
        $tp->get_token,#12 </font>
        $tp->get_token,#13 </TD>
        $tp->get_token,#14 </tr>
        );

        if(
           ($t[0][0] eq "S" and $t[0][1] eq "tr"
                  and $t[0][2]->{'bgcolor'} eq "eeeeee") and

           ($t[1][0] eq "S" and $t[1][1] eq "td") and
           ($t[2][0] eq "S" and $t[2][1] eq "font") and
           ($t[3][0] eq "S" and $t[3][1] eq "a") and # reply link
           ($t[4][0] eq "T") and # reply to original node
           ($t[5][0] eq "E" and $t[5][1] eq "a") and
           ($t[6][0] eq "S" and $t[6][1] eq "br") and
           ($t[7][0] eq "T" and $t[7][1] =~ /by/ ) and
           ($t[8][0] eq "S" and $t[8][1] eq "a") and # userlink
           ($t[9][0] eq "T" ) and # username
           ($t[10][0] eq "E" and $t[10][1] eq "a") and
           ($t[11][0] eq "T" and $t[11][1] =~ /on \w{3} \d{2}, \d{4} at/) and
           ($t[12][0] eq "E" and $t[12][1] eq "font") and
           ($t[13][0] eq "E" and $t[13][1] eq "td") and
           ($t[14][0] eq "E" and $t[14][1] eq "tr")
          )
        {
           print $t[3][4], # a href (original text of the start tag)
                 $t[9][1], # monk name
                 "</A>|\n";

           $monks{$t[9][1]}= "$t[3][4]" . "$t[9][1]</A>";
        }
      }
    } # endof while (my $t = $tp->get_token)

    undef $rawHTML; # no more raw html
    undef $tp;      # destroy the HTML::TokeParser object (don't need it no more)

    print "<H1> or sorted </H1>\n";

    for my $key (sort keys %monks)
    {
        print $monks{$key},"|\n";
    }


    __END__
    ## one token per line
    <TR BGCOLOR=eeeeee>
    <TD colspan=2>
    <font size=2>
    <A HREF="/index.pl?node_id=110171&lastnode_id=110166">
    Re: Name Space
    </A>
    <BR>
     by
    <A HREF="/index.pl?node_id=1936&lastnode_id=110166">
    japhy
    </A>
     on Sep 04, 2001 at 13:42
    </font>
    </TD>
    </tr>

and the output:

japhy| tilly| ichimunki| runrig| demerphq| shotgunefx| Masem| synapse0| agent00013| MrNobo1024| Corion| Zaxo| idnopheq| dragonchild| herveus| wine| TheoPetersen| toadi| dga| mexnix| cadfael| buckaduck| ybiC| {NULE}| theorbtwo| Jouke| gregor42| Guildenstern| sifukurt| CubicSpline| jackdied| suaveant| poqui| mikeB| davis| s173451000| PotPieMan| mr_mischief| earthboundmisfit| kwoff| Arguile| chaoticset| BrentDax| Aighearach| basicdez| brianarn| BooK| riffraff| seanbo| Maestro_007| stefan k| dthacker| Hero Zzyzzx| beretboy| Veachian64| giulienk| blakem| Chmrr|

or sorted:

Aighearach| Arguile| BooK| BrentDax| Chmrr| Corion| CubicSpline| Guildenstern| Hero Zzyzzx| Jouke| Maestro_007| Masem| MrNobo1024| PotPieMan| TheoPetersen| Veachian64| Zaxo| agent00013| basicdez| beretboy| blakem| brianarn| buckaduck| cadfael| chaoticset| davis| demerphq| dga| dragonchild| dthacker| earthboundmisfit| giulienk| gregor42| herveus| ichimunki| idnopheq| jackdied| japhy| kwoff| mexnix| mikeB| mr_mischief| poqui| riffraff| runrig| s173451000| seanbo| shotgunefx| sifukurt| stefan k| suaveant| synapse0| theorbtwo| tilly| toadi| wine| ybiC| {NULE}|

update: riight, but like I said, I'm deliberately matching only replies of depth 1, which all do conform (only 2nd level replies got the ul bug, and if I was parsing them, I'd just have the improper html in there regardless). I saw what you did ;D

___crazyinsomniac_______________________________________
Disclaimer: Don't blame. It came from inside the void
perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"
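The 15-way `and` chain above can also be written as a table of expected tokens. The sketch below is one way to do that, not crazyinsomniac's code: `@REPLY_PATTERN` and `match_tokens` are made-up names, and the tokens are hand-built in the array shapes HTML::TokeParser documents (`["S", $tag, $attr, $attrseq, $text]`, `["E", $tag, $text]`, `["T", $text, $is_data]`), so it runs without fetching the page. The bgcolor check is assumed to stay with the caller, as in the original loop.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One entry per expected token: [ type, tag name or regex for the text ].
my @REPLY_PATTERN = (
    [ S => 'tr' ], [ S => 'td' ], [ S => 'font' ],
    [ S => 'a'  ], [ T => qr/\S/ ], [ E => 'a' ],      # reply link + title
    [ S => 'br' ], [ T => qr/by/ ],
    [ S => 'a'  ], [ T => qr/\S/ ], [ E => 'a' ],      # user link + name
    [ T => qr/on \w{3} \d{2}, \d{4} at/ ],
    [ E => 'font' ], [ E => 'td' ], [ E => 'tr' ],
);

# True if every token matches the pattern entry at the same index.
sub match_tokens {
    my ($tokens, $pattern) = @_;
    return 0 unless @$tokens == @$pattern;
    for my $i (0 .. $#$pattern) {
        my ($type, $want) = @{ $pattern->[$i] };
        my $tok = $tokens->[$i];
        return 0 unless ref($tok) && $tok->[0] eq $type;
        if (ref($want)) { return 0 unless $tok->[1] =~ $want }
        else            { return 0 unless $tok->[1] eq $want }
    }
    return 1;
}

# The sample row from __END__, written out as tokens by hand.
my @tokens = (
    [ 'S', 'tr',   { bgcolor => 'eeeeee' }, ['bgcolor'], '<TR BGCOLOR=eeeeee>' ],
    [ 'S', 'td',   { colspan => 2 }, ['colspan'], '<TD colspan=2>' ],
    [ 'S', 'font', { size => 2 }, ['size'], '<font size=2>' ],
    [ 'S', 'a', { href => '/index.pl?node_id=110171&lastnode_id=110166' },
      ['href'], '<A HREF="/index.pl?node_id=110171&lastnode_id=110166">' ],
    [ 'T', 'Re: Name Space', '' ],
    [ 'E', 'a', '</A>' ],
    [ 'S', 'br', {}, [], '<BR>' ],
    [ 'T', ' by ', '' ],
    [ 'S', 'a', { href => '/index.pl?node_id=1936&lastnode_id=110166' },
      ['href'], '<A HREF="/index.pl?node_id=1936&lastnode_id=110166">' ],
    [ 'T', 'japhy', '' ],
    [ 'E', 'a', '</A>' ],
    [ 'T', ' on Sep 04, 2001 at 13:42', '' ],
    [ 'E', 'font', '</font>' ],
    [ 'E', 'td', '</TD>' ],
    [ 'E', 'tr', '</tr>' ],
);

print "$tokens[9][1] matched\n" if match_tokens(\@tokens, \@REPLY_PATTERN);
```

The trade-off is the same as in the original: a stray `<UL>` anywhere in the row makes the whole match fail, which is exactly what you want if only depth 1 replies should get through.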
Re: Re: Re: (crazyinsomniac) Re: Extract info from HTML by demerphq (Chancellor) on Nov 12, 2001 at 16:07 UTC
Well, to be mildly critical, there are two (or more, depending on how you look at it) scenarios we can/need to extract information from. Yours only matches one fixed version (yes, I know it was a deliberate decision :-). Now for the record (and in rehearsal for that tutorial you suggested :-) I'll list the others:

    # Main node on page (Top most)
    <TD valign=middle>
        <H3>Name Space</H3>
        <FONT size=2>
            ' by '
            <A HREF="/index.pl?node_id=103111&lastnode_id=110166">George_Sherston</A>
            ' on Sep 04, 2001 at 13:33'
        </FONT>
    </TD>

    # Primary Reply (crazyinsomniac's pattern)
    <TD colspan=2>
        <font size=2>
            <A HREF="/index.pl?node_id=110195&lastnode_id=110166">Re: Name Space</A>
            <BR>
            ' by '
            <A HREF="/index.pl?node_id=108447&lastnode_id=110166">demerphq</A>
            ' on Sep 04, 2001 at 15:46'
        </font>
    </TD>

    # Note that the <UL> tag is incorrectly nested with regards to the <FONT> tag
    # Reply to a reply
    <TD colspan=2>
        <UL>
            <font size=2>
                <A HREF="/index.pl?node_id=110244&lastnode_id=110166">Re: Re: Name Space</A>
                <BR>
                ' by '
                <A HREF="/index.pl?node_id=103111&lastnode_id=110166">George_Sherston</A>
                ' on Sep 05, 2001 at 01:58'
        </UL>
            </font>

    # Reply to a reply of a reply
    # each extra layer of depth has an extra <UL> tag inserted
    <TD colspan=2>
        <UL>
            <UL>
                <font size=2>
                    <A HREF="/index.pl?node_id=121046&lastnode_id=110166">Re: Re: Re: Name Space</A>
                    <BR>
                    ' by '
                    <A HREF="/index.pl?node_id=103111&lastnode_id=110166">George_Sherston</A>
                    ' on Oct 24, 2001 at 01:21'
            </UL>
        </UL>
                </font>
    </TD>

Note the buggy HTML? :-) So what I did was look for the content of the FONT tag. If it matches a 'finger print' for one of the following two:

    <font size=2>
          # Optional part begins
          <A HREF="/index.pl?node_id=121046&lastnode_id=110166">Re: Re: Re: Name Space</A>
          <BR>
          # Optional part ends
          ' by '
          <A HREF="/index.pl?node_id=103111&lastnode_id=110166">George_Sherston</A>
          ' on Oct 24, 2001 at 01:21'
    </font>

then I do a few more checks to make sure it isn't a spurious match. If they pass, I consider it the title/author/date of the node. A bit of extraction of the tags' attributes and presto, we have the home node and post node ids. (With the exception of the main post, where we can only extract the title, not the ID.)

This would be sooooo much easier if there were class attributes in the tags, such as `<TD class="post">`, but considering the buggy HTML, I suppose class attributes are low on the priority list. (BTW, can't wait to join the PM dev team, I'd like to have a crack at cleaning up some of the HTML, now that I'm getting into parsing it :-)

Yves / DeMerphq
--
Have you registered your Name Space?
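To make the fingerprint idea concrete, here is a rough sketch. It is not demerphq's actual script, which isn't shown in full in this thread. `parse_font_tokens` and `node_id` are hypothetical names, and the input is assumed to be the list of tokens found between `<font size=2>` and `</font>`, in HTML::TokeParser's array shapes, with whitespace-only text tokens already dropped. The title link and `<BR>` are treated as optional, so the same sub covers the main node and replies at any depth, however many `<UL>` tags surround the FONT tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pull the node id out of an href like /index.pl?node_id=103111&lastnode_id=110166
sub node_id {
    my ($href) = @_;
    my ($id) = ($href // '') =~ /[?&]node_id=(\d+)/;
    return $id;    # undef when there is no node_id parameter
}

# Returns a hashref (title, post_id, author, home_id, date) or nothing
# if the tokens don't fit either fingerprint.
sub parse_font_tokens {
    my @t = @_;
    my %post;

    # Optional part: <A href=post> title </A> <BR>
    if (@t >= 4 && $t[0][0] eq 'S' && $t[0][1] eq 'a'
                && $t[1][0] eq 'T'
                && $t[2][0] eq 'E' && $t[2][1] eq 'a'
                && $t[3][0] eq 'S' && $t[3][1] eq 'br') {
        $post{post_id} = node_id($t[0][2]{href});
        $post{title}   = $t[1][1];
        splice @t, 0, 4;
    }

    # Mandatory part: ' by ' <A href=home> author </A> ' on <date>'
    return unless @t == 5;
    return unless $t[0][0] eq 'T' && $t[0][1] =~ /^\s*by\s*$/;
    return unless $t[1][0] eq 'S' && $t[1][1] eq 'a';
    return unless $t[2][0] eq 'T';
    return unless $t[3][0] eq 'E' && $t[3][1] eq 'a';
    return unless $t[4][0] eq 'T';
    my ($date) = $t[4][1] =~ /^\s*on (\w{3} \d{2}, \d{4} at \d{2}:\d{2})/
        or return;

    $post{author}  = $t[2][1];
    $post{home_id} = node_id($t[1][2]{href});
    $post{date}    = $date;
    return \%post;
}

# Hand-built tokens for two of the samples above.
my @reply = (
    [ 'S', 'a', { href => '/index.pl?node_id=121046&lastnode_id=110166' },
      ['href'], '<A HREF="/index.pl?node_id=121046&lastnode_id=110166">' ],
    [ 'T', 'Re: Re: Re: Name Space', '' ],
    [ 'E', 'a', '</A>' ],
    [ 'S', 'br', {}, [], '<BR>' ],
    [ 'T', ' by ', '' ],
    [ 'S', 'a', { href => '/index.pl?node_id=103111&lastnode_id=110166' },
      ['href'], '<A HREF="/index.pl?node_id=103111&lastnode_id=110166">' ],
    [ 'T', 'George_Sherston', '' ],
    [ 'E', 'a', '</A>' ],
    [ 'T', ' on Oct 24, 2001 at 01:21', '' ],
);
my @main = (
    [ 'T', ' by ', '' ],
    [ 'S', 'a', { href => '/index.pl?node_id=103111&lastnode_id=110166' },
      ['href'], '<A HREF="/index.pl?node_id=103111&lastnode_id=110166">' ],
    [ 'T', 'George_Sherston', '' ],
    [ 'E', 'a', '</A>' ],
    [ 'T', ' on Sep 04, 2001 at 13:33', '' ],
);

for my $tokens (\@reply, \@main) {
    my $post = parse_font_tokens(@$tokens) or next;
    printf "%s by %s (home %s) on %s\n",
        $post->{title} // '(main node)',
        $post->{author}, $post->{home_id}, $post->{date};
}
```

For the main node this sketch leaves `title` and `post_id` unset, since the title sits in the `<H3>` outside the FONT tag and has no link to take an ID from.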
