http://qs1969.pair.com?node_id=51712

mkmcconn has asked for the wisdom of the Perl Monks concerning the following question:

Today

I made my first attempt at using LWP. It actually started a few weeks ago, when I took 10 seconds to whip up a commandline that returns a printout of the vanity block on my home node.
% clear; lynx -dump -nolist http://www.perlmonks.org/index.pl?node=mkmcconn | grep -A 6 "User since:"

That's pretty simple, and the output is just what you would expect:

   User since:       Mon Dec 4 at 20:46
   Last here:        Sat Jan 13 at 23:50 (54 minutes ago)
   Experience:       131
   Level:            scribe (4)
   Writeups:         11
   Location:         Portland, Oregon
   User's localtime: Sat Jan 13 at 18:14

Works swell. So, I stored it in a script, and then began to give some thought to doing this in Perl.

#!/usr/bin/perl -w
# vainmonk
use strict;
print `lynx -dump -nolist http://www.perlmonks.org/index.pl?node$ARGV[0]`;

Hardly a Perl script at that point, but from the command line it was invoked like this:

% vainmonk =mkmcconn | egrep -A 6 "User since:"

Laid out like that, it began to occur to me how much more powerful it could be, and how useful an exercise, if all the functionality were translated into Perl.

My first attempt

I was very surprised by the simplicity of LWP for simple tasks. Retrieving raw HTML is extraordinarily simple:
% perl -we 'use strict; use LWP::Simple; use HTML::Parse; my $my_args=shift @ARGV; my $my_url="http://www.perlmonks.org/index.pl?node"."$my_args"; getprint($my_url); ' =mkmcconn
But I don't want raw HTML. Seemingly not a problem. LWP includes the format method from HTML::FormatText.
% perl -we 'use strict; use LWP::Simple; use HTML::Parse; my $my_args=shift @ARGV; my $my_url="http://www.perlmonks.org/index.pl?node"."$my_args"; print parse_html(get($my_url))->format; ' =mkmcconn

This is where I encountered my obstacle. The output of the script above (as most of you already know), is this:

    TABLE NOT SHOWNTABLE NOT SHOWNTABLE NOT SHOWN

   This page brought to you by the crazy folks at The Everything
   Development Company and maintained by Tim Vroom
   Interested in Advertising? Contact our ad-meister, Robo

HTML::Format does not handle the contents of tables.

So, that's as far as I got today. When I pick up the project again on Monday afternoon, I'll look more closely at the docs for LWP::Simple, HTML::Parse, and HTML::FormatText. There are numerous articles here on Perl Monks. I also plan to look at Parse::RecDescent to see if it's relevant to the task.

The Perl Journal #17 is mentioned several times in Perl Monks articles - but, that site is very broken right now.

Will the Contemplative Order of Perl Monks honor me with your wisdom, by which this lowly scribe might be brought more suddenly into the light? Brethren, if I ascend slopes of lofty Mt. CPAN, will I find my answer among the archives of the scriptures there? Or, is this an exercise in private meditation?

wordily yours: mkmcconn
eagerly awaiting your insights
I hope these musings are not impertinent.
Although admittedly, they are not phrased importunately.

Replies are listed 'Best First'.
Re: lwp diary: day 1
by eg (Friar) on Jan 14, 2001 at 14:42 UTC

    Yeah, parsing HTML is a royal pain. All of the mixing of content and presentation. Ack. If only more people would use CSS to describe the markup semantically. For example, if the home nodes had tables that looked like this:

    <table class='adfu'> ... </table>
    ...
    <table class='body'>
      <table class='userdata'> ... </table>
    </table>

    Then it would be dead easy to rip through the html and grab whatever section you want (not that I'm complaining -- I haven't even put in a suggestion or anythin'.) It would also make theming the site simpler as all you'd need to do is pass around a couple of style sheets.

    Anyhow, I think I'm drifting Off Topic.

    Given this problem, I would use LWP::Simple and HTML::Parser (which I find more generally useful than HTML::TreeBuilder, although you might also want to check out HTML::SimpleParse). What we've got to do is examine the html source and discover the meaning embedded within the page's structure. (By the way, for this reason, I think it's almost always better to work with raw rather than interpreted html.)

    In this case, getting the user data from the home node turns out to be pretty easy. All you need to do is grab the data out of the sixth table in the html. Here's a start:

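    use LWP::Simple;
    use HTML::Parser;

    my $monk    = $ARGV[0] || 'eg';
    my $content = get("http://perlmonks.org/?node=$monk");
    die "Couldn't fetch the home node for $monk\n" unless defined $content;

    my $tables_seen   = 0;
    my $in_user_table = 0;

    my $parser = HTML::Parser->new(
        # Start tag: the sixth <table> on the page holds the user data.
        start_h => [ sub {
            if ( $_[0] eq 'table' ) {
                $in_user_table = 1 if ++$tables_seen == 6;
            }
        }, "tagname" ],
        # End tag: any </table> closes the region of interest.
        end_h => [ sub {
            $in_user_table = 0 if $_[0] eq 'table';
        }, "tagname" ],
        # Text: print it only while inside the user table.
        text_h => [ sub {
            print @_ if $in_user_table;
        }, "dtext" ],
    );

    $parser->parse( $content );
    $parser->eof;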

    The anonymous subs in start_h, end_h and text_h are event handlers that are called when the parser encounters a start-element tag, an end-element tag and text, respectively. The string after each anonymous sub specifies which arguments that handler receives.

    What needs to be done now is to change the start, end and text handlers to put the user data into a hash (note that for every table row, the key is in the first <td> and the value is in the second).
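One way that change might look, as a minimal sketch rather than a finished solution. It runs against a made-up sample table instead of the live page, so `$sample_html` and the helper name `table_to_hash` are assumptions for illustration, and the real home node markup may well be messier (nested tags, nested tables).

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical stand-in for the user table on a home node;
# the real markup is an assumption and may differ.
my $sample_html = <<'HTML';
<table>
<tr><td>User since:</td><td>Mon Dec 4 at 20:46</td></tr>
<tr><td>Experience:</td><td>131</td></tr>
<tr><td>Level:</td><td>scribe (4)</td></tr>
</table>
HTML

# Collect "first cell => second cell" from every table row into a hash.
sub table_to_hash {
    my ($html) = @_;
    my ( %data, @cells );
    my $in_cell = 0;

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag) = @_;
            @cells = () if $tag eq 'tr';    # a new row starts
            if ( $tag eq 'td' ) {           # a new cell starts
                push @cells, '';
                $in_cell = 1;
            }
        }, 'tagname' ],
        end_h => [ sub {
            my ($tag) = @_;
            $in_cell = 0 if $tag eq 'td';
            if ( $tag eq 'tr' && @cells >= 2 ) {
                my ( $key, $value ) = @cells[ 0, 1 ];
                s/^\s+|\s+$//g for $key, $value;    # trim whitespace
                $key =~ s/:$//;                     # drop the trailing colon
                $data{$key} = $value;
            }
        }, 'tagname' ],
        text_h => [ sub {
            # Text can arrive in several chunks, so append to the current cell.
            $cells[-1] .= $_[0] if $in_cell;
        }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;
    return %data;
}

my %user = table_to_hash($sample_html);
print "$_ => $user{$_}\n" for sort keys %user;
```

To use it on the real page, the table-counting logic from the script above would still be needed to restrict the handlers to the user table.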

    HTH.

Re: lwp diary: day 1
by Kanji (Parson) on Jan 14, 2001 at 11:51 UTC
    Brethren, if I ascend slopes of lofty Mt. CPAN, will I find my answer among the archives of the scriptures there?

    Like HTML::TableExtract, perhaps?

        --k.
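For anyone curious what that looks like in practice, here is a rough sketch, assuming a reasonably recent HTML::TableExtract (one that provides `tables` and `rows`). The sample markup and the helper name `two_column_rows` are made up for illustration; on the live page you would probably narrow the match with the module's `depth`/`count` or `headers` options instead of filtering on row width.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical stand-in for a home node; the real markup is an assumption.
my $sample_html = <<'HTML';
<table><tr><td>Navigation bar</td></tr></table>
<table>
<tr><td>User since:</td><td>Mon Dec 4 at 20:46</td></tr>
<tr><td>Experience:</td><td>131</td></tr>
</table>
HTML

# Return every two-cell row found in any table as a [label, value] pair.
sub two_column_rows {
    my ($html) = @_;
    my $te = HTML::TableExtract->new;    # no constraints: match all tables
    $te->parse($html);

    my @pairs;
    for my $table ( $te->tables ) {
        for my $row ( $table->rows ) {
            my @cells = map { defined $_ ? $_ : '' } @$row;
            next unless @cells == 2;     # skip layout rows
            s/^\s+|\s+$//g for @cells;   # trim whitespace
            push @pairs, [@cells];
        }
    }
    return @pairs;
}

printf "%-18s %s\n", @$_ for two_column_rows($sample_html);
```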


Re: lwp diary: day 1
by ichimunki (Priest) on Jan 14, 2001 at 18:31 UTC

    Table parsing is a pain in the 4ss. I love that particular message-- especially since it does nothing to even extract the text from the target. For ultra-simple parsing I replace all td end tags with a spacer and all tr end tags with a newline.
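A sketch of that ultra-simple approach, using only core Perl. The helper name `flatten_table` and the choice of a tab as the spacer are assumptions, and the final tag-stripping regex is deliberately naive: it will trip over a `>` inside an attribute value and does not decode entities.

```perl
use strict;
use warnings;

# Turn table markup into tab-separated lines of text.
sub flatten_table {
    my ($html) = @_;
    $html =~ s{</td\s*>}{\t}gi;    # end of a cell becomes a spacer
    $html =~ s{</tr\s*>}{\n}gi;    # end of a row becomes a newline
    $html =~ s{<[^>]*>}{}g;        # drop whatever tags remain (naive)
    $html =~ s/[ \t]+$//mg;        # strip trailing whitespace on each line
    return $html;
}

print flatten_table(
    '<table><tr><td>User since:</td><td>Mon Dec 4 at 20:46</td></tr>'
  . '<tr><td>Level:</td><td>scribe (4)</td></tr></table>'
);
```

The same substitutions could be applied to the raw page before handing it to a text formatter, so the table contents survive as plain text.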

    You can use HTML::Parser, or you can use HTML::TokeParser (which is a little easier, imho, to get started with-- and is basically a wrapper on the HTML::Parser module). With it you can simply $page->get_token() until you get to a text token which matches your "User since:" test. Then you can pull all the text tokens until you get to either a tag or a text signal that you are done with the user info (and/or custom) portion of the node. And then exit the parse routine.
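The token loop described above might be sketched like this. The helper name `user_info_lines` is hypothetical, and stopping at the first closing `</table>` after the match is an assumption about where the user info ends; the real page may need a different stop signal.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Collect text tokens from the "User since:" match up to the next </table>.
sub user_info_lines {
    my ($html) = @_;
    my $page = HTML::TokeParser->new( \$html )
        or die "Couldn't set up the parser\n";

    my ( @lines, $collecting );
    while ( my $token = $page->get_token ) {
        my ( $type, $second ) = @$token;
        if ( $type eq 'T' ) {                      # text token
            my $text = $second;
            $collecting = 1 if $text =~ /User since:/;
            next unless $collecting;
            $text =~ s/^\s+|\s+$//g;               # trim whitespace
            push @lines, $text if length $text;
        }
        elsif ( $type eq 'E' && $second eq 'table' && $collecting ) {
            last;                                  # user table has ended
        }
    }
    return @lines;
}

# Hypothetical sample markup, not the real home node.
my $sample_html = '<table><tr><td>User since:</td><td>Mon Dec 4 at 20:46</td></tr>'
                . '<tr><td>Level:</td><td>scribe (4)</td></tr></table><p>footer</p>';
print "$_\n" for user_info_lines($sample_html);
```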

Re: lwp diary: day 1
by TStanley (Canon) on Jan 14, 2001 at 19:36 UTC
    mkmcconn, why don't you put together a review of your use of this module for the rest of us.
    I am fairly sure some of us would like to see one, since there isn't a review at this time.

    TStanley
    In the end, there can be only one!