PerlMonks  

Parsing html snippet, help appreciated.

by hesco (Deacon)
on Sep 29, 2012 at 19:54 UTC ( [id://996419]=perlquestion )

hesco has asked for the wisdom of the Perl Monks concerning the following question:

This script is intended to parse the Membership Management page in the mailman administrative interface in order to harvest the name and email address of each subscriber.

Using HTML::TableExtract in text mode (with the tree import commented out) gives me access to the email address of each subscriber. But as shown in the sample at the bottom of the script, I am unable to extract the name from the HTML form input tag, where it exists as the default value for the text box. I have found no documentation on how to extract the raw HTML so I can parse it myself. Uncommenting the import on line 4 gives me access to objects which presumably include that data, but which so far seem impenetrable.

Can anyone please advise how I can get unstuck on this project?

#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TableExtract; # qw(tree);
use HTML::ElementTable;
use Data::Dumper;
use FindBin;
use File::Util;

# This script is intended to parse the Membership Management page
# in the mailman administrative interface in order to harvest
# the name and email address of each subscriber.

my($f) = File::Util->new();
my (@html_files) = $f->list_dir("$FindBin::Bin",'--files-only','--pattern=05\.html');

foreach my $html_file ( @html_files ){
  my $html;
  open( 'HTML', '<', $html_file ) or die "Unable to open $html_file\n";
  while(<HTML>){
    $html .= $_;
  }
  close(HTML);
  parse_subscriber_list( $html );
}

sub parse_subscriber_list {
  my $html = shift;
  my $te = HTML::TableExtract->new(
    headers => [ 'unsub', 'member', 'mod', 'hide', 'nomail', 'ack', 'not metoo', 'nodupes', 'digest', 'plain', 'language' ] );
  my $row_count;
  $te->parse($html);
  foreach my $ts ($te->tables){
    foreach my $row ($ts->rows){
      $row_count++;
      # chomp( @{$row} );
      print "name: email: $row->[1] \n";
    }
  }
}

exit;

__DATA__
<td><a href="http://lists.example.net/options.cgi/updates-example.net/hesco--at--example.net">hesco@example.net</a><br><input name="hesco%40example.net_realname" type="TEXT" value="Hugh Esco" size="33"><input name="user" type="HIDDEN" value="hesco%40example.net"></td>

Please see comment below for final solution.

Thanks,

-- Hugh Esco

if( $lal && $lol ) { $life++; }
if( $insurance->rationing() ) { $people->die(); }
Vote Jill Stein on November 6th!

Replies are listed 'Best First'.
Re: Parsing html snippet, help appreciated.
by choroba (Cardinal) on Sep 29, 2012 at 21:03 UTC
    Uncomment the qw/tree/ import argument for HTML::TableExtract to enable tree mode. Then, instead of $row->[1], go deeper into the structure of the returned object:
    $row->[0]->content->[3]->attr('value');
    See HTML::Element for methods of the elements.
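
For readers who want to see those HTML::Element methods in action, here is a minimal sketch. Note that it swaps in HTML::TreeBuilder directly rather than HTML::TableExtract's tree mode, so it is not the code from this thread. The helper name extract_subscribers, the `<table><tr>` wrapper around the sample cell, and the assumption that every real-name input has a name ending in `_realname` are all illustrative choices based on the one sample cell in the question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample cell from the question, wrapped in a table row (an assumption)
# so that the fragment parses as ordinary table markup.
my $sample_html = <<'HTML';
<table><tr>
<td><a href="http://lists.example.net/options.cgi/updates-example.net/hesco--at--example.net">hesco@example.net</a><br><input name="hesco%40example.net_realname" type="TEXT" value="Hugh Esco" size="33"><input name="user" type="HIDDEN" value="hesco%40example.net"></td>
</tr></table>
HTML

# Hypothetical helper: returns one { name => ..., email => ... } hashref
# for each "<address>_realname" text input found in the page.
sub extract_subscribers {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @subscribers;

    # The real name is the default value of the *_realname input.
    for my $input ( $tree->look_down( _tag => 'input', name => qr/_realname\z/ ) ) {
        # Walk up to the enclosing cell, then find the address link inside it.
        my $cell = $input->look_up( _tag => 'td' ) or next;
        my $link = $cell->look_down( _tag => 'a' );
        push @subscribers, {
            name  => $input->attr('value') // '',
            email => $link ? $link->as_text : '',
        };
    }

    $tree->delete;    # free the parse tree explicitly
    return @subscribers;
}

printf "name: %s email: %s\n", $_->{name}, $_->{email}
    for extract_subscribers($sample_html);
```

Matching on the input's name attribute rather than on a fixed position in the cell's content list is a deliberate choice here: it should keep working if whitespace or extra elements shift the child indexes.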
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      choroba:

      As some would say in this country, you are the man!

      Thank you again sir. I regret only that I can upvote your comment but once. That advice was spot on, and led me within less than an hour of experimentation and refinement to a solution to my issue and the delivery of my report. On to the next ticket!

      -- Hugh

Re: Parsing html snippet, help appreciated.
by hesco (Deacon) on Sep 29, 2012 at 22:54 UTC

    With a bit of guidance from choroba, I offer here this tool which worked great for me and got this report off my to-do list.

