With a bit of guidance from choroba, here is the tool I ended up with. It worked great for me and got this report off my to-do list.

#!/usr/bin/env perl
use strict;
use warnings;

use FindBin;
use File::Util;
use Data::Dumper;
use HTML::TableExtract qw(tree);
use Lingua::EN::NameParse::Simple;

=head1 NAME 

mm_subscriber_2csv.pl

=head1 VERSION 

Version 0.01 

=head1 SYNOPSIS

=over

mm_subscriber_2csv.pl > mm_subscriber_list.csv

=back

This script assumes that it is located in a directory full of html
files downloaded from a Mailman listserv's Member Management tool.
These are the web forms which permit one to manage the list
subscription of each subscriber to a list.  

This script will parse each html file and harvest the name and email
address of each subscriber, printing them on STDOUT in csv format,
ready to be redirected into a file for importation into a database.

=cut

print "'TITLE','FIRST','MIDDLE','LAST','SUFFIX','EMAIL'\n";
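my $f = File::Util->new();
my @html_files = $f->list_dir(
    $FindBin::Bin, '--files-only', '--pattern=\.html' );
foreach my $html_file (@html_files) {
    # list_dir returns bare file names by default, so anchor each one to
    # the script's directory instead of relying on the current directory.
    my $path = "$FindBin::Bin/$html_file";
    open( my $fh, '<', $path )
        or die "Unable to open $path: $!\n";
    my $html = do { local $/; <$fh> };    # slurp the whole file
    close($fh);
    parse_subscriber_list($html);
}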

sub parse_subscriber_list {
    my $html = shift;
    my $te = HTML::TableExtract->new(
        headers => [ 'unsub', 'member', 'mod', 'hide', 
                     'nomail', 'ack', 'not metoo', 
                     'nodupes', 'digest', 'plain', 'language' ] );

    my $row_count;
    $te->parse($html);
    foreach my $ts ($te->tables){
        foreach my $row ($ts->rows){
            $row_count++;
            my $name = $row->[1]->content->[2]->attr('value');
            my %name = Lingua::EN::NameParse::Simple::ParseName($name);
            my $email = $row->[1]->content->[3]->attr('value');
            $email =~ s/%40/\@/;
            my @record;
            foreach my $field (qw/ TITLE FIRST MIDDLE LAST SUFFIX / ){
                $name{$field} ||= '';
                push @record, $name{$field};
            }
            push @record, $email;
            my $record = "'" . join( "','", @record ) . "'";
            print $record . "\n";
        }        
    }
}

exit;

=head1 ACKNOWLEDGEMENTS

I publish this with appreciation to the authors of the modules which
made it possible, and to choroba from the Czech Republic, who shared a
clue with me by way of PerlMonks.org on how to more effectively use
HTML::TableExtract.

Thanks again to the Perl community who made CPAN and PerlMonks
available to us all.

=head1 LICENSE

This script is made available subject to the conditions of the GNU
General Public License, v2.  You are welcome to use and modify this
code so long as any redistribution is made subject to the same terms.

=head1 COPYRIGHT

2012, Hugh Esco, YMD Partners LLC; dba/ http://CampaignFoundations.com/

=cut
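
One thing to watch in the script above: each field is wrapped in single quotes with no escaping, so a name such as O'Brien would yield a malformed row. Below is a minimal sketch of one way around that. It assumes the importing database accepts the common CSV convention of double-quoted fields with embedded double quotes doubled. The helper names `csv_field` and `csv_record` are hypothetical and not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: quote one field, doubling any embedded double quotes.
sub csv_field {
    my $value = shift;
    $value = '' unless defined $value;    # treat missing name parts as empty
    $value =~ s/"/""/g;
    return qq{"$value"};
}

# Hypothetical helper: join already-parsed fields into one CSV line.
sub csv_record {
    return join( ',', map { csv_field($_) } @_ );
}

# Sample data only; the address is made up for illustration.
print csv_record( 'Mr.', 'Pat', '', q{O'Brien}, 'Jr.', 'pat@example.com' ), "\n";
```

If this convention suits the target database, the header line and the `join` in `parse_subscriber_list` could be swapped for calls to `csv_record`. For anything more demanding, such as embedded newlines or other encodings, a dedicated CSV module from CPAN is probably the safer route.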

if( $lal && $lol ) { $life++; }
if( $insurance->rationing() ) { $people->die(); }

In reply to Re: Parsing html snippet, help appreciated. by hesco
in thread Parsing html snippet, help appreciated. by hesco