Sigmund has asked for the wisdom of the Perl Monks concerning the following question:

hi brethren,
it's been a long time since i last posted on perlmonks, but there's a question that i want to solve. i have an excellent webmail account on fastmail.fm, and i usually dump the messages i've read as simple html files, to keep as backups.

their html has often changed to suit their UI tweaks, and this wouldn't be a great issue if i hadn't coded a short Perl program to do these things, in sequence:

- parse a dumped html file

- search for a line containing date and "From:" field

- extract the email of the sender or his/her name

- use this info to rename the file itself as "sender040809.html"

it also checks for duplicates and adds a progressive I, II, III, IV, V, and it looks up known senders in a hash and renames them accordingly. this way, if someone's email is, say, askdoedjpaauykja@mmmmmmmm.com, i can rename it to John simply by inserting a line in the hash. but that interface has been tweaked so much that my script sometimes fails unexpectedly, and i cannot constantly follow every small change.

so, is there (as i suspect) a way to parse these html files to pick a sender's email address out of a mess of tags, to reliably detect a Date, and to merge the two to serve the purpose?

I mean, is the module for parsing html code that powerful? my ugly but handy program follows.
thanks!

#!/usr/bin/perl
#
# Mailman
#
$ARGV[0] || die "\n\tUsage: rp FILENAME\n\n";
open (INF, "< $ARGV[0]") || die "\n\tFile does not exist.\n\n";

%mesi=(
    'Jan' => ' 01 ',
    'Feb' => ' 02 ',
    'Mar' => ' 03 ',
    'Apr' => ' 04 ',
    'May' => ' 05 ',
    'Jun' => ' 06 ',
    'Jul' => ' 07 ',
    'Aug' => ' 08 ',
    'Sep' => ' 09 ',
    'Oct' => ' 10 ',
    'Nov' => ' 11 ',
    'Dec' => ' 12 '
);

%nomi=(
    'elenaleonardi' => 'ele',
    'lellobove' => 'cte',
    'dark.prg' => 'mazzini',
    'robertopar' => 'parolisi',
    'meloinfo' => 'melo',
    'massimogab' => 'max',
    'simofiore' => 'simone',
    'mpagnucci'=> 'pagnucci',
    'zeromega'=> 'pagnucci',
    'melojunior'=> 'melo',
    'mmelillo'=> 'melo',
    'marcomelillo'=> 'melo',
    'matteobagn'=> 'matteo',
    'uncoucou@fastmail.fm' => 'lorena',
    'zappagalattica' => 'zappa',
    'gavrilus' => 'zappa',
    'carloalbertodue' => 'carlone',
    'bugman996' => 'bug',
    'michel' => 'ziobudda',
    'Bagnoli' => 'matteo',
    'salciaiola@infinito.it' => 'stegualerci',
);

while (<INF>) {
    if (/From\&nbsp\;/ || /From\&\#160\;/ || /From /) {
        s/.*From\&nbsp;/From/;
        s/.*From\&\#160\;/From/;
        s/.*From /From/;
        s/\&nbsp;/ /g;
        s/PM.*/PM\n/g;
        s/AM.*/AM\n/g;
        s/<[^>]+>/ /g;
        s/\&quot;/\"/g;
        s/Date.+,//;
        s/&lt\;/\</;
        s/\@.*&gt\;/\>/;
        s/From\s+//;
        s/\s+/ /g;
        /, .*2003/;
        s/, //;
        s/200/0/;
        s/ (\d) /0$1 /;
        s/ ([A-Z][a-z][a-z]) /$mesi{$1}/;
        s/(\d\d) (\d\d) (\d\d)/$3$2$1/;
        s/\".+\"//;
        s/[\<\>]//g;
        s/(\d\d\d\d\d\d).+/$1\.html/;
        s/\s+//g;
        s/\"//g;
        print "Original: ";
        print;
        print "\n";
        $neunam=$_;
        s/\d+.html//;
        if (exists $nomi{$_}) {
            $corr=$nomi{$_};
            $old=$_;
            $_=$neunam;
            s/$old/$corr/;
            $neunam=$_;
        }
        print "Renamed as: ";
        print $neunam;
        print "\n";
        last
    }
}
close (INF);

rename $ARGV[0], $neunam if (! -e $neunam);
if (-e $neunam) {
    $_=$neunam;
    s/.html/II.html/;
    $neunam=$_;
    rename $ARGV[0], $neunam if (! -e $neunam);
}
if (-e $neunam) {
    $_=$neunam;
    s/II.html/III.html/;
    $neunam=$_;
    rename $ARGV[0], $neunam if (! -e $neunam);
}
if (-e $neunam) {
    $_=$neunam;
    s/III.html/IV.html/;
    $neunam=$_;
    rename $ARGV[0], $neunam if (! -e $neunam);
}
if (-e $neunam) {
    $_=$neunam;
    s/IV.html/V.html/;
    $neunam=$_;
    rename $ARGV[0], $neunam if (! -e $neunam);
}
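
The chain of near-identical `if (-e $neunam)` blocks at the end of the script could probably be collapsed into one loop. The following is only a sketch of that idea, not the original author's code: the `unique_name` helper and its optional `$exists` callback are hypothetical names introduced here, and the suffix order (none, II, III, IV, V) is assumed from what the script above actually does.

```perl
use strict;
use warnings;

# Suffixes tried in order; the empty string means "no suffix yet".
my @SUFFIXES = ('', qw(II III IV V));

# Return the first free "<stem><suffix>.html", or nothing if all are taken.
# $exists is an optional callback (hypothetical, for illustration) so the
# logic can be exercised without touching the disk; it defaults to -e.
sub unique_name {
    my ($stem, $exists) = @_;
    $exists ||= sub { -e $_[0] };
    for my $suffix (@SUFFIXES) {
        my $candidate = "$stem$suffix.html";
        return $candidate unless $exists->($candidate);
    }
    return;
}

# Pretend two dumps from the same sender and day already exist.
my %taken = map { $_ => 1 } qw(john040809.html john040809II.html);
print unique_name('john040809', sub { $taken{ $_[0] } }), "\n";   # john040809III.html
```

In the real script the callback would be left out, and the result would feed a single `rename $ARGV[0], $name` guarded by `defined $name`.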

janitored by ybiC: Balanced <readmore> tags around longish codeblock, also minor format tweaks for legibility

Re: Renaming html email dumps according to sender and date
by wfsp (Abbot) on Aug 08, 2004 at 17:49 UTC
    Hi Sigmund,

    Parsing html is always a pain and if it's a moving target, like you explain, it's even harder.

    I would recommend considering an html parser. I've been using HTML::TokeParser::Simple lately. It is very easy to use and maintain, and you could quickly adapt to any changes.

    Your impressive list of regexes may also be vulnerable to changes and I would find that much more difficult to maintain.

    Also, some of your regexes are decoding html entities. I use the imaginatively named HTML::Entities to do that.
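
    To illustrate the HTML::Entities suggestion, here is a minimal sketch. It assumes the module is installed (it ships with the HTML-Parser distribution, not with core Perl), and the input string is just a From value of the kind these dumps contain.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# A From value as it might appear in the dumped html, entities still encoded.
my $raw  = '&quot;Delboy&quot; &lt;delboyenterprises@hotmail.com&gt;';

# decode_entities turns &quot; &lt; &gt; (and &nbsp;, &#160;, ...) into characters.
my $from = decode_entities($raw);
print "$from\n";   # "Delboy" <delboyenterprises@hotmail.com>
```

    One call like this stands in for the separate substitutions on &quot;, &lt;, &gt; and &nbsp;.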

    Could you post a snippet of the html?

      I could have easily posted a snippet, but it's a great mess and it's incredibly huge!
      btw, this is an example:

      Edit by tye, add READMORE

        Thanks. Sadly I'm at work at the moment and I have to clean a 4 unit web offset press : (

        I'll have a look later today.

Re: Renaming html email dumps according to sender and date
by wfsp (Abbot) on Aug 09, 2004 at 14:37 UTC
    Here's my first stab.
    #! perl
    use strict;
    use warnings;
    use HTML::TokeParser;

    my $p = HTML::TokeParser->new("sample.html") || die "Can't open: $!";
    my %hash;

    while (my $t = $p->get_token){
        # only start tags of the form <th class="DatTh">
        next unless $t->[0] eq 'S' and $t->[1] eq 'th'
            and defined $t->[2]{'class'} and $t->[2]{'class'} eq 'DatTh';
        my $key = $p->get_text('/th');
        chop $key;                          # drop the trailing &nbsp;
        $p->get_tag('td');
        my $value = $p->get_text('/td');
        $value =~ s/^\s*//;
        $value =~ s/\s*\[Add\]\s*$//;       # strip the "[Add]" link text
        $hash{$key} = $value;
    }

    open OUT, '>', 'out.txt' or die;
    for my $key ( keys %hash ){
        print OUT "$key => $hash{$key}\n"
    }
    close OUT;
    produces...
    Subject => PMR446
    To => sigmund@fastmail.fm
    Date => Mon, 9 Aug 2004 10:28 AM  ( 59 mins 50 secs ago )
    From => "Delboy" <delboyenterprises@hotmail.com>
    simplified extract from html...
    <table width="100%">
      <tr align="left">
        <th class="DatTh" width="0%">Date&nbsp;</th>
        <td class="DatTd" width="95%">Mon, 9 Aug 2004 10:28 AM &nbsp;<small>( 59 mins 50 secs ago )</small></td>
        <td class="DatTd" rowspan="3" align="center" valign="top" width="5%"><font size="-2"><a href="xxxxxxx">Text&nbsp;view</a><br><b>HTML&nbsp;view</b><br><a href="xxxx">Print&nbsp;view</a></font></td>
      </tr>
      <tr align="left">
        <th class="DatTh" width="0%">From&nbsp;</th>
        <td class="DatTd" width="95%">&quot;Delboy&quot; &lt;delboyenterprises@hotmail.com&gt; <a title="Add addresses to your address book" href="xxxx">[Add]</a></td>
      </tr>
      <tr align="left">
        <th class="DatTh" width="0%">To&nbsp;</th>
        <td class="DatTd" width="95%">sigmund@fastmail.fm<a title="Add addresses to your address book" href="xxxxx">[Add]</a></td>
      </tr>
      <tr align="left">
        <th class="DatTh" width="0%">Subject&nbsp;</th>
        <td class="DatTd" width="95%">PMR446</td>
        <td class="DatTd" align="center" width="5%"><font size="-2"><a href="xxxx">Show&nbsp;full&nbsp;header</a></font></td>
      </tr>
    </table>
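
    Once the From value has been pulled out as plain text, turning it into the name part of the filename could look roughly like this. It is a sketch under assumptions: `sender_stem` and the `%alias` table are hypothetical names (the table plays the role of the `%nomi` hash in the original script), and keying on the part before the `@` is a simplification, since the original hash also holds a few full addresses.

```perl
use strict;
use warnings;

# Hypothetical alias table, in the spirit of %nomi from the original script.
my %alias = ( delboyenterprises => 'delboy' );

# Reduce a decoded From value to a short name suitable for a filename.
sub sender_stem {
    my ($from) = @_;

    # Prefer the address: take the part before the @ sign.
    if ( $from =~ /([\w.+-]+)\@[\w.-]+/ ) {
        my $local = lc $1;
        return $alias{$local} // $local;
    }

    # No address found: fall back to the display name, letters and digits only.
    ( my $name = lc $from ) =~ s/\W+//g;
    return $alias{$name} // $name;
}

print sender_stem('"Delboy" <delboyenterprises@hotmail.com>'), "\n";   # delboy
print sender_stem('sigmund@fastmail.fm'), "\n";                        # sigmund
```

    Because the regex works on the extracted text rather than the raw markup, it should not care how the surrounding tags change.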
      Just to add to this excellent reply: you can use the Date::Manip module to convert the extracted date string to YYMMDD format and so create a date-stamped output filename (after stripping the extra date info in brackets).

      Although this module could well be considered overkill for this simple task, it can deal with a huge variety of input formats, so is ideal for a situation (like this) where the expected data is subject to change outside your control.
      #!/usr/bin/perl -w
      use strict;
      $\ = "\n";
      use Date::Manip;
      &Date_Init("TZ=GMT","DateFormat=UK"); # or whatever
      while (<DATA>) {
          print UnixDate($_,"%y%m%d");
      }
      __DATA__
      Mon, 9 Aug 2004 10:28 AM
      9th August 2004 10:28
      10:28 9/8/2004
      Aug 9 2004 10:28
      10:28 09 August 2004
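
      If installing Date::Manip is not an option, the core Time::Piece module can produce the same YYMMDD stamp, at the cost of handling only one layout. This sketch swaps in Time::Piece for Date::Manip and assumes the date always contains a "9 Aug 2004" style day, English month abbreviation and four-digit year; `date_stamp` is a hypothetical helper name.

```perl
use strict;
use warnings;
use Time::Piece;

# Return a YYMMDD stamp for the first "D Mon YYYY" found in $text,
# or nothing if no such date is present.
sub date_stamp {
    my ($text) = @_;
    my ($dmy) = $text =~ /(\d{1,2} [A-Z][a-z]{2} \d{4})/ or return;
    return Time::Piece->strptime( $dmy, '%d %b %Y' )->strftime('%y%m%d');
}

# The trailing "( ... ago )" part is ignored because only the date is matched.
print date_stamp('Mon, 9 Aug 2004 10:28 AM ( 59 mins 50 secs ago )'), "\n";   # 040809
```

      Unlike Date::Manip, this will not cope with the other input formats listed in the DATA section above, so it trades flexibility for having no extra dependency.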
        Hello, brethren!
        This is what I did after your valuable advice.
        Now my mailman program correctly parses that messy html and extracts the right fields to use 'em in renaming the file itself according to name and date.
        You should then check another post o'mine named "Epiphany" that I'll post in the "Meditations" section, as it's related to this script developing process...something i'm pretty silly!
        It also checks for already existing filenames and adds a roman number to distinguish between them.
        And, *WOW*, it's strict compliant! ;-)

        Thanks to everyone, ++ you all!

        Here's my code:
        #!/usr/bin/perl -w
        #
        # mailman
        #
        $\="\n";
        use strict;
        use HTML::TokeParser;
        use Date::Manip;
        $ARGV[0] || die "\n\tUsage: rp FILENAME\n\n";
        my $p = HTML::TokeParser->new("$ARGV[0]") || die "\nthe file $ARGV[0] doesn't exist.\n\n";