in reply to Renaming html email dumps according to sender and date

Here's my first stab.
#! perl
use strict;
use warnings;
use HTML::TokeParser;

my $p = HTML::TokeParser->new("sample.html") || die "Can't open: $!";
my %hash;

while (my $t = $p->get_token){
    # only <th class="DatTh"> start tags mark a header field
    next unless $t->[0] eq 'S' and $t->[1] eq 'th'
        and defined $t->[2]{'class'} and $t->[2]{'class'} eq 'DatTh';
    my $key = $p->get_text('/th');
    chop $key;                       # drop the trailing &nbsp;
    $p->get_tag('td');
    my $value = $p->get_text('/td');
    $value =~ s/^\s*//;
    $value =~ s/\s*\[Add\]\s*$//;    # drop the "[Add]" address-book link text
    $hash{$key} = $value;
}

open OUT, '>', 'out.txt' or die "Can't write out.txt: $!";
for my $key ( keys %hash ){
    print OUT "$key => $hash{$key}\n"
}
close OUT;
produces...
Subject => PMR446
To => sigmund@fastmail.fm
Date => Mon, 9 Aug 2004 10:28 AM  ( 59 mins 50 secs ago )
From => "Delboy" <delboyenterprises@hotmail.com>
simplified extract from html...
<table width="100%">
  <tr align="left">
    <th class="DatTh" width="0%">Date&nbsp;</th>
    <td class="DatTd" width="95%">Mon, 9 Aug 2004 10:28 AM &nbsp;<small>( 59 mins 50 secs ago )</small></td>
    <td class="DatTd" rowspan="3" align="center" valign="top" width="5%"><font size="-2"><a href="xxxxxxx">Text&nbsp;view</a><br><b>HTML&nbsp;view</b><br><a href="xxxx">Print&nbsp;view</a></font></td>
  </tr>
  <tr align="left">
    <th class="DatTh" width="0%">From&nbsp;</th>
    <td class="DatTd" width="95%">&quot;Delboy&quot; &lt;delboyenterprises@hotmail.com&gt; <a title="Add addresses to your address book" href="xxxx">[Add]</a></td>
  </tr>
  <tr align="left">
    <th class="DatTh" width="0%">To&nbsp;</th>
    <td class="DatTd" width="95%">sigmund@fastmail.fm<a title="Add addresses to your address book" href="xxxxx">[Add]</a></td>
  </tr>
  <tr align="left">
    <th class="DatTh" width="0%">Subject&nbsp;</th>
    <td class="DatTd" width="95%">PMR446</td>
    <td class="DatTd" align="center" width="5%"><font size="-2"><a href="xxxx">Show&nbsp;full&nbsp;header</a></font></td>
  </tr>
</table>

Re^2: Renaming html email dumps according to sender and date
by Mr_Jon (Monk) on Aug 09, 2004 at 18:01 UTC
    Just to add to this excellent reply: you can use the Date::Manip module to convert the extracted date string to YYMMDD format and so build a date-stamped output filename (after stripping the extra date info in brackets).

    Although this module could well be considered overkill for this simple task, it can deal with a huge variety of input formats, so it is ideal for a situation (like this one) where the expected data is subject to change outside your control.
    #!/usr/bin/perl -w
    use strict;
    $\ = "\n";
    use Date::Manip;
    &Date_Init("TZ=GMT","DateFormat=UK"); # or whatever
    while (<DATA>) {
        print UnixDate($_,"%y%m%d");
    }
    __DATA__
    Mon, 9 Aug 2004 10:28 AM
    9th August 2004 10:28
    10:28 9/8/2004
    Aug 9 2004 10:28
    10:28 09 August 2004
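
The "stripping the extra date info in brackets" step mentioned in this reply is not shown in the thread. A minimal sketch of one way to do it follows, assuming the Date field looks like the sample output in the root post and that `&nbsp;` has already been decoded to `\xA0`. The helper name `strip_relative_age` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): remove a trailing "( ... )"
# note and any surrounding whitespace from an extracted Date field.
sub strip_relative_age {
    my ($date) = @_;
    # \xA0 is listed explicitly because \s does not always match it
    # on plain byte strings.
    $date =~ s/[\s\xA0]*\(.*?\)[\s\xA0]*$//;
    $date =~ s/^[\s\xA0]+|[\s\xA0]+$//g;
    return $date;
}

my $raw = "Mon, 9 Aug 2004 10:28 AM \xA0( 59 mins 50 secs ago )";
print strip_relative_age($raw), "\n";   # prints: Mon, 9 Aug 2004 10:28 AM
```

The cleaned string could then be handed to `UnixDate` with the `"%y%m%d"` format, as in the snippet above.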
      Hello, brethren!
      This is what I did after your valuable advice.
      My mailman program now correctly parses that messy HTML and extracts the right fields, then uses them to rename the file itself according to name and date.
      You should also check another post of mine named "Epiphany" that I'll put in the "Meditations" section, as it's related to this script's development process... something I'm pretty silly about!
      The script also checks for already existing filenames and appends a Roman numeral to distinguish between them.
      And, *WOW*, it's strict-compliant! ;-)

      Thanks to everyone, ++ you all!

      Here's my code:
      #!/usr/bin/perl -w
      #
      # mailman
      #
      $\="\n";
      use strict;
      use HTML::TokeParser;
      use Date::Manip;

      $ARGV[0] || die "\n\tUsage: rp FILENAME\n\n";
      my $p = HTML::TokeParser->new("$ARGV[0]") || die "\nthe file $ARGV[0] doesn't exist.\n\n";
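
The script as posted stops after opening the file, so the renaming step itself is not visible here. The following is not a reconstruction of it, only a rough sketch of how a From field and a YYMMDD stamp might be combined into a filename of the `name` + six digits + `.htm` shape. Both helper names (`sender_label`, `mail_filename`) and the sanitising rules are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a From field into a short, filesystem-safe label.
sub sender_label {
    my ($from) = @_;
    # Prefer the display name that precedes <address> ...
    my ($name) = $from =~ /^\s*"?([^"<]+?)"?\s*</;
    # ... otherwise fall back to the local part of the address.
    ($name) = $from =~ /([^<>\s]+)@/ unless defined $name && length $name;
    $name = 'unknown' unless defined $name;
    $name =~ s/[^A-Za-z0-9]+/_/g;   # anything unusual becomes "_"
    $name =~ s/^_+|_+$//g;
    return lc $name;
}

# Hypothetical helper: join the label and a yymmdd stamp into a file name.
sub mail_filename {
    my ($from, $yymmdd) = @_;
    return sender_label($from) . $yymmdd . '.htm';
}

print mail_filename('"Delboy" <delboyenterprises@hotmail.com>', '040809'), "\n";
# prints: delboy040809.htm
```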
        A nice way of adding incremental Roman numerals, as you do at the end of your script, would be to use the Math::Roman module (handily mentioned in the Perl Cookbook). This would impose no arbitrary limit on the number of messages you receive from the same person - it even goes above the 'highest' Roman numeral of 5000.
        #!/usr/bin/perl -w
        use strict;
        use Math::Roman qw(roman);
        my $roman = new Math::Roman;
        $\ = "\n";
        while (<DATA>) {
            chomp;
            my $old_name = $_;
            if ($old_name =~ /^(.+\d{6})(\w+)?(\.htm)$/) {
                my $old_roman = $2 || 1;
                my $new_roman = roman("$old_roman") + 1;
                my $new_name = $1 . $new_roman . $3;
                print "$old_name => $new_name";
            }
        }
        __DATA__
        filename230804.htm
        filename230804IV.htm
        filename230804V.htm
        filename230804II.htm
        filename230804X.htm
        Output:
        filename230804.htm => filename230804II.htm
        filename230804IV.htm => filename230804V.htm
        filename230804V.htm => filename230804VI.htm
        filename230804II.htm => filename230804III.htm
        filename230804X.htm => filename230804XI.htm
        Of course, you could simply append 'normal' numbers to achieve the same result, but that wouldn't be half as much fun...
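
For comparison, the "normal numbers" variant might look roughly like the sketch below. It uses core Perl only. The helper name `next_free_name`, the `_` separator (added so the counter cannot run into the six-digit date) and the optional callback for the existence check are all assumptions, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return "$base$ext" if that name is unused, otherwise
# the first unused name among "${base}_2$ext", "${base}_3$ext", and so on.
# The optional third argument replaces the default on-disk check (-e),
# which makes the logic easy to try without touching the filesystem.
sub next_free_name {
    my ($base, $ext, $exists) = @_;
    $exists ||= sub { -e $_[0] };
    my $name = $base . $ext;
    my $n    = 1;
    $name = $base . '_' . ++$n . $ext while $exists->($name);
    return $name;
}

my %taken = map { $_ => 1 } qw(filename230804.htm filename230804_2.htm);
print next_free_name('filename230804', '.htm', sub { $taken{ $_[0] } }), "\n";
# prints: filename230804_3.htm
```

The counter starts at 2 for the first duplicate, mirroring the Roman version above, where the first clash becomes `II`.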