Hi brethren,
it's been a long time since I last posted on PerlMonks, but there's a problem I'd like to solve. I have an excellent webmail account on fastmail.fm, and I usually dump the messages I have checked as simple HTML files to keep as backups.

Their HTML has changed often to follow their UI tweaks. This wouldn't be a big issue if I hadn't written a short Perl program that does these things, in order:

- parse a dumped html file

- search for the line containing the date and the "From:" field

- extract the sender's email address or name

- use this information to rename the file itself as "sender040809.html"

It also checks for duplicates and appends a progressive numeral (II, III, IV, V), and it looks the sender up in a hash of known names and renames accordingly. This way, if someone has an address like askdoedjpaauykja@mmmmmmmm.com, I can rename the file as John simply by adding a line to the hash. But the interface has been tweaked so much that my script sometimes fails unexpectedly, and I cannot keep up with every small change.
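The alias lookup and the duplicate handling described above can be sketched separately from the parsing. This is only an illustration: `base_name`, `free_name` and the `%alias` table are hypothetical names, and the suffix order (no suffix first, then II to V) is taken from the posted program. The existence test is passed in as a code reference so the logic can be tried without touching the disk.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical alias table: sender key => short name used in file names.
my %alias = (
    'askdoedjpaauykja@mmmmmmmm.com' => 'John',
);

# Suffixes tried in order; the empty string means "no suffix yet".
my @suffix = ('', qw(II III IV V));

# Build the base name from a sender and a six-digit date stamp,
# replacing the sender with its alias when one is defined.
sub base_name {
    my ($sender, $stamp) = @_;
    my $who = exists $alias{$sender} ? $alias{$sender} : $sender;
    return $who . $stamp;
}

# Return the first candidate file name for which the supplied test says
# "not taken", or nothing at all when every suffix is already in use.
sub free_name {
    my ($base, $exists) = @_;
    for my $s (@suffix) {
        my $name = "$base$s.html";
        return $name unless $exists->($name);
    }
    return;
}

# Simulated directory contents, standing in for real files.
my %taken = map { $_ => 1 } ('John040809.html', 'John040809II.html');

my $name = free_name(
    base_name('askdoedjpaauykja@mmmmmmmm.com', '040809'),
    sub { $taken{ $_[0] } },
);
print "$name\n";    # prints John040809III.html
```

On real files the test would be `sub { -e $_[0] }`, followed by a `rename` to the returned name when one is defined.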

So, is there (as I suspect) a way to parse these HTML files that reliably finds the sender's email address in a mess of tags and unambiguously finds the date, so that the two can be merged into the file name?
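A minimal sketch of that idea, using only core Perl: flatten the markup to plain text first, then look for an address after the word "From" and for a date, without caring which tags surround them. The helper names `plain_text` and `sender_and_stamp`, the sample HTML, and the "9 Aug 2004" date shape (inferred from the substitutions in the program below) are all assumptions, not fastmail's actual markup.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The handful of entities the dumps are assumed to contain.
my %entity = (nbsp => ' ', '#160' => ' ', lt => '<', gt => '>',
              quot => '"', amp => '&');

# Month name => two-digit month number.
my %month;
@month{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)}
    = map { sprintf '%02d', $_ } 1 .. 12;

# Crude flattening: drop tags first, then decode entities, so that a
# decoded "<" or ">" is never mistaken for markup.
sub plain_text {
    my ($html) = @_;
    $html =~ s/<[^>]*>/ /g;
    $html =~ s/&(#?\w+);/exists $entity{$1} ? $entity{$1} : ' '/ge;
    $html =~ s/\s+/ /g;
    return $html;
}

# Return (local part of the sender address, yymmdd stamp),
# or nothing when either piece cannot be found.
sub sender_and_stamp {
    my ($html) = @_;
    my $text = plain_text($html);

    # First address that follows the word "From".
    my ($user) = $text =~ /\bFrom\b.*?([\w.+-]+)\@[\w-]+(?:\.[\w-]+)+/
        or return;

    # First date shaped like "9 Aug 2004" (assumed format).
    my ($d, $mon, $y) = $text =~ /\b(\d{1,2}) ([A-Z][a-z]{2}) \d\d(\d\d)\b/
        or return;
    return unless exists $month{$mon};

    return ($user, sprintf('%s%s%02d', $y, $month{$mon}, $d));
}

# Made-up sample line, for illustration only.
my $html = '<tr><td><b>From&nbsp;</b></td><td>&quot;John Doe&quot; '
         . '&lt;<a href="#">jdoe@example.com</a>&gt;</td>'
         . '<td>Date&#160;Mon, 9 Aug 2004 10:32 PM</td></tr>';

my ($user, $stamp) = sender_and_stamp($html);
print "$user$stamp.html\n";    # prints jdoe040809.html
```

The regex tag stripper is the weak point. A real parser from CPAN, for example the `as_text` method of HTML::TreeBuilder together with `decode_entities` from HTML::Entities, could replace `plain_text` and leave the rest unchanged.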

I mean, is the module for parsing HTML code that powerful? My ugly but handy program follows.
Thanks!

#!/usr/bin/perl
#
# Mailman
#
$ARGV[0] || die "\n\tUsage: rp FILENAME\n\n";
open (INF, "< $ARGV[0]") || die "\n\tFile does not exist.\n\n";

# month name => two-digit number (the spaces matter to the substitutions below)
%mesi=(
    'Jan' => ' 01 ', 'Feb' => ' 02 ', 'Mar' => ' 03 ', 'Apr' => ' 04 ',
    'May' => ' 05 ', 'Jun' => ' 06 ', 'Jul' => ' 07 ', 'Aug' => ' 08 ',
    'Sep' => ' 09 ', 'Oct' => ' 10 ', 'Nov' => ' 11 ', 'Dec' => ' 12 '
);

# sender as extracted => short name used in the file name
%nomi=(
    'elenaleonardi' => 'ele',
    'lellobove' => 'cte',
    'dark.prg' => 'mazzini',
    'robertopar' => 'parolisi',
    'meloinfo' => 'melo',
    'massimogab' => 'max',
    'simofiore' => 'simone',
    'mpagnucci'=> 'pagnucci',
    'zeromega'=> 'pagnucci',
    'melojunior'=> 'melo',
    'mmelillo'=> 'melo',
    'marcomelillo'=> 'melo',
    'matteobagn'=> 'matteo',
    'uncoucou@fastmail.fm' => 'lorena',
    'zappagalattica' => 'zappa',
    'gavrilus' => 'zappa',
    'carloalbertodue' => 'carlone',
    'bugman996' => 'bug',
    'michel' => 'ziobudda',
    'Bagnoli' => 'matteo',
    'salciaiola@infinito.it' => 'stegualerci',
);

while (<INF>) {
    if (/From\&nbsp\;/ || /From\&\#160\;/ || /From /) {
        # boil the line down to "<sender><yymmdd>.html"
        s/.*From\&nbsp;/From/;
        s/.*From\&\#160\;/From/;
        s/.*From /From/;
        s/\&nbsp;/ /g;
        s/PM.*/PM\n/g;
        s/AM.*/AM\n/g;
        s/<[^>]+>/ /g;
        s/\&quot;/\"/g;
        s/Date.+,//;
        s/&lt\;/\</;
        s/\@.*&gt\;/\>/;
        s/From\s+//;
        s/\s+/ /g;
        /, .*2003/;    # leftover match, has no effect
        s/, //;
        s/200/0/;
        s/ (\d) /0$1 /;
        s/ ([A-Z][a-z][a-z]) /$mesi{$1}/;
        s/(\d\d) (\d\d) (\d\d)/$3$2$1/;
        s/\".+\"//;
        s/[\<\>]//g;
        s/(\d\d\d\d\d\d).+/$1\.html/;
        s/\s+//g;
        s/\"//g;
        print "Original: ";
        print;
        print "\n";
        $neunam=$_;
        s/\d+\.html//;
        # swap the sender for its short name, if it has one
        if (exists $nomi{$_}) {
            $corr=$nomi{$_};
            $old=$_;
            $_=$neunam;
            s/\Q$old\E/$corr/;
            $neunam=$_;
        }
        print "Renamed as: ";
        print $neunam;
        print "\n";
        last
    }
}
close (INF);

defined $neunam || die "\n\tNo From line found.\n\n";

# take the first free name: name.html, nameII.html ... nameV.html
($base=$neunam) =~ s/\.html$//;
foreach $suffix ('', 'II', 'III', 'IV', 'V') {
    $neunam="$base$suffix.html";
    if (! -e $neunam) { rename $ARGV[0], $neunam; last }
}

janitored by ybiC: Balanced <readmore> tags around longish codeblock, also minor format tweaks for legibility


In reply to Renaming html email dumps according to sender and date by Sigmund
