Sigmund has asked for the wisdom of the Perl Monks concerning the following question:
Their HTML has often changed to suit their UI tweaks, and this wouldn't be a big issue if I hadn't coded a short Perl program that does the following, in order:
- parse a dumped html file
- search for a line containing date and "From:" field
- extract the email of the sender or his/her name
- use this information to rename the file itself as "sender040809.html"
It also checks for duplicates and appends a progressive suffix (I, II, III, IV, V), and it looks up known senders in a hash and renames them accordingly. This way, if a sender has an address like, say, askdoedjpaauykja@mmmmmmmm.com, I can rename it to John simply by adding a line to the hash. But that interface has been tweaked so often that my script sometimes fails unexpectedly, and I cannot keep up with every small change.
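The alias lookup and the progressive suffix described above can be sketched on their own, separate from any HTML handling. This is only a minimal illustration, not the posted program: the helper names `base_name` and `free_name`, the single-entry alias table, and the choice to key the table on the part of the address before the `@` are all assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical alias table: raw sender id => short name.
my %alias = (
    'askdoedjpaauykja' => 'John',
);

# Suffixes tried in order; the first file gets no suffix at all.
my @roman = ('', qw(II III IV V));

# Build "senderYYMMDD"; fall back to the raw sender if no alias exists.
sub base_name {
    my ($sender, $year, $month, $day) = @_;
    my $name = exists $alias{$sender} ? $alias{$sender} : $sender;
    return sprintf '%s%02d%02d%02d', $name, $year % 100, $month, $day;
}

# Return the first free name among base.html, baseII.html, ... baseV.html.
# $taken is a code ref, so the check can be a real "-e" test or a stub.
sub free_name {
    my ($base, $taken) = @_;
    for my $suffix (@roman) {
        my $candidate = "$base$suffix.html";
        return $candidate unless $taken->($candidate);
    }
    return;    # all five variants are already in use
}

my $base = base_name('askdoedjpaauykja', 2004, 8, 9);
my $name = free_name($base, sub { -e $_[0] });
print defined $name ? "$name\n" : "no free name for $base\n";
```

Passing the existence check as a code ref keeps the suffix loop testable without touching the filesystem, and replaces the repeated `if (-e ...)` blocks with a single loop.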
So, is there (as I suspect) a way to parse these HTML files that reliably detects the sender's email address in a mess of tags, unambiguously detects the date, and merges the two to serve this purpose?
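One layout-independent way to approach this is to reduce the page to plain text first and only then look for an address and a date, instead of matching the markup around them. The sketch below uses core Perl only; a real HTML parser module from CPAN would likely be more robust for the text-extraction step. The helper names, the sample HTML, and the assumed "9 Aug 2004" date layout are illustrative assumptions, not taken from the actual dumps.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %month = (Jan => 1, Feb => 2, Mar => 3, Apr => 4,  May => 5,  Jun => 6,
             Jul => 7, Aug => 8, Sep => 9, Oct => 10, Nov => 11, Dec => 12);

# Crude text extraction: drop tags first, then decode the few entities
# that matter here, so "&lt;addr&gt;" survives the tag stripping.
sub html_to_text {
    my ($html) = @_;
    my %entity = (nbsp => ' ', '#160' => ' ', lt => '<',
                  gt   => '>', quot   => '"', amp => '&');
    $html =~ s/<[^>]*>/ /g;
    $html =~ s/&(nbsp|#160|lt|gt|quot|amp);/$entity{$1}/g;
    $html =~ s/\s+/ /g;
    return $html;
}

# Return (address, "YYMMDD"), or an empty list if either part is missing.
sub sender_and_date {
    my ($text) = @_;
    # First address that follows "From", whatever sits in between.
    my ($email) = $text =~ /From\b.*?([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)/s
        or return;
    # First "day Mon year" date (assumed layout, e.g. "9 Aug 2004").
    my ($d, $mon, $y) = $text =~
        /\b(\d{1,2}) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w* (\d{4})\b/
        or return;
    return ($email, sprintf('%02d%02d%02d', $y % 100, $month{$mon}, $d));
}

# Hypothetical fragment of a dumped message page.
my $html = '<tr><td><b>From:</b>&nbsp;"John" &lt;askdoedjpaauykja@mmmmmmmm.com&gt;</td></tr>'
         . '<tr><td><b>Date:</b>&nbsp;Mon, 9 Aug 2004 10:06 AM</td></tr>';

my ($email, $stamp) = sender_and_date(html_to_text($html));
print "$email $stamp\n" if defined $email;
```

Because the patterns run on text rather than on tags, changes to table structure or styling should not break them; a change to the date wording itself still would.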
I mean, is the HTML-parsing module really that powerful? Below is my ugly but handy program.

    #!/usr/bin/perl
    #
    # Mailman
    #
    $ARGV[0] || die "\n\tUsage: rp FILENAME\n\n";
    open (INF, "< $ARGV[0]") || die "\n\tFile does not exist.\n\n";

    # month abbreviation => two-digit month (padded with spaces)
    %mesi=(
        'Jan' => ' 01 ',
        'Feb' => ' 02 ',
        'Mar' => ' 03 ',
        'Apr' => ' 04 ',
        'May' => ' 05 ',
        'Jun' => ' 06 ',
        'Jul' => ' 07 ',
        'Aug' => ' 08 ',
        'Sep' => ' 09 ',
        'Oct' => ' 10 ',
        'Nov' => ' 11 ',
        'Dec' => ' 12 '
    );

    # known senders => short names
    %nomi=(
        'elenaleonardi' => 'ele',
        'lellobove' => 'cte',
        'dark.prg' => 'mazzini',
        'robertopar' => 'parolisi',
        'meloinfo' => 'melo',
        'massimogab' => 'max',
        'simofiore' => 'simone',
        'mpagnucci'=> 'pagnucci',
        'zeromega'=> 'pagnucci',
        'melojunior'=> 'melo',
        'mmelillo'=> 'melo',
        'marcomelillo'=> 'melo',
        'matteobagn'=> 'matteo',
        'uncoucou@fastmail.fm' => 'lorena',
        'zappagalattica' => 'zappa',
        'gavrilus' => 'zappa',
        'carloalbertodue' => 'carlone',
        'bugman996' => 'bug',
        'michel' => 'ziobudda',
        'Bagnoli' => 'matteo',
        'salciaiola@infinito.it' => 'stegualerci',
    );

    while (<INF>) {
        if (/From\&nbsp\;/ || /From\&\#160\;/ || /From /) {
            # reduce the line to sender and date
            s/.*From\&nbsp\;/From/;
            s/.*From\&\#160\;/From/;
            s/.*From /From/;
            s/\&nbsp\;/ /g;
            s/PM.*/PM\n/g;
            s/AM.*/AM\n/g;
            s/<[^>]+>/ /g;
            s/\&quot\;/\"/g;
            s/Date.+,//;
            s/\&lt\;/\</;
            s/\@.*\&gt\;/\>/;
            s/From\s+//;
            s/\s+/ /g;
            /, .*2003/;
            s/, //;
            s/200/0/;
            s/ (\d) /0$1 /;
            s/ ([A-Z][a-z][a-z]) /$mesi{$1}/;
            s/(\d\d) (\d\d) (\d\d)/$3$2$1/;
            s/\".+\"//;
            s/[\<\>]//g;
            s/(\d\d\d\d\d\d).+/$1\.html/;
            s/\s+//g;
            s/\"//g;
            print "Original: ";
            print;
            print "\n";
            $neunam=$_;
            s/\d+.html//;
            # swap in the short name if the sender is known
            if (exists $nomi{$_}) {
                $corr=$nomi{$_};
                $old=$_;
                $_=$neunam;
                s/$old/$corr/;
                $neunam=$_;
            }
            print "Renamed as: ";
            print $neunam;
            print "\n";
            last
        }
    }
    close (INF);

    # rename, adding II..V if the target name already exists
    rename $ARGV[0], $neunam if (! -e $neunam);
    if (-e $neunam) {
        $_=$neunam;
        s/.html/II.html/;
        $neunam=$_;
        rename $ARGV[0], $neunam if (! -e $neunam);
    }
    if (-e $neunam) {
        $_=$neunam;
        s/II.html/III.html/;
        $neunam=$_;
        rename $ARGV[0], $neunam if (! -e $neunam);
    }
    if (-e $neunam) {
        $_=$neunam;
        s/III.html/IV.html/;
        $neunam=$_;
        rename $ARGV[0], $neunam if (! -e $neunam);
    }
    if (-e $neunam) {
        $_=$neunam;
        s/IV.html/V.html/;
        $neunam=$_;
        rename $ARGV[0], $neunam if (! -e $neunam);
    }

janitored by ybiC: Balanced <readmore> tags around longish codeblock, also minor format tweaks for legibility
Replies are listed 'Best First'.

Re: Renaming html email dumps according to sender and date
- by wfsp (Abbot) on Aug 08, 2004 at 17:49 UTC
- by Sigmund (Pilgrim) on Aug 09, 2004 at 09:30 UTC
- by wfsp (Abbot) on Aug 09, 2004 at 10:06 UTC

Re: Renaming html email dumps according to sender and date
- by wfsp (Abbot) on Aug 09, 2004 at 14:37 UTC
- by Mr_Jon (Monk) on Aug 09, 2004 at 18:01 UTC
- by Sigmund (Pilgrim) on Aug 21, 2004 at 16:08 UTC
- by Mr_Jon (Monk) on Aug 23, 2004 at 17:22 UTC