CalebH has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

I am having a slight problem with a script I have been working on to parse my emails on a Windows 7 machine.

The script is composed mostly of bits and pieces of other scripts I have found over the last few months. Originally, I was trying to have it accomplish two tasks - 1) Extract attachments to a specific directory as long as they are from a certain address, and 2) Extract all plaintext messages if they are also from specific addresses.

#!/usr/local/bin/perl -w use MIME::Parser; use IO::File; use Email::MIME::Attachment::Stripper; use File::Slurp qw(write_file); my $output_path = 'C:\\phone\\'; my $word_to_delete = "deleteme"; my $movedir = 'completed'; use Cwd qw(); @files = <*.eml>; foreach $file (@files) { my $parser = MIME::Parser->new(); $parser->output_dir($output_path); $parser->output_prefix('deleteme'); $parser->output_to_core(0); $parser->extract_nested_messages(1); $parser->parse_nested_messages(1); open(INPUT, "$file") or die("Input error: $!"); local $/ = undef; my $textstuff = <INPUT>; my $entity = $parser->parse_open($file) or die ("Input error: $!"); close (INPUT); my $pathx = Cwd::cwd(); my $header = $entity->head; $subject = $header->get('Subject'); $to = $header->get('To'); $from = $header->get('From'); $date = $header->get('Date'); $mime = $entity->mime_type; $effective = $entity->effective_type; print "[File]: $file...\n"; if ($from =~ "address" || $from =~ "address2" || $from =~ "address3" | +| $from =~ "address4" || $from =~ "address5" || $from =~ "address6" | +| $from =~ "address7") { if ($to =~ "address" || $to =~ "address2" || $to =~ "address3" || $to +=~ "address4" || $to =~ "address5" || $to =~ "address6" || $to =~ "ad +dress7") { if ($textstuff =~ /sent from my ipod/ig || $textstuff =~ /sent from my + iphone/ig) { print "Subject: $subject\nFrom: $from\nTo: $to\nDate: $date\nContent T +ype: content\nMime Type: $mime\nEffective Type: $effective\n"; open PART2, $file; while (<PART2>) { open (LOG, ">>C:\\phone\\MyStuff.txt"); $_ =~ s/Sent from my iPod//ig; $_ =~ s/Sent from my iPhone//ig; if ($entity->parts > 0) { print ">> Multi Part Message\n"; for (my $i=0;$i<$entity->parts;$i++) { my $subEntity = $entity->parts($i); print ">> MIME: ", $subEntity->mime_type," \n"; if ($subEntity->mime_type eq 'text/plain') { if (my $io = $subEntity->open("r")) { while (defined($_=$io->getline)) { $body2 = $_; $body2 =~ s/Sent from my iPod//ig; $body2 =~ s/Sent from my iPhone//ig; $body2 =~ s/\n//ig; $subject =~ s/\n//ig; print LOG "<a href=\"$body2\">$subject</a><p>\n\n"; } } } } } else { my $body = join "", @{$entity->body}; print ">> Not a multipart\n"; print $body; $body =~ s/Sent from my iPod//ig; $body =~ s/Sent from my iPhone//ig; $body =~ s/\n//ig; $subject =~ s/\n//ig; print LOG "<a href=\"$body\">$subject</a><p>\n\n"; close (LOG); }} close PART2; }} my $m = Email::MIME::Attachment::Stripper->new($textstuff, \'force_fil +ename' => 1); my $msg = $m->message; my @attachments = $m->attachments; foreach my $a(@attachments) { if ($a->{filename} ne "") { my $img = new IO::File "C:/phone/" . $a->{filename}, "w" or die "Can not create file! $!"; binmode $img; print $img $a->{payload}; print " -- Extracting -- : " . $a->{filename} . "..\n"; $entity->dump_skeleton; } } }if (!-d $movedir) { mkdir $movedir or die "Couldn't create $movedir - $!\n"; } if (-e $movedir) { print "Moving $pathx/$file to $pathx/$movedir/$file\n\n"; rename $pathx . "/" . $file, $pathx . "/" . $movedir . "/" . $file or +die "Couldn't move $file to $movedir - $!\n"; } unlink (glob("$output_path/*$word_to_delete*")) or warn "can't delete +files: $!"; $entity->dump_skeleton; }

Extracting attachments works in most cases (I still end up with files named like '1ddfdfdf' with no file type at times), and usually the text extraction will work as long as it has the ipod text above. The script parses to see if it was sent by my ipod touch or iphone, and in turn grabs links that I have sent myself to view later.

The problem comes when the file is a link from, as an example, address3 above. Address3 would be my old cell phone (not an iphone or my old ipod touch), and the email message will contain one part, which is text/plain. Since the 'sent from ipod' never matches, it never logs it. I have tried to edit the pattern matching part to start logging by making it into this:

if ($textstuff =~ /sent from my ipod/ig || $textstuff =~ /sent from my + iphone/ig || $from =~ /address3/ig) {

As well as

if ($textstuff =~ /sent from my ipod/ig || $textstuff =~ /sent from my + iphone/ig || $_ =~ /address3/ig) {

Unfortunately, it never seems to trigger the log opening up and adding the links it finds to the log. Any ideas on what would make this work better, and actually log the urls? Also, does anyone have any pointers or code changing recommendations on how to make it log any messages that match any of the from addresses, IF they don't have an attachment? The way I had it written previously, it would log all files, including the binary code, to a text file. It started to add up when I had over 300 attachments.

For anyone wondering, I am basically trying to build a bot of sorts to go through and extract old links, and I plan to use lwp simple or the Firefox mechanize module to crawl those websites and save a local copy for me.

Thanks all for your help!

Replies are listed 'Best First'.
Re: Perl MIME parser partially works with my code I wrote (code does not exist , there is no code)
by Anonymous Monk on Oct 03, 2013 at 07:48 UTC

    When you get stuck, and what you think should be matching is not matching, its time to ddumperBasic debugging checklist the data you're matching against ( $textstuff, $from ), so you can figure out if the problem is with the data (missing, not what you expected), or with your regex pattern ( m//matching or s///ubstitution )

    The script is composed mostly of bits and pieces of other scripts I have found over the last few months .... ny ideas on what would make this work better, and actually log the urls? make it log any messages that match any of the from addresses, IF they don't have an attachment? Thanks all for your help!

    Hello and congratulations, you got pretty far, you've got a prototype, but now you're a little bit stuck, its time to rethink your approach :) its time to start over :)

    Why? :) Because you have lines eleven indentation levels deep, there are too many variables around to keep track of ; you need more subroutines

    The way I would approach this problem/exercise/task, is to pretend the code you have written doesn't exist, grab a pencil and paper, and draw a few boxes , putting a goal into each box :) say

    Now that you have goals, start turning them into subroutines, so when you get stuck you can copy/paste only the subroutine which isn't working and concentrate on only fixing it -- easier than fixing entire program :)

    So you then write something like

    Or like

    or something like this, all depending on how complicated the matching/extracting is and how it needs to be grouped , which parts are common/similar/alike/reusable

    #!/usr/bin/perl -- use strict; use warnings; use MIME::Parser; ... Main( @ARGV ); exit( 0 ); sub Maincakes { ... my @files = get_files( $fromdir ); for my $file ( @files ){ iwant_iphone_links( $file ) or iwant_other_links( $file ) or iwant_pancake_links( $file ); } } sub iwant_iphone_links { ...; return $stop_or_keep_going } sub iwant_other_links { ...; return 1 } sub iwant_pancake_links { ...; return 0 } __END__

    More of this type of idea of rewriting your code in Re: RFC: beginner level script improvement (version control), skimmable code is the idea, more subs, more subs, more subs, more subs, more subs,

    More generic advice :) On debugging, verify everything, talk to teddybear ... checklists and more talking to yourself out loud is a pretty good debugging technique :) 1 / 2/3

      Great advice, and I've printed the script out to look it over and pick out the code I have so far to place in subroutines.

      As of now, I have changed some of the code to execute almost the entire code in a subroutine &ipod;, which I have set to match no matter what in a if/else statement (It runs the logger in both cases).

      if ($from =~ /address|address2/ig) { if ($to =~ /address|address2/ig) { if ($textstuff =~ /sent from my ipod/ig || $textstuff =~ /sent from my + iphone/ig) { print "From ME----\nSubject: $subject\nFrom: $from\nTo: $to\nDate: $da +te\nContent Type: content\nMime Type: $mime\nEffective Type: $effecti +ve\n"; &ipod; } else { if ($textstuff !~ /sent from my ipod/ig || $textstuff !~ /sent from my + iphone/ig) { print "----NOT ipod ----\nSubject: $subject\nFrom: $from\nTo: $to\nDat +e: $date\nContent Type: content\nMime Type: $mime\nEffective Type: $e +ffective\n"; &ipod; }

      It's ugly, but it gets the job done mostly. Unfortunately, it still suffers from only running through and prints only one instance of 'from ME' or '----NOT ipod' and then no longer prints any messages, other than the statement that it's moving files. Also, it never reaches the bottom statement to extract files using attachment:stripper, which is somewhat confusing.

      I would imagine that it's a problem with the foreach statement only executing once due to the fact that it stops printing, but since it executes all the other code (opens the log file, prints to it, executes the moving of files), I just have to try to track down where the problem is happening as well as why it's not executing the attachment stripper section of code.

      Thanks again for the reply, and once I can track down why the above is happening, I plan to redo it into subroutines (as well as all future scripts). :-)