Get CID inline attachments with MIME::Parser

sebastiannielsen has asked for the wisdom of the Perl Monks concerning the following question:

I have a postfix server which mailbox_command's into this perl script:

#!/usr/bin/perl

use MIME::Parser;
srand;

$parser = new MIME::Parser;
$foldername = time().int(rand(10)).int(rand(10)).int(rand(10)).int(rand(10)).int(rand(10));
mkdir("/var/smtp/my/recv/".$foldername);
$parser->output_dir("/var/smtp/my/recv/".$foldername);
$parser->tmp_dir("/var/smtp/tmp");
eval{$parser->parse(\*STDIN);};

It works very well, EXCEPT for mails which contain inline attachments. I need to get the CID of these attachments. It should work "unlimited" deep into the hierarchy, so even if the mail contains another mail which contains a third mail which contains... and so on, it should work.

The attachments are printed with their original filenames. Is there any way to:

-either convert the CID of an inline attachment (like "1290910643.25733.0.camel@sebastian-desktop", stored in a variable) to the filename MIME::Parser gave it (like "image.png").

-or get a list of all CIDs found in a mail, along with their corresponding filenames "invented" by MIME::Parser.

-or be able to input a filename created by MIME::Parser, and get its corresponding CID.

The best would be the second option: a list of all CIDs found in a MIME mail, with their corresponding filenames "created" by MIME::Parser. Note that MIME::Parser appends a suffix such as -1 to the name (like image-1.png) if there are multiple files with the same name. This also needs to be taken into consideration.

I can then write this list into a text file placed in the mail folder for the webmail system to pick up, for it to replace the CIDs in the IMG tags with their corresponding correct paths to the images.

I would also like to get a list of the MIME content types for all the attachments which MIME::Parser writes, along with their MIME::Parser filenames.

If you wonder, I'm building a webmail system in Perl which does NOT go through POP3. Instead, mail goes straight from the SMTP server into the webmail folder, already parsed and complete, so the only thing the webmail system has to do is display the mails and any attachments.

******************************************************************

Solved

******************************************************************

Solved it now thanks to Anonymous Monk. The script now looks like this, and it works very well:

#!/usr/bin/perl

use MIME::Parser;
use MIME::Base64;
srand;
$parser = new MIME::Parser;
$foldername = time().int(rand(10)).int(rand(10)).int(rand(10)).int(rand(10)).int(rand(10));
mkdir("/var/smtp/my/recv/".$foldername);
$parser->output_dir("/var/smtp/my/recv/".$foldername);
$parser->tmp_dir("/var/smtp/tmp");
$returnentity = eval{$parser->parse(\*STDIN);};
@writetofile = dump_entity($returnentity);

open(INDEXFILE, ">/var/smtp/my/recv/".$foldername."/mailindex.txt");
flock(INDEXFILE,2);
print INDEXFILE @writetofile;
close(INDEXFILE);

sub dump_entity {
    my ($entity) = @_;
    my @filedata = ();
    my @parts = $entity->parts;
    if (@parts) {
    my $i;
    foreach $i (0 .. $#parts) {
        push(@filedata, dump_entity($parts[$i]));
    }
    }
    else {
    my $filepath = $entity->bodyhandle->path;
    $filepath =~ s/^(.*)\/([^\/]*)$/$2/si;
    $filepath = encode_base64($filepath);
    $filepath =~ s/\n//sgi;
    $filepath =~ s/\r//sgi;
    $filepath =~ s/\t//sgi;

    my $fileid = encode_base64($entity->head->get('content-id'));
    $fileid =~ s/\n//sgi;
    $fileid =~ s/\r//sgi;
    $fileid =~ s/\t//sgi;
    my $filetype = encode_base64($entity->head->mime_type);
    $filetype =~ s/\n//sgi;
    $filetype =~ s/\r//sgi;
    $filetype =~ s/\t//sgi;
    push(@filedata, $filepath."::".$fileid."::".$filetype."\n");
    }
 return @filedata;
}

I get the attachments as binary files in $foldername, plus a convenient file called mailindex.txt in $foldername with the following content:

bXNnLTMyNTUwLTEudHh0::::dGV4dC9wbGFpbg==
bXNnLTMyNTUwLTIuaHRtbA==::::dGV4dC9odG1s
Ym90LnBuZw==::PDEyOTA5ODUxNTMuMzIxNDMuMC5jYW1lbEBzZWJhc3RpYW4tZGVza3RvcD4K::aW1hZ2UvcG5n

Decoded:

msg-32550-1.txt :: :: text/plain
msg-32550-2.html :: :: text/html
bot.png :: <1290985153.32143.0.camel@sebastian-desktop> :: image/png
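
For the webmail side, reading the index back could look roughly like the sketch below. It is only an illustration: `decode_index_line` is a made-up helper name, and the stripping of the trailing newline and the angle brackets is an assumption about what the reader wants (the stored Content-ID header value still carries both, as the base64 above shows).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);

# Split one "name::cid::type" line of mailindex.txt and decode its fields.
# Returns an empty list for lines that do not have exactly three fields.
# (decode_index_line is a hypothetical helper, not part of any module.)
sub decode_index_line {
    my ($line) = @_;
    chomp $line;
    # Base64 never contains ':', so "::" is a safe separator; the -1 limit
    # keeps empty fields such as a missing Content-ID.
    my @fields = split /::/, $line, -1;
    return unless @fields == 3;
    my ($name, $cid, $type) = map { decode_base64($_) } @fields;
    # The stored Content-ID still has the header's trailing newline and
    # angle brackets; strip both so it can be compared with a cid: URL.
    $cid =~ s/\s+\z//;
    $cid =~ s/\A<|>\z//g;
    return ($name, $cid, $type);
}

# The three index lines from the example mail.
my @index = (
    'bXNnLTMyNTUwLTEudHh0::::dGV4dC9wbGFpbg==',
    'bXNnLTMyNTUwLTIuaHRtbA==::::dGV4dC9odG1s',
    'Ym90LnBuZw==::PDEyOTA5ODUxNTMuMzIxNDMuMC5jYW1lbEBzZWJhc3RpYW4tZGVza3RvcD4K::aW1hZ2UvcG5n',
);
print join(' :: ', decode_index_line($_)), "\n" for @index;
```

Collecting the non-empty CIDs into a hash keyed by CID gives the webmail code a direct CID-to-filename lookup.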

msg-32550-2.html looks like this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 TRANSITIONAL//EN">
<HTML>
<HEAD>
  <META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=UTF-8">
  <META NAME="GENERATOR" CONTENT="GtkHTML/3.28.1">
</HEAD>
<BODY>
s&#246;n 2010-11-28 klockan 23:59 +0100 skrev sebastian:<BR>
<BLOCKQUOTE TYPE=CITE>
    <IMG SRC="cid:1290985153.32143.0.camel@sebastian-desktop" ALIGN="bottom" BORDER="0"><BR>
</BLOCKQUOTE>
test
</BODY>
</HTML>

As you see, it will be very easy to replace the "cid:1290985153.32143.0.camel@sebastian-desktop" src with something like http://www.mydomain.com/webmail/getattachment.cgi?attachment=bot.png
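
A rough sketch of that replacement step, under a few assumptions: `rewrite_cid_links` and `escape_query_value` are hypothetical helper names, the CID-to-filename hash is assumed to have been built from mailindex.txt already, and the regex simply rewrites every lowercase `cid:` token it finds rather than parsing the HTML properly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Percent-encode everything except unreserved URL characters.
# Assumes $value is a byte string, as file names read from disk are.
sub escape_query_value {
    my ($value) = @_;
    $value =~ s/([^A-Za-z0-9._~-])/sprintf('%%%02X', ord($1))/ge;
    return $value;
}

# Replace each cid: reference whose Content-ID is a key of %$cid_to_file
# with $base_url followed by the escaped file name. Unknown CIDs are kept.
sub rewrite_cid_links {
    my ($html, $cid_to_file, $base_url) = @_;
    $html =~ s{cid:([^"'\s>]+)}{
        my $cid = $1;
        exists $cid_to_file->{$cid}
            ? $base_url . escape_query_value($cid_to_file->{$cid})
            : "cid:$cid";
    }ge;
    return $html;
}

my %cid_to_file = ('1290985153.32143.0.camel@sebastian-desktop' => 'bot.png');
my $html = '<IMG SRC="cid:1290985153.32143.0.camel@sebastian-desktop" ALIGN="bottom" BORDER="0">';
print rewrite_cid_links(
    $html, \%cid_to_file,
    'http://www.mydomain.com/webmail/getattachment.cgi?attachment=',
), "\n";
```

Escaping the file name matters because MIME::Parser keeps the sender's attachment names, which may contain spaces or other characters that are not safe in a query string.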

Everything solved.


Replies are listed 'Best First'.
Re: Get CID inline attachments with MIME::Parser by afoken (Chancellor) on Nov 28, 2010 at 08:15 UTC
Scary.

You don't need to call srand. It can even make things worse.

Repeated calls to rand won't improve the quality of the random number, because `rand` is implemented as a pseudo-random number generator. Implicit rounding due to int makes the result even less random.

You don't test that the folder $foldername exists. This is a good thing, because you would otherwise create a race condition. But you also don't test if mkdir failed because the folder already exists (EEXIST). So you may end up writing several different mails into the same folder.

The whole idea of bypassing conventional e-mail handling looks just wrong:

- No one can access his/her e-mails except by your still-to-be-written web mailer. No IMAP, no POP3. (I prefer having my e-mails in my mail client, because every single web mailer I've ever seen just sucks in one way or the other.)
- You will have to spend a considerable amount of time in the webmail code to reconstruct what MIME::Parser did. Probably you will have to call MIME::Parser again. Double work, nothing won.
- Are you sure that you get the file permissions right? How do you prevent malicious users from reading other users' e-mails?
- Premature optimization, probably because you wrongly assume that you will need to parse each and every mail in your webserver. Often, you only need to parse the mail headers, because the user will never read the e-mail or its attachments, e.g. for spam or stalking mails.
- You need to have your SMTP server and your webmailer on the same machine, or at least both have to use a shared storage (NFS). At that point, things become really complicated, and you will end up doing something that maildir (see below) already does. Probably, you will do it wrong (because it is hard to do it right), and lose some e-mails.
- You will have a hard time switching to a different mail server because you have to re-integrate your hack into the new mail server.

I think you should have a look at IMAP::Client.

Did you know that you can very efficiently fetch just the headers of an e-mail via IMAP? This is optimal for a webmailer when you want to show the contents of a mailbox. Simply because you do less work for the same result. So, you should really use IMAP to access the e-mails.

Access via IMAP also allows you to cleanly separate mail server and webmail. It allows you to switch mail servers without having to touch a single bit of your webmailer code. You can even use your webmailer to access several different IMAP servers in parallel.

I think that you should run MIME::Parser only on demand, i.e. when displaying a mail in your webmailer. You could add a caching layer (i.e. inherit from MIME::Parser and add caching logic) that avoids calling MIME::Parser more than once for any given e-mail, storing not only the attachments, but the entire state of MIME::Parser into a (set of) files in a directory accessible only to the owner of the mail. Find out if Storable, Data::Dumper, or some other class can help you serialise and unserialise MIME::Parser objects. Benchmark that! The native e-mail format serialises the same data, and perhaps it is faster to have `MIME::Parser` parse that format again than to reconstruct a `MIME::Parser` object from `Storable` or `Data::Dumper`.

Sure, running MIME::Parser inside the webmailer slows down the webmailer. Benchmark how much it will slow down. With typical small mails (less than 1 MByte), I would bet that MIME::Parser is so fast that you won't notice any delay at all. For larger mails, do what others do when the user has to wait for a long-running action: Show an animation. In a web mailer, you would typically use an animated GIF, using a little bit of JS and CSS to show it overlaying the current page immediately before the mail view page is loaded. It doesn't speed things up, but it makes your users more patient.

If you really, really want to avoid IMAP at all costs, have a look at maildir, maildir++, and IMAPdir. They give you an instant "one e-mail == one file" solution, with no need for locking, and with unique file names. You can use the file name, minus the info (flags) part, as a key for caching already-processed e-mails.

Alexander

-- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
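
As an illustration of the cache-key idea for maildir files, a minimal sketch follows. The helper name `maildir_cache_key` and the example path are hypothetical; the only assumption about maildir itself is the usual naming scheme in which the info part is appended as ":2," plus flags (or ":1," for experimental info).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the stable part of a maildir file name: the base name with the
# ":2,<flags>" (or experimental ":1,<info>") suffix removed.
# (maildir_cache_key is a hypothetical helper, not part of any module.)
sub maildir_cache_key {
    my ($path) = @_;
    (my $name = $path) =~ s{\A.*/}{}s;   # keep only the base name
    $name =~ s/:[12],.*\z//s;            # drop the info part, if any
    return $name;
}

# Hypothetical path; the key stays the same when the flags change.
print maildir_cache_key('/home/my/Maildir/cur/1290985153.M1P2.host:2,RS'), "\n";
```

Because flags change when a mail is read or replied to, keying the cache on the full file name would invalidate entries for no reason.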
Re^2: Get CID inline attachments with MIME::Parser by sebastiannielsen (Initiate) on Nov 28, 2010 at 20:58 UTC
About srand: I don't know if this has been updated now, but when I first began learning Perl, I noticed that if I ran 2 instances of a script printing a sequence of random digits, both scripts would show the same sequence if srand; was not run at the beginning. Therefore, I have got used to calling srand; at the beginning whenever I'm going to use rand();

About repeated calls to rand: The repeated calls to rand are there to force leading zeroes in case I get a number like 000001. Perl would normally strip off all leading zeroes, leading to strange filenames. By calling int(rand(10)) repeatedly, I guarantee that the resulting number will have this number of digits. So if I wanted to generate a 10-digit number that is ALWAYS 10 digits, even if the number coming out is 1, I would run:

`$number = int(rand(10)).int(rand(10)).int(rand(10)).int(rand(10)).int(rand(10)).int(rand(10)).int(rand(10)).int(rand(10)).int(rand(10)).int(rand(10));`

That would force the number to the string "0000000001" if it gets that number, and not "1".

About testing if $foldername exists: Since the filename consists of time() and 5 random digits, a mail would only be written into the same folder as another one under the following prerequisites:

-2 or more mails must be finished by postfix in the exact same second.
-Both 5-digit random numbers must be exactly the same.

The chance of this happening is 0.001 %. You would have a higher chance of winning the lucky numbers on TV than of both of these prerequisites happening at the same time. And IF that happened, no harm would be done except for 2 mails getting merged into one.

The SMTP mail server IS located on the web server; they are running on the same machine! That's why I want to skip all the overhead of going internally through IMAP and POP3. I also prefer to block incoming IMAP and POP3 in the firewall for security reasons and only have ports 80, 25 and 53 open in the firewall.

The problem with parsing the mail when it is opened by the receiver is this: if someone sends you, let's say, a 50 MB mail with 49 MB of attachments, you might not want to download the attachments, but you still want to read the body of the mail. You would still have to wait until the attachments are parsed before the body can be opened. There is NO MIME parser in the webmail system. All the webmail system does is scan a folder of files and generate output based on that. MIME::Parser will have parsed everything when an e-mail has been received.

About permissions: I prefer to code the permission system myself. As you might see, the mail is placed in the /my/ folder. That's a user of the webmail system. When I have got everything running, I will make the system place the mail in the /$user/ folder, where $user is the part before the @. No malicious user can access other users' mail, since their login will make the webmail system read from "their" folder. There's no need to configure Unix permissions, since no unauthorized person has admin/physical access to the server machine.

About maildir: Maildir writes the mail to disk before it is parsed. That means parsing has to wait until the mail is fully delivered. By letting postfix stream the mail into the parser while the remote MTA is still writing to my postfix server, I can start parsing at the same time as the remote MTA sends "DATA" to me. This speeds things up. The mail is written to disk completely parsed and ready for the webmail system to pick up.

About switching mailservers: I selected postfix because it's efficient and it can stream the mail to a mailbox command's STDIN. If I switched mailservers, I would require that the new mailserver can do that. If I MUST switch to a non-compatible mailserver, it would be as easy as replacing parse(\*STDIN) in my script with parse_open($path_to_mailfile_in_mailserver), since most, if not all, SMTP servers write a MIME file somewhere.
Re^3: Get CID inline attachments with MIME::Parser by roboticus (Chancellor) on Nov 28, 2010 at 21:17 UTC
sebastiannielsen:

> About srand: I don't know if this has been updated now, but when I first began learning Perl, I noticed that if I ran 2 instances of a script printing a sequence of random digits, both scripts would show the same sequence if srand; was not run at the beginning. Therefore, I have got used to calling srand; at the beginning whenever I'm going to use rand();

A typical random number generator will generate the same sequence of numbers with the same initial state. It can be a blessing or a curse. Be sure to seed your random number generator when you want individual runs to be different. I typically use `time` for that.

> About repeated calls to rand: The repeated calls to rand are there to force leading zeroes in case I get a number like 000001. Perl would normally strip off all leading zeroes, leading to strange filenames. By calling int(rand(10)) repeatedly, I guarantee that the resulting number will have this number of digits. So if I wanted to generate a 10-digit number that is ALWAYS 10 digits, even if the number coming out is 1, I would run:
>
> `$number = int(rand(10)).int(rand(10)).int(rand(10)).int(rand(10)).int(rand(10)).int(rand(10)).int(rand(10)).int(rand(10)).int(rand(10)).int(rand(10));`
>
> That would force the number to the string "0000000001" if it gets that number, and not "1".

A simpler method would be:

`$number = sprintf "%010u", int(rand(10000000000));`

...roboticus
Re^4: Get CID inline attachments with MIME::Parser by Marshall (Canon) on Nov 28, 2010 at 23:24 UTC
Re^3: Get CID inline attachments with MIME::Parser by afoken (Chancellor) on Nov 29, 2010 at 13:35 UTC
> repeated calls to rand

sprintf

And by the way: why would it hurt to have directory names without leading zeros in front of the random number part? Separate timestamp and random number by some non-digit character and both parts can no longer collide. Given a Unix-based system, the random number doesn't even have to be an integer to be part of the filename. You could use one call to `rand()` to get a number between 0 and 1 with a lot of digits, and those use the full potential of the random number generator.

> About testing if $foldername exists: Since the filename consists of time() and 5 random digits, a mail would only be written into the same folder as another one under the following prerequisites:
> -2 or more mails must be finished by postfix in the exact same second.
> -Both 5-digit random numbers must be exactly the same.
> The chance of this happening is 0.001 %.

Testing that `mkdir` did not fail with EEXIST gives you a collision chance of exactly zero. Whenever you see EEXIST, generate a new random filename and try again.

Not testing at all that `mkdir` succeeded can cause subsequent errors. If you omit further error checks, this may end in data loss. `open()` in your updated posting has no traces of error checks, neither `or die` nor `use autodie`. `$parser->parse()` is even wrapped in an `eval {}`, but no code checks `$@` or the state of `$parser` after that.

On the servers I use, it is quite possible that two instances of the mail server run in parallel and each deliver one e-mail in exactly the same second.

Are you sure that the probability of generating two identical strings from two runs of `int(rand(10)).int(rand(10)).int(rand(10)).int(rand(10)).int(rand(10))` is just 0.001 %? If `rand()` were completely fair, random, and independent of past results, the probability for each of the possible combinations from 00000 to 99999 would be equal. So you would have a chance of 1 in 100000 for each combination. But `rand()` is a pseudo random number generator, where each result depends on the internal state of the PRNG. Combined with the massive rounding due to `int`, I guess that the combinations are not equally distributed, and so the collision probability is higher. Did you know that the PRNG on some perl interpreters has just 15 bits, i.e. 32768 different "random" numbers?

> The SMTP mail server IS located on the web server; they are running on the same machine! That's why I want to skip all the overhead of going internally through IMAP and POP3.

Sure, it is now. But that approach won't scale when you need to support more users than the machine can handle. Being able to separate mail and web services to two or more different machines would help you. For that, you would need a clear distinction between both. IMAP could clearly separate both services.

> I also prefer to block incoming IMAP and POP3 in the firewall for security reasons and only have ports 80, 25 and 53 open in the firewall.

And this is relevant because ...? Given your code that lacks error checks and taint mode while processing data from untrustworthy sources, I guess that attacks via HTTP or SMTP are quite possible. Port filtering won't help at all. And if you run an ancient version of BIND on port 53, your server is very likely already rooted.

Disabling all unused services is a good idea, because it reduces the risk of being attacked. But still, you could use IMAP here, simply by configuring the imapd to listen only to connections from localhost. Should your needs grow, you could connect mail server and web server by a cable between two dedicated network cards, and make imapd listen only on the address assigned to that card.

> The problem with parsing the mail when it is opened by the receiver is this: if someone sends you, let's say, a 50 MB mail with 49 MB of attachments, you might not want to download the attachments, but you still want to read the body of the mail. You would still have to wait until the attachments are parsed before the body can be opened.

This is probably a limitation of MIME::Parser. But it is not a generic limitation of the e-mail system as we currently use it. You can stop parsing the e-mail at any arbitrary point and use what you got so far. You don't have to process attachments to see the mail body. You may need to decode some or all attachments if the mail body is HTML and refers to some or all attachments.

> About permissions: I prefer to code the permission system myself. As you might see, the mail is placed in the /my/ folder. That's a user of the webmail system. When I have got everything running, I will make the system place the mail in the /$user/ folder, where $user is the part before the @. No malicious user can access other users' mail, since their login will make the webmail system read from "their" folder. There's no need to configure Unix permissions, since no unauthorized person has admin/physical access to the server machine.

Good luck. Your attempts at securing the system don't look very promising. Given your setup, all a bad guy needs is a single bug in any of the applications running on the web server, and he has access to all mails on the server. Unix permissions could help you prevent that.

> About maildir: Maildir writes the mail to disk before it is parsed. That means parsing has to wait until the mail is fully delivered.

How would you display attachments that have not yet been parsed? Right, that won't work. So you have to wait for the entire mail, no matter what happens.

> About switching mailservers: I selected postfix because it's efficient and it can stream the mail to a mailbox command's STDIN. If I switched mailservers, I would require that the new mailserver can do that.

Most mailservers can deliver to procmail or a procmail replacement via a pipe. But that's not the point. Tight integration into the mail server will make it much harder to switch to a different mail server when your current mail server can't handle your future requirements.

Alexander

-- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
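
The retry-on-EEXIST advice could be sketched as follows. This is an illustration under stated assumptions, not the poster's code: `make_unique_dir`, the retry limit of 100, the "time-random" name format and the 0700 mode are all made up for the example, and the usage lines create folders under a temporary directory rather than /var/smtp/my/recv.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Errno qw(EEXIST);
use File::Temp qw(tempdir);

# Create a new, uniquely named directory below $base and return its path.
# mkdir itself is the existence test, so there is no check-then-create race.
# (make_unique_dir and its defaults are hypothetical, for illustration only.)
sub make_unique_dir {
    my ($base, $max_tries) = @_;
    $max_tries ||= 100;
    for (1 .. $max_tries) {
        # A non-digit separator keeps timestamp and random part apart.
        my $dir = sprintf '%s/%d-%05d', $base, time(), int(rand(100_000));
        return $dir if mkdir $dir, 0700;
        # Only "already exists" is worth retrying; anything else is fatal.
        die "mkdir $dir failed: $!\n" unless $! == EEXIST;
    }
    die "no free directory name found below $base after $max_tries tries\n";
}

# Usage: a throwaway base directory stands in for the real mail folder.
my $base = tempdir(CLEANUP => 1);
my $dir  = make_unique_dir($base);
print "created $dir\n";
```

Passing 0700 to mkdir also keeps the folder readable only by the delivering user, which touches on the permissions point raised in the thread.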
Re: Get CID inline attachments with MIME::Parser by Anonymous Monk on Nov 28, 2010 at 04:24 UTC
http://search.cpan.org/dist/MIME-tools/MANIFEST examples/mimedump examples/mimeexplode
Re^2: Get CID inline attachments with MIME::Parser by sebastiannielsen (Initiate) on Nov 28, 2010 at 23:29 UTC
Thanks. Your solution solved the problem.
Re: Get CID inline attachments with MIME::Parser by Anonymous Monk on Nov 28, 2010 at 04:26 UTC
> If you wonder, I'm building a webmail system in Perl which does NOT go through POP3. Instead, mail goes straight from the SMTP server into the webmail folder, already parsed and complete, so the only thing the webmail system has to do is display the mails and any attachments.

So an awful lot like the Mail::Box etc