Parsing mhtml attachment reports

chanakya has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
I'm a newbie to Perl, and I seek your wisdom on the following problem:

I want to parse incoming emails, each of which has three attachments. At first I assumed that the attachments were in text/html format,
but I later found that they are in ".mhtml" format.
I wrote a script that parses the email, gets the attachment names, and writes the content type to a file in /tmp:


#!/usr/local/bin/perl -w

use MIME::Parser;
use File::Basename;

use strict;

# derive the base filenames of extracted parts from
# the name of the script. 
my ($parsed) = (basename($0))[0];

#Create a Parser object
my $parser = MIME::Parser->new();

# output directory for parsed files
$parser->output_dir("/tmp");

# Basenames for parsed files
$parser->output_prefix($parsed);

$parser->output_to_core();

open(INPUT, "/tmp/email-test/AuthenticationReport.msg")
    or die("Input error: $!");

my $entity = $parser->read(\*INPUT)
    or die "couldn't parse MIME stream";
close(INPUT);

# Tell us about the MIME entities! this is for debugging

$entity->dump_skeleton;

Running the above script produces a file in /tmp with the following contents:

------=_Part_64_24417480.1094663686411
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: BASE64

------=_Part_64_24417480.1094663686411
Content-Type: application/octet-stream; 
        name=User_Logins_UnitDaily_Test_User_09_08_2004_10_14_45_DAY2004-9-8_18.mhtml
Content-Transfer-Encoding: BASE64
Content-Disposition: attachment; 
        filename=User_Logins_UnitDaily_Test_User_09_08_2004_10_14_45_DAY2004-9-8_18.mhtml

------=_Part_64_24417480.1094663686411
Content-Type: application/octet-stream; 
        name=Admin_Logins_UnitDaily_Test_User_09_08_2004_10_14_45_DAY2004-9-8_18.mhtml
Content-Transfer-Encoding: BASE64
Content-Disposition: attachment; 
        filename=Admin_Logins_UnitDaily_Test_User_09_08_2004_10_14_45_DAY2004-9-8_18.mhtml

------=_Part_64_24417480.1094663686411
Content-Type: application/octet-stream; 
        name=Disabled_Logins_UnitDaily_Test_User_09_08_2004_10_14_45_DAY2004-9-8_18.mhtml
Content-Transfer-Encoding: BASE64
Content-Disposition: attachment; 
        filename=Disabled_Logins_UnitDaily_Test_User_09_08_2004_10_14_45_DAY2004-9-8_18.mhtml


------=_Part_64_24417480.1094663686411--

Moving on, I want to parse the three attachments and insert the data within them into the database (Oracle).
The data in each attachment comes in a fixed format that looks something like the sample below.
The following is a sample for the attachment "Disabled_Logins_UnitDaily....":


            Cisco Reports
            -------------
        Scheduled report for Cisco at IP
        address 18.187.12.10(Test1 - 0002321)
-------------------------------------------------------------------

Disabled Logins for 2004-10-02

        Time        Address
        10.10.58    18.188.10.12
        10.12.34    17.199.13.100

If there is no activity, the attachment displays "No Data" instead of "Time" and "Address".
The fields that I need to extract are the IP address in the title, the date, the time, and the address.
Can anyone tell me how to parse a ".mhtml" attachment and extract the required data?

Thanks in advance.


Re: Parsing mhtml attachment reports by Util (Priest) on Jan 02, 2005 at 20:45 UTC
From reading the Short Summary of the MHTML Standard, I gather that the difference between HTML and MHTML (HTML embedded in email) is that MHTML URIs can point to other pieces of content embedded in the email. If you do not need to "follow" any links in the MHTML files, then you should be able to use the standard HTML-parsing modules. I like to use HTML::TokeParser for very small and very large parsing jobs, because the module is both simple and efficient. For medium-sized documents, HTML::TreeBuilder yields much clearer code for many patterns of embedded data.

However, the sample attachment that you posted does not look like HTML at all; it is just plain text. If your .mhtml files really are plain text, then this (moderately tested) code should get you started:

#!/usr/bin/perl -w
use strict;
use File::Basename;
use MIME::Parser;

my $out_dir       = '/tmp';
my $in_file       = '/tmp/email-test/AuthenticationReport.msg';
my $output_prefix = basename($0);

my $parser = MIME::Parser->new();
$parser->output_dir($out_dir);
$parser->output_prefix($output_prefix);
$parser->output_to_core();

my $root_entity = $parser->parse_open($in_file)
    or die "couldn't open/parse MIME stream";

my @entities = recurse_through_entities($root_entity);

my $ip_pat   = qr{\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}};
my $date_pat = qr{\d{4}-\d{2}-\d{2}};
my $time_pat = qr{\d{2}\.\d{2}\.\d{2}};

foreach my $entity (@entities) {
    # Skip (with a warning) any part that was not written to a file.
    my $file = $entity->bodyhandle->path
        or warn and next;
    # next unless $file =~ m{\.mhtml$}i;

    # Parenthesized so that "or die" applies to open, not to $file.
    open(MHTML, '<', $file) or die "Cannot open '$file': $!";

    my $no_data;
    my $address = '';
    my $date    = '';

    # Read the report header up to the column titles or "No Data".
    while (<MHTML>) {
        $address = $1 if m{^\s+address ($ip_pat)};
        $date    = $1 if m{^Disabled Logins for ($date_pat)};
        last         if m{^\s+Time\s+Address\s*$};
        $no_data=1, last if m{^\s*No Data\s*$};
    }

    if ($no_data) {
        print "$file: No Data!\n";
    }
    else {
        # Everything after the column titles should be "time address" rows.
        while (<MHTML>) {
            chomp;
            m{^\s+($time_pat)\s+($ip_pat)\s*$}
                or warn "Parse failed! '$file' - '$_'\n" and next;
            my $time = $1;
            my $ip   = $2;
            print "$file: Time='$time' IP='$ip'\n";
        }
    }
    close MHTML or die;
}

# Flatten out any multi-level hierarchies of entities.
sub recurse_through_entities {
    my $ent   = shift;
    my @parts = $ent->parts;

    if (@parts) {
        return map { recurse_through_entities($_) } @parts;
    }
    else {
        return $ent;
    }
}
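
For the case where the attachments do turn out to be HTML, here is a minimal sketch of the HTML::TokeParser approach mentioned above. The table in `$html` and the helper name `table_rows` are made up for illustration; real reports may be laid out differently, and markup that is only emitted from JavaScript strings would probably have to be pulled out of the script first.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical, simplified report table (an assumption, not real report output).
my $html = <<'END';
<table>
<tr><td nowrap>Time</td><td>Source</td></tr>
<tr><td nowrap>09:31:26</td><td>192.168.128.2</td></tr>
<tr><td nowrap>09:31:39</td><td>192.168.128.2</td></tr>
</table>
END

# Return one array ref of cell texts per <tr> found in the HTML string.
sub table_rows {
    my ($doc) = @_;
    my $p = HTML::TokeParser->new( \$doc )
        or die "Cannot create parser for HTML string";

    my @rows;
    while ( $p->get_tag('tr') ) {
        my @cells;
        # Collect <td> cells until the closing </tr> (or end of input).
        while ( my $tag = $p->get_tag( 'td', '/tr' ) ) {
            last if $tag->[0] eq '/tr';
            push @cells, $p->get_trimmed_text('/td');
        }
        push @rows, \@cells;
    }
    return @rows;
}

my ( $header, @data ) = table_rows($html);
print "Time='$_->[0]' IP='$_->[1]'\n" for @data;
```

The first row is treated as the column titles and skipped; whether that holds for the real reports is something to check against an actual attachment.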
Re^2: Parsing mhtml attachment reports by chanakya (Friar) on Jan 03, 2005 at 06:12 UTC
Hi Util,

The sample attachment posted is not in plain text. It comes as "text/html". Dumping the attachment content yields the following output (I'm sorry to paste the html, bear with me):

------=_Part_64_24417480.1094663686411
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: BASE64

------=_Part_64_24417480.1094663686411
Content-Type: application/octet-stream;
        name=User_Logins_UnitDaily_Test_User_09_08_2004_10_14_45_DAY2004-9-8_18.mhtml
Content-Transfer-Encoding: BASE64
Content-Disposition: attachment;
        filename=User_Logins_UnitDaily_Test_User_09_08_2004_10_14_45_DAY2004-9-8_18.mhtml

snip...

@^@^@^@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@Admin_Logins_UnitDaily_Test_User_09_08_2004_10_14_45_DAY2004-9-8_18.mhtml^@^@^@^@^@^@^@^@^
@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Authenti^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@Admin_Logins_UnitDaily_Test_User_09_08_2004_10_14_45_DAY2004-9-8_18.mhtml
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@application/octet-stream^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@attachment^@

<snip><snip>...

Content-Type: multipart/related;
        boundary="----=_NextPart_000_0000_01C0D2F6.0C049AB0";
        type="text/html"

This is a multi-part message in MIME format.

------=_NextPart_000_0000_01C0D2F6.0C049AB0
Content-Type:text/html; charset=utf8 Windows-1252
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN">
<html>
<head>
<title><font color=3Dred size=3D4>Firewall Reports</font> - Authentication/Login</title>
<LINK REL=3DSTYLESHEET HREF=3D"/sgms/reports/reports.css" TYPE=3D"text/css">
</head>
<body bgcolor=3D"#FFFFFF">
<SCRIPT LANGUAGE=3D"JavaScript">
document.write("");
document.write("<!--");
var sWidth=3D725;
var sHeight=3Dwindow.screen.availHeight-100;
if(navigator.appName=3D=3D'Microsoft Internet Explorer') {
window.moveTo(15,50);
window.resizeTo(sWidth,sHeight);document.write("<div align=3D'center'><font size=3D'1' face=3D'Verdana, Arial, Helvetica, sans-serif' color=3D'#000000'><b>Scheduled ");
document.write(" report for Cisco at IP address 18.187.12.10 (Test1 - 0002321)</b></font></div>");
document.write(" </td>");
document.write(" <td width=3D'35'> </td>");
document.write(" </tr>");
document.write(" <tr>");
document.write(" <td width=3D'20'> </td>");
document.write(" <td width=3D'200'> </td>");
}document.write(" <tr>");
document.write(" <td bgcolor=3D'#CCCCCC' width=3D'759' align=3D'left'><font color=3D'#000000' face=3D'Verdana, Arial, Helvet ica' size=3D'2'><b>  Admin Logins for 2004-09-08</b></font></td>");
document.write(" </tr>");
if(ver =3D=3D 4) {
document.write("<td nowrap style=3D'white-space: nowrap;padding-right: 20px;padding-left: 20px;FONT-FAMILY: Helvetica, Arial, Times, Times New Roman;COLOR: #ffffff;BACKGROUND-COLOR: #003399;FONT-SIZE: 8pt;FONT-STYLE: normal;FONT-WEIGHT: normal;height: 20;' type=3D'text/css'> Time </td><td nowrap style=3D'white-space: nowrap;padding-right: 20px;padding-left: 20px;FONT-FAMILY: Helvetica, Arial, Times, Times New Roman;COLOR: #ffffff;BACKGROUND-COLOR: #003399;FONT-SIZE: 8pt;FONT-STYLE: normal;FONT-WEIGHT: normal;height: 20;' type=3D'text/css'>Source</td></tr>");
}
document.write("<td nowrap>09:31:26</td><td>192.168.128.2</td></tr>");
document.write("<td nowrap>09:31:39</td><td>192.168.128.2</td></tr>");
{
document.write("<font class=3DtimezoneDisclaimer color=3D'#000000' face=3D'arial' size=3D'1'>  * Reports generated based on data summarized on: 09/08/2004 16:47:38 UTC</font>");
}
document.write("<br>");
document.write("<font face=3D'arial' size=3D'1' color=3D'#999999'>&nbsp; * Report generated in 0.047 secs. </font>");
document.write("");
document.write("");
</SCRIPT>
</body>
</html>------=_NextPart_000_0000_01C0D2F6.0C049AB0
Content-Type: image/gif
Content-Transfer-Encoding: base64
Content-Location: file:///C:/SGMS2/Tomcat/webapps/sgms/images/ubslogo.gif

R0lGODlhYQAmAPQAAP///wMDA5+fn/8AAPLt7UpKSv9eXv+dnSgoKNPT03Nzc/8iIv////97e7u7
u//MzAECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAyH/C01T
T0ZGSUNFOS4wGAAAAAxtc09QTVNPRkZJQ0U5LjAQAiDF3gAh/wtNU09GRklDRTkuMBgAAAAMY21Q
<snip><snip>....

Can you let me know how to parse the required data from the above? Thanks in advance.
Re: Parsing mhtml attachment reports by Fletch (Bishop) on Jan 02, 2005 at 20:30 UTC
"mhtml" is a common extension for files containing HTML::Mason components. What you've shown looks like plain ASCII text (which was probably produced using Mason), so just write something to parse out the information you're interested in from the text.

my( $ip, $date, @data );
my $seen_time_address_header = undef;

while( <INFILE> ) {
    # IP address from the report title
    if( /address (\d+\.\d+\.\d+\.\d+)/ ) {
        $ip = $1;
    }
    if( /Logins for (\S+)/ ) {
        $date = $1;
    }
    if( /^\s+Time\s+Address/ ) {
        $seen_time_address_header = 1;
    }
    # After the column titles, each row is "time address"
    if( $seen_time_address_header and /((?:\d+\.?){3})\s+(\S+)/ ) {
        push @data, [ $1, $2 ];
    }
}
Re: Parsing mhtml attachment reports by Aristotle (Chancellor) on Jan 02, 2005 at 22:49 UTC
The temporary file is itself a multipart MIME message. Putting it through the MIME parser will give you pieces of regular HTML you can put through any parser. (I recommend a look at Ovid's HTML::TokeParser::Simple.)

(On closer look that's what Util is doing, except he didn't say so explicitly. So look at his code.)

Makeshifts last the longest.
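
To make the last step concrete for the text/html part quoted earlier in the thread, here is a rough sketch. It assumes the MIME parser has already isolated that part and handed over its still quoted-printable body as a string. The sub name `extract_logins` and the regexes are illustrative only; they are tailored to the posted sample (rows written via document.write as `<td nowrap>time</td><td>ip</td>`) and may need adjusting for other reports.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);    # core module

# Pull the device IP, report date and (time, source) rows out of one
# quoted-printable text/html body passed in as a string.
sub extract_logins {
    my ($qp_body) = @_;

    # "=3D" becomes "=", and soft line breaks ("=" at end of line) are joined.
    my $html = decode_qp($qp_body);

    my ($ip)   = $html =~ /IP address\s+(\d{1,3}(?:\.\d{1,3}){3})/;
    my ($date) = $html =~ /Logins for (\d{4}-\d{2}-\d{2})/;

    my @rows;
    while ( $html =~ m{<td nowrap>(\d{2}:\d{2}:\d{2})</td>\s*<td>(\d{1,3}(?:\.\d{1,3}){3})</td>}g ) {
        push @rows, [ $1, $2 ];
    }
    return { ip => $ip, date => $date, rows => \@rows };
}

# A few lines shortened from the sample posted above.
my $sample = <<'END';
document.write(" report for Cisco at IP address 18.187.12.10 (Test1 - 0002321)</b></font></div>");
document.write(" <td bgcolor=3D'#CCCCCC' width=3D'759' align=3D'left'><b>Admin Logins for 2004-09-08</b></td>");
document.write("<td nowrap>09:31:26</td><td>192.168.128.2</td></tr>");
document.write("<td nowrap>09:31:39</td><td>192.168.128.2</td></tr>");
END

my $report = extract_logins($sample);
print "IP=$report->{ip} Date=$report->{date}\n";
print "Time='$_->[0]' Source='$_->[1]'\n" for @{ $report->{rows} };
```

Plain regexes are used here only because the markup sits inside JavaScript strings; once the rows are available as ordinary HTML, a real HTML parser is the more robust choice.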