in reply to Parsing mhtml attachment reports

From reading the Short Summary of the MHTML Standard, I gather that the difference between HTML and MHTML (HTML embedded in email) is that MHTML URIs can point to other pieces of content embedded in the email. If you do not need to "follow" any links in the MHTML files, then you should be able to use the standard HTML-parsing modules.

I like to use HTML::TokeParser for very small and very large parsing jobs, because the module is both simple and efficient. For medium-sized documents, HTML::TreeBuilder yields much clearer code for many patterns of embedded data.
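To give a feel for the TokeParser style, here is a minimal sketch that collects the text of every table cell, row by row. The sample markup and the helper name table_rows are invented for illustration (they are not taken from your report), and it assumes HTML::TokeParser is installed from CPAN.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup, made up for this sketch.
my $html = <<'END_HTML';
<table>
<tr><td nowrap>09:31:26</td><td>10.0.0.1</td></tr>
<tr><td nowrap>09:31:39</td><td>10.0.0.2</td></tr>
</table>
END_HTML

# Return one array ref of cell texts per <tr> in the document.
sub table_rows {
    my ($doc) = @_;
    my $p = HTML::TokeParser->new(\$doc)
        or die "Can't create parser";
    my @rows;
    while ( $p->get_tag('tr') ) {
        my @cells;
        # Stop at the next <td> or at the end of the current row.
        while ( my $tag = $p->get_tag('td', '/tr') ) {
            last if $tag->[0] eq '/tr';
            push @cells, $p->get_trimmed_text('/td');
        }
        push @rows, [@cells];
    }
    return @rows;
}

print join(' | ', @$_), "\n" for table_rows($html);
```

The same loop works unchanged on a file: pass a filename instead of a scalar reference to new().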

However, the sample attachment that you posted does not look like HTML at all; it is just plain text. If your .mhtml files really *are* plain text, then this (moderately tested) code should get you started:

#!/usr/bin/perl -w
use strict;
use File::Basename;
use MIME::Parser;

my $out_dir       = '/tmp';
my $in_file       = '/tmp/email-test/AuthenticationReport.msg';
my $output_prefix = basename($0);

my $parser = MIME::Parser->new();
$parser->output_dir($out_dir);
$parser->output_prefix($output_prefix);
$parser->output_to_core();

my $root_entity = $parser->parse_open($in_file)
    or die "couldn't open/parse MIME stream";
my @entities = recurse_through_entities($root_entity);

my $ip_pat   = qr{\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}};
my $date_pat = qr{\d{4}-\d{2}-\d{2}};
my $time_pat = qr{\d{2}\.\d{2}\.\d{2}};

foreach my $entity (@entities) {
    my $file = $entity->bodyhandle->path
        or warn and next;
#   next unless $file =~ m{\.mhtml$}i;
    open MHTML, $file
        or die;

    my $no_data;
    my $address = '';
    my $date    = '';
    while (<MHTML>) {
        $address = $1 if m{^\s+address ($ip_pat)};
        $date    = $1 if m{^Disabled Logins for ($date_pat)};
        last if m{^\s+Time\s+Address\s*$};
        $no_data=1, last if m{^\s*No Data\s*$};
    }

    if ($no_data) {
        print "$file: No Data!\n";
    }
    else {
        while (<MHTML>) {
            chomp;
            m{^\s+($time_pat)\s+($ip_pat)\s*$}
                or warn "Parse failed! '$file' - '$_' "
                and next;
            my $time = $1;
            my $ip   = $2;
            print "$file: Time='$time' IP='$ip'\n";
        }
    }
    close MHTML
        or die;
}

# Flatten out any multi-level hierarchies of entities.
sub recurse_through_entities {
    my $ent = shift;
    my @parts = $ent->parts;
    if (@parts) {
        return map { recurse_through_entities($_) } @parts;
    }
    else {
        return $ent;
    }
}
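To check the row-matching step on its own, the sketch below runs the same two patterns over a few sample lines. The sample lines and the helper name parse_row are invented for illustration; they are my guess at the report layout, not a copy of your attachment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same patterns as in the script above.
my $ip_pat   = qr{\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}};
my $time_pat = qr{\d{2}\.\d{2}\.\d{2}};

# Return (time, ip) for a data row, or an empty list if the line does not match.
sub parse_row {
    my ($line) = @_;
    return $line =~ m{^\s+($time_pat)\s+($ip_pat)\s*$} ? ($1, $2) : ();
}

# Hypothetical sample lines, made up for this sketch.
my @sample = (
    "   09.31.26   192.168.0.10",
    "   09:31:39   192.168.0.10",    # colons instead of dots
    "No Data",
);

for my $line (@sample) {
    my ($time, $ip) = parse_row($line);
    if (defined $time) {
        print "Time='$time' IP='$ip'\n";
    }
    else {
        print "no match: '$line'\n";
    }
}
```

Note that $time_pat as written only accepts dots between the fields. If your reports use colons, something like qr{\d{2}[.:]\d{2}[.:]\d{2}} would probably be the safer pattern.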

Re^2: Parsing mhtml attachment reports
by chanakya (Friar) on Jan 03, 2005 at 06:12 UTC
    Hi Util, the sample attachment I posted is not plain text; it arrives as "text/html".
    Dumping the attachment content yields the following output (sorry for pasting the HTML, please bear with me):

    ------=_Part_64_24417480.1094663686411
    Content-Type: text/plain; charset=us-ascii
    Content-Transfer-Encoding: BASE64
    ------=_Part_64_24417480.1094663686411
    Content-Type: application/octet-stream; name=User_Logins_UnitDaily_Test_User_09_08_2004_10_14_45_DAY2004-9-8_18.mhtml
    Content-Transfer-Encoding: BASE64
    Content-Disposition: attachment; filename=User_Logins_UnitDaily_Test_User_09_08_2004_10_14_45_DAY2004-9-8_18.mhtml
    snip...
    @^@^@^@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
    ^@Admin_Logins_UnitDaily_Test_User_09_08_2004_10_14_45_DAY2004-9-8_18.mhtml^@^@^@^@^@^@^@^@^
    @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Authenti^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
    ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
    ^@Admin_Logins_UnitDaily_Test_User_09_08_2004_10_14_45_DAY2004-9-8_18.mhtml
    ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@application/octet-stream^@
    ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@attachment^@
    <snip><snip>...
    Content-Type: multipart/related; boundary="----=_NextPart_000_0000_01C0D2F6.0C049AB0"; type="text/html"
    This is a multi-part message in MIME format.
    ------=_NextPart_000_0000_01C0D2F6.0C049AB0
    Content-Type:text/html; charset=utf8 Windows-1252
    Content-Transfer-Encoding: quoted-printable
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN">
    <html>
    <head>
    <title><font color=3Dred size=3D4>Firewall Reports</font> - Authentication/Login</title>
    <LINK REL=3DSTYLESHEET HREF=3D"/sgms/reports/reports.css" TYPE=3D"text/css">
    </head>
    <body bgcolor=3D"#FFFFFF">
    <SCRIPT LANGUAGE=3D"JavaScript">
    document.write("");
    document.write("<!--");
    var sWidth=3D725;
    var sHeight=3Dwindow.screen.availHeight-100;
    if(navigator.appName=3D=3D'Microsoft Internet Explorer') {
    window.moveTo(15,50);
    window.resizeTo(sWidth,sHeight);document.write("<div align=3D'center'><font size=3D'1' face=3D'Verdana, Arial, Helvetica, sans-serif' color=3D'#000000'><b>Scheduled ");
    document.write(" report for Cisco at IP address&nbsp;18.187.12.10 (Test1 - 0002321)</b></font></div>");
    document.write(" </td>");
    document.write(" <td width=3D'35'>&nbsp;</td>");
    document.write(" </tr>");
    document.write(" <tr>");
    document.write(" <td width=3D'20'>&nbsp;</td>");
    document.write(" <td width=3D'200'>&nbsp;</td>");
    }document.write(" <tr>");
    document.write(" <td bgcolor=3D'#CCCCCC' width=3D'759' align=3D'left'><font color=3D'#000000' face=3D'Verdana, Arial, Helvetica' size=3D'2'><b>&nbsp; Admin Logins for&nbsp;2004-09-08</b></font></td>");
    document.write(" </tr>");
    if(ver =3D=3D 4) {
    document.write("<td nowrap style=3D'white-space: nowrap;padding-right: 20px;padding-left: 20px;FONT-FAMILY: Helvetica, Arial, Times, Times New Roman;COLOR: #ffffff;BACKGROUND-COLOR: #003399;FONT-SIZE: 8pt;FONT-STYLE: normal;FONT-WEIGHT: normal;height: 20;' type=3D'text/css'> Time </td><td nowrap style=3D'white-space: nowrap;padding-right: 20px;padding-left: 20px;FONT-FAMILY: Helvetica, Arial, Times, Times New Roman;COLOR: #ffffff;BACKGROUND-COLOR: #003399;FONT-SIZE: 8pt;FONT-STYLE: normal;FONT-WEIGHT: normal;height: 20;' type=3D'text/css'>Source</td></tr>");
    }
    document.write("<td nowrap>09:31:26</td><td>192.168.128.2</td></tr>");
    document.write("<td nowrap>09:31:39</td><td>192.168.128.2</td></tr>");
    {
    document.write("<font class=3DtimezoneDisclaimer color=3D'#000000' face=3D'arial' size=3D'1'>&nbsp;&nbsp;* Reports generated based on data summarized on: 09/08/2004 16:47:38 UTC</font>");
    }
    document.write("<br>");
    document.write("<font face=3D'arial' size=3D'1' color=3D'#999999'>&nbsp;&nbsp;* Report generated in 0.047 secs. </font>");
    document.write("");
    document.write("");
    </SCRIPT>
    </body>
    </html>------=_NextPart_000_0000_01C0D2F6.0C049AB0
    Content-Type: image/gif
    Content-Transfer-Encoding: base64
    Content-Location: file:///C:/SGMS2/Tomcat/webapps/sgms/images/ubslogo.gif
    R0lGODlhYQAmAPQAAP///wMDA5+fn/8AAPLt7UpKSv9eXv+dnSgoKNPT03Nzc/8iIv////97e7u7
    u//MzAECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAyH/C01T
    T0ZGSUNFOS4wGAAAAAxtc09QTVNPRkZJQ0U5LjAQAiDF3gAh/wtNU09GRklDRTkuMBgAAAAMY21Q
    <snip><snip>....
    Can you let me know how to parse the required data from the above?

    Thanks in advance.