in reply to Parsing mhtml attachment reports
From reading the Short Summary of the MHTML Standard, I gather that the difference between HTML and MHTML (HTML embedded in email) is that MHTML URIs can point to other pieces of content embedded in the email. If you do not need to "follow" any links in the MHTML files, then you should be able to use the standard HTML-parsing modules.
I like to use HTML::TokeParser for very small and very large parsing jobs, because the module is both simple and efficient. For medium-sized documents, HTML::TreeBuilder yields much clearer code for many patterns of embedded data.
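To give a feel for the token-based style, here is a minimal sketch of pulling links out of a document with HTML::TokeParser. The sample HTML string and the `extract_links` helper are invented for illustration and are not taken from the original poster's data; the `cid:` link merely stands in for the kind of URI that MHTML uses to point at another part of the same message.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample document, invented for illustration only.
my $html = '<html><body><p>See <a href="cid:part1@example">the report</a> '
         . 'and <a href="http://example.com/">the site</a>.</p></body></html>';

# Hypothetical helper: returns a list of [href, link text] pairs.
sub extract_links {
    my ($doc) = @_;

    # A reference to a scalar tells the parser to treat it as the document text.
    my $parser = HTML::TokeParser->new(\$doc)
        or die "Cannot create parser";

    my @links;
    while (my $tag = $parser->get_tag('a')) {
        my $href = $tag->[1]{href};          # element 1 is the attribute hash
        next unless defined $href;
        my $text = $parser->get_trimmed_text('/a');
        push @links, [$href, $text];
    }
    return @links;
}

printf "%s => %s\n", @$_ for extract_links($html);
```

If you only need to list the links rather than follow them, nothing MHTML-specific is required here; the `cid:` URI is just another `href` value.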
However, the sample attachment that you posted does not look like HTML at all; it is just plain text. If your .mhtml files really *are* plain text, then this (moderately tested) code should get you started:
    #!/usr/bin/perl -w
    use strict;
    use File::Basename;
    use MIME::Parser;

    my $out_dir       = '/tmp';
    my $in_file       = '/tmp/email-test/AuthenticationReport.msg';
    my $output_prefix = basename($0);

    my $parser = MIME::Parser->new();
    $parser->output_dir($out_dir);
    $parser->output_prefix($output_prefix);
    $parser->output_to_core();

    my $root_entity = $parser->parse_open($in_file)
        or die "couldn't open/parse MIME stream";

    my @entities = recurse_through_entities($root_entity);

    my $ip_pat   = qr{\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}};
    my $date_pat = qr{\d{4}-\d{2}-\d{2}};
    my $time_pat = qr{\d{2}\.\d{2}\.\d{2}};

    foreach my $entity (@entities) {
        my $file = $entity->bodyhandle->path
            or warn and next;
    #   next unless $file =~ m{\.mhtml$}i;
        open MHTML, $file or die;

        my $no_data;
        my $address = '';
        my $date    = '';
        while (<MHTML>) {
            $address = $1 if m{^\s+address ($ip_pat)};
            $date    = $1 if m{^Disabled Logins for ($date_pat)};
            last if m{^\s+Time\s+Address\s*$};
            $no_data=1, last if m{^\s*No Data\s*$};
        }

        if ($no_data) {
            print "$file: No Data!\n";
        }
        else {
            while (<MHTML>) {
                chomp;
                m{^\s+($time_pat)\s+($ip_pat)\s*$}
                    or warn "Parse failed! '$file' - '$_' "
                    and next;
                my $time = $1;
                my $ip   = $2;
                print "$file: Time='$time' IP='$ip'\n";
            }
        }
        close MHTML or die;
    }

    # Flatten out any multi-level hierarchies of entities.
    sub recurse_through_entities {
        my $ent = shift;

        my @parts = $ent->parts;

        if (@parts) {
            return map { recurse_through_entities($_) } @parts;
        }
        else {
            return $ent;
        }
    }
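If you want to check the row pattern against your own data before running the whole script on a mail file, something like the following sketch may help. It reuses `$ip_pat` and `$time_pat` from the script above; the `parse_row` helper and the two sample rows are made up for illustration, so substitute real lines from your reports.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same patterns as in the script above.
my $ip_pat   = qr{\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}};
my $time_pat = qr{\d{2}\.\d{2}\.\d{2}};

# Hypothetical helper: returns (time, ip) for a data row,
# or an empty list if the line does not look like one.
sub parse_row {
    my ($line) = @_;
    return $line =~ m{^\s+($time_pat)\s+($ip_pat)\s*$} ? ($1, $2) : ();
}

# Hypothetical rows, invented for illustration only.
my @rows = (
    "   10.15.42   192.168.0.7",
    "   not a data row",
);

for my $row (@rows) {
    my ($time, $ip) = parse_row($row);
    if (defined $time) {
        print "Time='$time' IP='$ip'\n";
    }
    else {
        print "Parse failed: '$row'\n";
    }
}
```

Note that the pattern expects leading whitespace and dots (not colons) between the time fields, so rows in a different layout will fall through to the failure branch.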
Re^2: Parsing mhtml attachment reports
by chanakya (Friar) on Jan 03, 2005 at 06:12 UTC