
Ok, I'm playing with Mail::Box now. Before trying to parse emails I need to parse my mailbox, i.e. to fetch the individual emails. So I wrote a simple one-liner which counts the messages in my mbox file 'Mail/-default'.
$ time perl -MMail::Box::Manager -le '
    $m=Mail::Box::Manager->new;
    $f=$m->open(folder=>"Mail/-default");
    print scalar $f->messages
'
Unexpected end of header (C parser): charset="iso-8859-1"
3554

real 0m8.222s
user 0m7.928s
sys 0m0.200s
Oops! Mutt says there are 3549 messages in this file, not 3554... So I've written my own reader for the mbox format:
$ time ./mbox_scan Mail/-default
3549 Mail/-default

real 0m0.492s
user 0m0.408s
sys 0m0.080s
Hmm. About 16 times faster?! Wow. And correct: it found 3549 messages, just like mutt. So, maybe I misunderstand something about this world, but WHY is my pure-Perl parser so much faster than the C parser in a mature CPAN module? Ok, maybe Mail::Box does a lot of additional parsing which I don't do, maybe... but why does it produce incorrect results?

Here is the code of my parser, if anyone is interested (sorry, some lines run up to 80 columns):

#!/usr/bin/perl
use warnings;
use strict;

sub mbox_read {
    my ($fh) = @_;
    my ($from, $msg, $len, $lines, $line);
    # read 'From ' line:
    while ($from = <$fh>) {
        last if $from =~ /\AFrom /;
        die 'wrong mailbox format or current position'
            if $from !~ /\A\r?\n/;
    }
    # exit on EOF
    return if !defined $from;
    # read email header and detect body size if possible
    while ($line = <$fh>) {
        $msg .= $line;
        last if $line =~ /\A\r?\n/; # end of header
        if ($line =~ /\AContent-Length:\s+(\d+)\Z/i) {
            $len = $1;
        }
        elsif ($line =~ /\ALines:\s+(\d+)\Z/i) {
            $lines = $1;
        }
    }
    # read Content-Length: bytes or Lines: lines, if possible
    my $body_done = 0;
    if ($len || $lines) {
        my $read_bytes;
        if ($len) {
            $read_bytes = read $fh, $msg, $len, length($msg);
            die "read: $!" if !defined $read_bytes;
        }
        else {
            $read_bytes = length $msg;
            $msg .= scalar <$fh> for 1 .. $lines;
            $read_bytes = length($msg) - $read_bytes;
        }
        # check if we are really at the end of the message now,
        # or Content-Length: is wrong
        my $extra_bytes = read $fh, my $next_lines, 64;
        die "read: $!" if !defined $extra_bytes;
        if ($next_lines =~ /\A[\r\n]*From /
            || (eof($fh) && $next_lines =~ /\A[\r\n]*\z/)) {
            seek $fh, -$extra_bytes, 1 or die "seek: $!";
            $body_done = 1;
        }
        else {
            # the size was wrong: drop the body we read, rewind to the
            # end of the header and fall back to line-by-line reading
            # (a flag is used here instead of a goto into the block below)
            substr($msg, -$read_bytes, $read_bytes, q{});
            seek $fh, -($extra_bytes+$read_bytes), 1 or die "seek: $!";
        }
    }
    # else read line-by-line until 'From ' or EOF
    if (!$body_done) {
        while ($line = <$fh>) {
            if ($line !~ /\AFrom /) {
                $msg .= $line;
            }
            else {
                seek $fh, -length($line), 1 or die "seek: $!";
                last;
            }
        }
        # remove the last empty line, if any, because it usually belongs
        # to the mbox format rather than to the message body
        $msg =~ s/^\r?\n\z//m;
    }
    return $msg;
}

for my $file (@ARGV) {
    open my $fh, '<', $file or die "open: $!";
    my $count = 0;
    $count++ while mbox_read($fh);
    close $fh;
    printf "%5d %s\n", $count, $file;
}
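
For anyone wondering why the parser bothers with Content-Length: at all, here is a minimal sketch on an in-memory mailbox. It is not part of the script above: the sample message and the helper names count_naive and count_with_length are made up for illustration, and count_with_length is a simplified stand-in for mbox_read that trusts the length without the sanity check. Whether this is what makes the counts differ in my real mailbox is only a guess.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical sample: ONE message whose body contains an unescaped
# line starting with "From ". Content-Length covers the whole body.
my $body = "Hello,\nFrom now on this line looks like a separator.\n";
my $mbox = "From alice\@example.org Thu Sep 14 20:00:00 2006\n"
         . "Subject: test\n"
         . "Content-Length: " . length($body) . "\n"
         . "\n"
         . $body
         . "\n";

# Naive approach: every line starting with "From " begins a message.
sub count_naive {
    my ($text) = @_;
    my $count = () = $text =~ /^From /mg;
    return $count;
}

# Simplified Content-Length-aware approach (the length is not validated).
sub count_with_length {
    my ($text) = @_;
    open my $fh, '<', \$text or die "open: $!";
    my $count = 0;
    while (defined(my $line = <$fh>)) {
        next if $line !~ /\AFrom /;        # skip lines between messages
        $count++;
        my $len;
        while (defined($line = <$fh>)) {   # read the header
            last if $line =~ /\A\r?\n/;    # blank line: end of header
            $len = $1 if $line =~ /\AContent-Length:\s+(\d+)\s*\z/i;
        }
        # skip the body in one go when its size is known
        seek $fh, $len, 1 if $len;
    }
    close $fh;
    return $count;
}

# prints: naive: 2, with Content-Length: 1
printf "naive: %d, with Content-Length: %d\n",
    count_naive($mbox), count_with_length($mbox);
```

Without the Content-Length: header both helpers would report 2, which is why a wrong or missing length has to fall back to scanning for 'From ' lines.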

Re^3: Reliable email parsing
by Fletch (Bishop) on Sep 14, 2006 at 20:25 UTC

    Well, your test mailbox appears to contain invalid data (or at least data that Mail::Box considers invalid; it looks like a bad MIME header). Figure out which message is the offending one and let the maintainers know.
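
    One way to hunt for the offending message is sketched below. It rests on an assumption: that the warning comes from a header line which is neither a "Name: value" field nor a folded continuation line (for example, a charset= parameter that lost its leading whitespace). The sub name find_bad_header_lines is made up for this sketch, and it treats every line starting with 'From ' as a message separator, so unescaped 'From ' lines in bodies can cause false positives.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Report header lines that are neither "Name: value" fields nor
# folded continuation lines (which must start with a space or tab).
# Returns a list of [line_number, text] pairs.
sub find_bad_header_lines {
    my ($fh) = @_;
    my @bad;
    my $in_header = 0;
    while (defined(my $line = <$fh>)) {
        if ($line =~ /\AFrom /) {        # mbox separator: a header starts
            $in_header = 1;
            next;
        }
        next if !$in_header;
        if ($line =~ /\A\r?\n/) {        # blank line: the header ends
            $in_header = 0;
        }
        elsif ($line !~ /\A[^\s:]+:/ && $line !~ /\A[ \t]/) {
            (my $text = $line) =~ s/\r?\n\z//;
            push @bad, [$., $text];      # $. is the current line number
        }
    }
    return @bad;
}

for my $file (@ARGV) {
    open my $fh, '<', $file or die "open: $!";
    printf "%s:%d: %s\n", $file, @$_ for find_bad_header_lines($fh);
    close $fh;
}
```

    Run it with the mailbox path as its argument; each reported line number can then be looked up in the mbox file to find the message to send to the maintainers.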

    As for speed, your code does nothing more than read each message in; Mail::Box is building up objects to represent each of the messages. It isn't really surprising that code which (basically) throws away the data it reads, rather than doing anything useful with it, runs faster.
