Re^2: Reliable email parsing

OK, I'm playing with Mail::Box now. Before trying to parse emails I need to parse my mailbox to fetch the individual emails. So I wrote a simple one-liner that counts the messages in my mbox file 'Mail/-default'.

$ time perl -MMail::Box::Manager -le '
    $m=Mail::Box::Manager->new;
    $f=$m->open(folder=>"Mail/-default");
    print scalar $f->messages
'
Unexpected end of header (C parser):
  charset="iso-8859-1"
3554

real    0m8.222s
user    0m7.928s
sys     0m0.200s

Oops! Mutt says there are 3549 messages in this file, not 3554... So I've developed my own reader for the mbox format:

$ time ./mbox_scan Mail/-default 
 3549 Mail/-default

real    0m0.492s
user    0m0.408s
sys     0m0.080s

Hmm. 16 times faster?! Wow. And correct: it found 3549 messages, just like mutt. So maybe I misunderstand something about this world, but WHY is my pure-Perl parser so much faster than the C parser in a mature CPAN module? OK, maybe Mail::Box does a lot of additional parsing that I don't do, maybe... but why does it produce incorrect results?

Here is the code of my parser, if you're interested (sorry, the lines run up to 80 columns):

#!/usr/bin/perl
use warnings;
use strict;

sub mbox_read {
    my ($fh) = @_;
    my ($from, $msg, $len, $lines, $line);
    # read 'From ' line:
    while ($from = <$fh>) {
        last if $from =~ /\AFrom /;
        die 'wrong mailbox format or current position' if $from !~ /\A\r?\n/;
    }
    # exit on EOF
    return if !defined $from;
    # read email header and detect body size if possible
    while ($line = <$fh>) {
        $msg .= $line;
        last if $line =~ /\A\r?\n/;    # end of header
        if ($line =~ /\AContent-Length:\s+(\d+)\Z/i) {
            $len = $1;
        }
        elsif ($line =~ /\ALines:\s+(\d+)\Z/i) {
            $lines = $1;
        }
    }
    # read Content-Length: bytes or Lines: lines, if possible
    if ($len || $lines) {
        my $read_bytes;
        if ($len) {
            $read_bytes = read $fh, $msg, $len, length($msg);
            die "read: $!" if !defined $read_bytes;
        }
        else {
            $read_bytes = length $msg;
            $msg .= scalar <$fh> for 1 .. $lines;
            $read_bytes = length($msg) - $read_bytes;
        }
        # check whether we are really at the end of the message now,
        # or whether Content-Length: is wrong
        my $extra_bytes = read $fh, my $next_lines, 64;
        die "read: $!" if !defined $extra_bytes;
        if ($next_lines =~ /\A[\r\n]*From / ||
                (eof($fh) && $next_lines =~ /\A[\r\n]*\z/)) {
            seek $fh, -$extra_bytes, 1                  or die "seek: $!";
        }
        else {
            substr($msg, -$read_bytes, $read_bytes, q{});
            seek $fh, -($extra_bytes+$read_bytes), 1    or die "seek: $!";
            goto LINE_BY_LINE;
        }
    }
    # else read line-by-line until 'From ' or EOF
    else {
        LINE_BY_LINE:
        while ($line = <$fh>) {
            if ($line !~ /\AFrom /) {
                $msg .= $line;
            }
            else {
                seek $fh, -length($line), 1             or die "seek: $!";
                last;
            }
        }
        # remove the last empty line, if any, because it usually belongs
        # to the mbox format rather than to the message body
        $msg =~ s/^\r?\n\z//m;
    }
    return $msg;
}

for my $file (@ARGV) {
    open my $fh, '<', $file or die "open: $!";
    my $count = 0;
    $count++ while mbox_read($fh);
    close $fh;
    printf "%5d %s\n", $count, $file;
}

Re^3: Reliable email parsing by Fletch (Bishop) on Sep 14, 2006 at 20:25 UTC
Well, your test mailbox looks to have invalid data (or at least data that is invalid as far as Mail::Box is concerned; it looks like a bad MIME header). Figure out which message is the offending one and let the maintainers know. As for speed, your code is doing nothing more than reading the message body in; Mail::Box is building up objects to represent each of the messages. It isn't really surprising that code which (basically) throws away the data it reads, rather than doing anything useful with it, runs faster.
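One possible way to start hunting for the offending message is to list the "From " lines that do not look like ordinary mbox separators, since a body line that begins with "From " is a common reason for two readers to disagree on the message count. What follows is only a minimal sketch in core Perl, not something taken from the post or from Mail::Box: the helper name suspicious_from_lines and the separator pattern (an address followed by an asctime-style date) are assumptions, and real mailboxes may use separator variants that this pattern does not cover.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Assumed shape of a typical separator line, e.g.
#   From user@example.com Thu Sep 14 20:25:00 2006
# This pattern is a heuristic, not a definition of the mbox format.
my $SEPARATOR = qr/\AFrom \S*\s+\w{3}\s+\w{3}\s+\d{1,2}\s+\d{1,2}:\d\d(?::\d\d)?(?:\s+\S+)?\s+\d{4}\s*\z/;

# Hypothetical helper: return the line numbers of lines that start
# with "From " but do not match the assumed separator pattern.
sub suspicious_from_lines {
    my ($fh) = @_;
    my @suspects;
    my $line_no = 0;
    while (my $line = <$fh>) {
        $line_no++;
        next if $line !~ /\AFrom /;      # not a candidate at all
        next if $line =~ $SEPARATOR;     # looks like a real separator
        push @suspects, $line_no;
    }
    return @suspects;
}

# Print "file:line" for every suspicious line in each file given.
for my $file (@ARGV) {
    open my $fh, '<', $file or die "open $file: $!";
    print "$file:$_\n" for suspicious_from_lines($fh);
    close $fh;
}
```

Run against the mailbox (for example with Mail/-default as the argument), any line numbers it prints are only candidates to inspect by hand. An empty result would suggest the discrepancy comes from somewhere else, such as the header that Mail::Box complained about.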