Mail::MboxParser pegs the CPU

oko1 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, all -

I'm trying to parse out a 50+MB mbox and show the users an interface for reading and searching it, and I used Mail::MboxParser to do the indexing. Unfortunately, whenever I try to run the CGI that I wrote, it pegs the CPU - and stays there, for a long time. I did a few tests, and it is definitely related to the size of the file:

for n in 100k 1M 10M; do
    echo -n '*** Processing a' $n 'mailbox: ***'
    head -c $n origami_archive > test
    time ./mboxparser.cgi > /dev/null
done
*** Processing a 100k mailbox: ***
real    0m1.399s
user    0m1.364s
sys    0m0.036s
*** Processing a 1M mailbox: ***
real    0m5.943s
user    0m5.880s
sys    0m0.060s
*** Processing a 10M mailbox: ***
real    0m53.079s
user    0m52.991s
sys    0m0.088s

50MB takes... well, I didn't have the patience. :) Way too long to be useful, anyway.

So, here's my code. I can't see what I'm doing wrong, so I'd really appreciate help!

#!/usr/bin/perl -w
# Created by Ben Okopnik on Thu Jan 14 21:55:46 EST 2010
use strict;
use Mail::MboxParser;
use CGI::Carp qw/fatalsToBrowser warningsToBrowser/;
use CGI qw/:standard/;
$|++;

my $fname = "test";

my $mb = Mail::MboxParser->new(
    $fname,
    parseropts => {
        enable_cache    => 1,
        # enable_grep     => 1,
        cache_file_name => '/tmp/cache' . substr(rand(), 1, 10)
    }
);

my($self) = $0 =~ m{([^/]+)$};
my $count = $mb->nmsgs - 1;

binmode STDOUT, ':encoding(UTF-8)';     # Set up utf-8 output
print header(-charset => 'utf-8'),
start_html( -encoding => 'utf-8', -title => 'Origami Archive');

if (!param('msg')){
    my $end;
    my $incr = 50;
    my $start = param('start') || 0;
    my $div;

    # $start is always going to be $incr * $_ for 0 .. int($count / $incr)
    # If we're more than $incr posts from the start (i.e., $start is $incr or more),
    # show the "Previous" link
    if ($start > 0){
        my $bottom = $start - $incr;
        print a({-href=>"$self?start=$bottom"}, "Previous $incr");
        $div = " | ";
    }
    # If we're >= $incr posts from the end, show the 'Next' link
    if ($count - $start >= $incr){
        my $top = $start + $incr;
        print $div if $div;
        print a({-href=>"$self?start=$top"}, "Next $incr");
        $end = $top - 1;
    }
    else {
        $end = $count;
    }

    print hr;

    # print "Start: $start End: $end";
    # Subscripting one message after the other
    print "<table>\n";
    for my $idx ($start .. $end) {
        my $msg = $mb->get_message($idx);
        my %m = %{$msg->header};
        print Tr(td(b("&gt;&gt;"), a({-href=>"$self?msg=$idx"},
            escapeHTML($m{subject}))), td(escapeHTML($m{from}))), "\n";
    }
    print "</table>\n";
}
else {
    my $msg = param('msg');
    my $prev = $msg - ($msg > 0 ? 1 : 0);
    my $next = $msg + ($msg < $count ? 1 : 0);
    print join " | ", ( $msg ? a({-href=>"$self?msg=0"}, "&lt;&lt;") : "&lt;&lt;" ),
        ($msg ? a({-href=>"$self?msg=$prev"}, "Previous") : "Previous"),
        a({-href=>$self}, "Index"),
        a({-href=>"$self?msg=$next"}, "Next"),
        ($msg < $count ? a({-href=>"$self?msg=$count"}, "&gt;&gt;") : "&gt;&gt;");
    print hr, pre($mb->get_message($msg));
}

print end_html;

--
"Language shapes the way we think, and determines what we can think about."
-- B. L. Whorf


Replies are listed 'Best First'.
Re: Mail::MboxParser pegs the CPU by zwon (Abbot) on Jan 15, 2010 at 18:56 UTC
Well, you can't parse 50MB in a second... You should create an index for your mailbox and load it, instead of parsing the mailbox on every request. I would use a database for this purpose. But if you want to use a file, take a look at the make_index method in Mail::MboxParser. It would require some additional work, as you probably want to index more information than just the message number, but it's something you can start with.
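
A minimal sketch of the "build the index once, load it on every request" idea, using only core Perl (Storable). The name load_or_build_index, the cache layout, and the $builder callback are illustrative assumptions and not part of Mail::MboxParser; the callback stands for whatever routine actually produces the index. The cache is reused only while the mailbox's size and mtime are unchanged.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);

# Return the index for $mbox_path, rebuilding it only when the mailbox
# has changed since the cache file was written.
# $builder is a code ref (hypothetical) that takes the mbox path and
# returns a reference to the index data.
sub load_or_build_index {
    my ($mbox_path, $cache_path, $builder) = @_;

    my @st = stat $mbox_path or die "Cannot stat $mbox_path: $!";
    my ($size, $mtime) = @st[7, 9];

    if (-e $cache_path) {
        # A corrupt or unreadable cache is treated as "no cache".
        my $cached = eval { retrieve($cache_path) };
        if ($cached && $cached->{size} == $size && $cached->{mtime} == $mtime) {
            return $cached->{index};
        }
    }

    # Cache missing or stale: do the expensive work once and save it.
    my $index = $builder->($mbox_path);
    nstore({ size => $size, mtime => $mtime, index => $index }, $cache_path);
    return $index;
}
```

Note that this only helps if the cache path is the same from one request to the next; a file name that changes on every run can never be found again.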
Re^2: Mail::MboxParser pegs the CPU by oko1 (Deacon) on Jan 15, 2010 at 19:25 UTC
Actually, I've been doing that while waiting for a reply here - but I can't quite figure out what's going on. According to the docs, 'make_index' is supposed to run automatically as soon as I call a 'get_message' method - but the cache file never gets created, no matter what I do (!). I've been trying to figure that out for the past hour or so; still haven't found anything like an answer.

What I really wish is that I actually understood this process of indexing. (I have some hazy conception of saving pointers to message positions within the file, and then reusing those instead of traversing the entire file, but no clue how to make that work efficiently.) I would have preferred to write that part myself, but had to rely on a module instead.
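
The "pointers to message positions" idea can be sketched in core Perl with tell and seek. This is an illustration under stated assumptions, not a description of how Mail::MboxParser works internally: build_index and read_message are hypothetical names, and a message is assumed to start at any "From " line that is either the first line of the file or follows a blank line.

```perl
use strict;
use warnings;

# Scan the mbox once and record the byte offset at which each message starts.
sub build_index {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my @offsets;
    my $prev_blank = 1;          # start of file counts as "after a blank line"
    while (1) {
        my $pos  = tell $fh;     # offset of the line about to be read
        my $line = <$fh>;
        last unless defined $line;
        push @offsets, $pos if $prev_blank && $line =~ /^From /;
        $prev_blank = ($line =~ /^\r?\n\z/) ? 1 : 0;
    }
    close $fh;
    return \@offsets;
}

# Fetch message $n by seeking straight to its offset and reading up to the
# start of the next message (or to end of file for the last message).
sub read_message {
    my ($path, $offsets, $n) = @_;
    die "No such message: $n" if $n < 0 || $n > $#$offsets;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $start = $offsets->[$n];
    my $end   = $n < $#$offsets ? $offsets->[$n + 1] : -s $path;
    seek $fh, $start, 0 or die "Cannot seek in $path: $!";
    my $buf = '';
    defined read($fh, $buf, $end - $start) or die "Cannot read $path: $!";
    close $fh;
    return $buf;
}
```

With the offsets saved, showing one message costs a single seek and read rather than a pass over the whole file, e.g. my $idx = build_index('test'); print read_message('test', $idx, 3);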
Re^3: Mail::MboxParser pegs the CPU by zwon (Abbot) on Jan 15, 2010 at 20:09 UTC
You can use something like this to create an index:

use strict;
use warnings;
use Mail::MboxParser;

my $mb = Mail::MboxParser->new(
    'mbox',
);
my $ind = $mb->make_index;
for ( 0 .. $mb->nmsgs - 1 ) {
    printf "%5.5d => %10.10d => %s\n", $_, $mb->get_pos($_),
        $mb->get_message($_)->header->{subject};
}