oko1 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, all -

I'm trying to parse out a 50+MB mbox and show the users an interface for reading and searching it, and I used Mail::MboxParser to do the indexing. Unfortunately, whenever I try to run the CGI that I wrote, it pegs the CPU - and stays there, for a long time. I did a few tests, and it is definitely related to the size of the file:

for n in 100k 1M 10M; do echo -n '*** Processing a' $n 'mailbox: ***'; head -c $n origami_archive > test; time ./mboxparser.cgi > /dev/null; done

*** Processing a 100k mailbox: ***
real    0m1.399s
user    0m1.364s
sys     0m0.036s

*** Processing a 1M mailbox: ***
real    0m5.943s
user    0m5.880s
sys     0m0.060s

*** Processing a 10M mailbox: ***
real    0m53.079s
user    0m52.991s
sys     0m0.088s

50MB takes... well, I didn't have the patience. :) Way too long to be useful, anyway.

So, here's my code. I can't see what I'm doing wrong, so I'd really appreciate help!

#!/usr/bin/perl -w
# Created by Ben Okopnik on Thu Jan 14 21:55:46 EST 2010
use strict;

use Mail::MboxParser;
use CGI::Carp qw/fatalsToBrowser warningsToBrowser/;
use CGI qw/:standard/;

$|++;

my $fname = "test";
my $mb = Mail::MboxParser->new(
    $fname,
    parseropts => {
        enable_cache    => 1,
        # enable_grep   => 1,
        cache_file_name => '/tmp/cache' . substr(rand(), 1, 10)
    }
);

my($self) = $0 =~ m{([^/]+)$};
my $count = $mb->nmsgs - 1;

binmode STDOUT, ':encoding(UTF-8)';    # Set up utf-8 output
print header(-charset => 'utf-8'),
      start_html( -encoding => 'utf-8', -title => 'Origami Archive');

if (!param('msg')){
    my $end;
    my $incr = 50;
    my $start = param('start') || 0;
    my $div;

    # $start is always going to be $incr * $_ for 0 .. int($count / $incr)
    # If we're more than $incr posts from the start (i.e., $start is $incr or more),
    # show the "Previous" link
    if ($start > 0){
        my $bottom = $start - $incr;
        print a({-href=>"$self?start=$bottom"}, "Previous $incr");
        $div = " | ";
    }
    # If we're >= $incr posts from the end, show the 'Next' link
    if ($count - $start >= $incr){
        my $top = $start + $incr;
        print $div if $div;
        print a({-href=>"$self?start=$top"}, "Next $incr");
        $end = $top - 1;
    }
    else {
        $end = $count;
    }
    print hr;
    # print "Start: $start End: $end";

    # Subscripting one message after the other
    print "<table>\n";
    for my $idx ($start .. $end) {
        my $msg = $mb->get_message($idx);
        my %m = %{$msg->header};
        print Tr(td(b("&gt;&gt;"), a({-href=>"$self?msg=$idx"}, escapeHTML($m{subject}))),
                 td(escapeHTML($m{from}))), "\n";
    }
    print "</table>\n";
}
else {
    my $msg = param('msg');
    my $prev = $msg - ($msg > 0 ? 1 : 0);
    my $next = $msg + ($msg < $count ? 1 : 0);
    print join " | ",
        ( $msg ? a({-href=>"$self?msg=0"}, "&lt;&lt;") : "&lt;&lt;" ),
        ($msg ? a({-href=>"$self?msg=$prev"}, "Previous") : "Previous"),
        a({-href=>$self}, "Index"),
        a({-href=>"$self?msg=$next"}, "Next"),
        ($msg < $count ? a({-href=>"$self?msg=$count"}, "&gt;&gt;") : "&gt;&gt;");
    print hr, pre($mb->get_message($msg));
}
print end_html;

--
"Language shapes the way we think, and determines what we can think about."
-- B. L. Whorf

Re: Mail::MboxParser pegs the CPU
by zwon (Abbot) on Jan 15, 2010 at 18:56 UTC

    Well, you can't parse 50MB in a second... You should create an index for your mailbox and load it instead of parsing the mailbox on every request. I would use a database for this purpose. But if you want to stick with the file, take a look at the make_index method in Mail::MboxParser. It would require some additional work, as you probably want to index more information than just the message number, but it's something you can start with.

      Actually, I've been doing that while waiting for a reply here - but I can't quite figure out what's going on. According to the docs, 'make_index' is supposed to run automatically as soon as I exec a 'get_message' method - but the cache file never gets created, no matter what I do (!). I've been trying to figure that out for the past hour or so; still haven't found anything like an answer.

      What I really wish is that I actually understood this process of indexing (I have some hazy conception of saving pointers to message positions within the file, and then reusing those instead of traversing the entire file, but no clue of how to make that work efficiently.) I would have preferred to write that part myself, but had to rely on a module instead.
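The "saving pointers to message positions" idea can be sketched in plain Perl, independent of the module. This is a minimal illustration, not how Mail::MboxParser works internally: the subs build_index and read_message are hypothetical names, and the code assumes every message starts at a line beginning with "From " (the classic mbox separator), which real mailboxes do not always honor.

```perl
use strict;
use warnings;

# Scan the mbox once and record the byte offset at which each message starts.
# Assumption: a message begins at any line starting with "From ".
sub build_index {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my @offsets;
    my $pos = 0;                       # offset of the line about to be read
    while (my $line = <$fh>) {
        push @offsets, $pos if $line =~ /^From /;
        $pos = tell $fh;               # offset of the next line
    }
    close $fh;
    return \@offsets;
}

# Fetch message $n by seeking straight to its offset and reading up to the
# start of the next message (or to end of file for the last one).
sub read_message {
    my ($path, $offsets, $n) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $start = $offsets->[$n];
    my $end   = $n < $#{$offsets} ? $offsets->[$n + 1] : -s $fh;
    seek $fh, $start, 0 or die "Cannot seek in $path: $!";
    read $fh, my $msg, $end - $start;
    close $fh;
    return $msg;
}

# Usage sketch ('origami_archive' is the mbox named in the original post):
# my $offsets = build_index('origami_archive');
# print read_message('origami_archive', $offsets, 0);
```

The full scan happens only in build_index; once the offsets are known, showing any one message costs a single seek and read, no matter how large the mailbox is.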



        You can use something like this to create an index:

        use strict;
        use warnings;
        use Mail::MboxParser;

        my $mb = Mail::MboxParser->new( 'mbox', );
        my $ind = $mb->make_index;
        for ( 0 .. $mb->nmsgs - 1 ) {
            printf "%5.5d => %10.10d => %s\n", $_, $mb->get_pos($_),
              $mb->get_message($_)->header->{subject};
        }
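To avoid rebuilding such an index on every CGI request, it could be saved to disk and reloaded. The following is a rough sketch using the core Storable module, storing the subject and sender alongside each position, since the listing page needs both. The subs scan_mbox and load_index and the cache layout are hypothetical. The scanner is deliberately naive: it assumes messages start at lines beginning with "From " and that headers are neither folded nor MIME-encoded, cases a real parser such as Mail::MboxParser handles.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# One pass over the mbox, collecting { pos, subject, from } per message.
sub scan_mbox {
    my ($mbox) = @_;
    open my $fh, '<:raw', $mbox or die "Cannot open $mbox: $!";
    my (@index, $in_header);
    my $pos = 0;
    while (my $line = <$fh>) {
        if ($line =~ /^From /) {                  # assumed message separator
            push @index, { pos => $pos, subject => '', from => '' };
            $in_header = 1;
        }
        elsif ($in_header && $line =~ /^\r?$/) {
            $in_header = 0;                       # blank line ends the headers
        }
        elsif ($in_header && $line =~ /^(Subject|From):\s*(.*?)\r?$/i) {
            $index[-1]{ lc($1) } = $2;
        }
        $pos = tell $fh;
    }
    close $fh;
    return \@index;
}

# Return the index, rescanning only when the cache file is missing or the
# mbox size or modification time no longer matches what was cached.
sub load_index {
    my ($mbox, $cache) = @_;
    my ($size, $mtime) = (stat $mbox)[7, 9];
    die "Cannot stat $mbox: $!" unless defined $size;
    if (-e $cache) {
        my $saved = retrieve($cache);
        return $saved->{index}
            if $saved->{size} == $size && $saved->{mtime} == $mtime;
    }
    my $index = scan_mbox($mbox);
    store({ size => $size, mtime => $mtime, index => $index }, $cache);
    return $index;
}

# Usage sketch (the cache path is an arbitrary example):
# my $index = load_index('origami_archive', '/tmp/origami.idx');
# printf "%s (%s)\n", $_->{subject}, $_->{from} for @{$index}[0 .. 49];
```

With this arrangement only the first request after the mailbox changes pays for the scan; later requests just deserialize the cache. Note that the cache file name has to be stable across requests for this to work, unlike a name generated with rand().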