Fastest way to get byte offsets of a string using tell

by chanakya (Friar)
on Mar 12, 2009 at 12:32 UTC

chanakya has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I have a sample script that gets the byte offset of a given string in an archive file.
The script reads a large archive file (around 200GB) and reports the byte offset at which the string was found.

Below is the sample script. It reads the archive file, which is a collection of email messages.
Each message starts with the string "From " and ends with a blank line, so "From " is the marker for every email.
    my $infil  = '20090101.arch';
    my $begPat = '^From ';
    my $INFILE = new IO::File();
    open($INFILE, "< $infil") or die("$infil: $!");
    my $Num = 0;
    my ($line, $subject);
    while(1) {
        my $startOffset = tell($INFILE);
        $startOffset-=scalar(length($line)) if($line =~ /$begPat/);
        while($line = <$INFILE>) {
            $subject = $line if($line =~ /^Subject:/);
            last;
        }
    }
    close($INFILE);
The syntax of the archive file is as below.
    From 10.1.60.40
    Accept: */*
    Transfer-Encoding: chunked
    Subject: Test mail 1
    Mime-Version: 1.0
    Content-Type: multipart/mixed;
    Content-Type: text/plain
    xxxxxxxxxxxxx
    xxxxxxxxxxxxxxxx
    xxxxxxxxxxxx
    Content-Type: application/x-gzip
    Content-Disposition: attachment; filename="Sampledata.gz"
    Content-Transfer-Encoding: base64
    H4sIAAAAAAAAA+y9bXOkOLbv+74+BSfiztyeuG0Xz0nWnuu4rqfumq6nXa7qntlxIjIwiW
    +1O

    From 10.60.128.140
    Accept: */*
    Transfer-Encoding: chunked
    Subject: Test mail 2
    Mime-Version: 1.0
    Content-Type: multipart/mixed;
    Content-Type: text/plain
    xxxxxx
    xxxxxx
    xxxxxx
I want a script that does the following:
1) Gets the Subject and the byte offsets of every email in the archive (the offset of "From " and the offset of the end of the email)
2) Reads the archive files and prints out this information as fast as possible

Please suggest the best possible ways to achieve the same. I want the script to complete the processing in the minimum time possible.

Thanks for your time.

Replies are listed 'Best First'.
Re: Fastest way to get byte offsets of a string using tell
by repellent (Priest) on Mar 12, 2009 at 13:42 UTC
    Assuming the end of an email is the beginning of the next "From ", you only need one offset per email:
        use warnings;
        use strict;

        my %subject_offset;
        my $last_offset = 0;
        {
            # divide file into email chunks
            local $/ = "\n\nFrom "; # note single whitespace at the end
            open(my $FH, "<", "20090101.arch") or die($!);
            while (my $block = <$FH>)
            {
                if ($block =~ /^.*?\nSubject: (.*?)\n/s)
                {
                    $subject_offset{$1} = $last_offset;
                    $last_offset = tell($FH) - 5; # minus length("From ")
                }
            }
        }

        use Data::Dumper;
        print Dumper(\%subject_offset);

        __END__
        $VAR1 = {
                  'Test mail 1' => 0,
                  'Test mail 2' => 395
                };
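    The original question also asks for the end offset of each email. A minimal line-by-line sketch of that, under the same assumption as this reply (a message ends where the next "From " line begins, or at end of file), might look like the code below. The name index_mbox and the record layout are hypothetical, not from the thread. It keeps one record per message in file order, so duplicate or missing subjects do not lose entries. It does not handle "From " lines inside a message body, which a real mbox parser has to deal with.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns a list of
# { start, end, subject } hash refs, one per message, in file order.
# 'end' is exclusive: it is the offset of the next "From " line, or EOF.
sub index_mbox {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "$path: $!";

    my (@index, $current);
    my $pos = 0;    # byte offset of the line about to be read
    while (my $line = <$fh>) {
        if ($line =~ /^From /) {
            # A new message starts here, so the previous one ends here.
            if ($current) {
                $current->{end} = $pos;
                push @index, $current;
            }
            $current = { start => $pos, end => undef, subject => undef };
        }
        elsif ($current && !defined $current->{subject}
               && $line =~ /^Subject:\s*(.*?)\s*$/) {
            # Assumption: the first Subject: line after "From " is the one wanted.
            $current->{subject} = $1;
        }
        $pos = tell($fh);    # offset just past the line we consumed
    }
    if ($current) {
        $current->{end} = $pos;    # last message runs to end of file
        push @index, $current;
    }
    close($fh);
    return @index;
}

# Usage, with the file name from the original post:
#   for my $m (index_mbox('20090101.arch')) {
#       printf "%d\t%d\t%s\n", $m->{start}, $m->{end}, $m->{subject} // '';
#   }
```

    Whether this is faster or slower than the record-separator approach above is something to measure on a sample of the real data rather than assume.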
Re: Fastest way to get byte offsets of a string using tell
by ig (Vicar) on Mar 12, 2009 at 15:07 UTC

    The example you give looks like an "mbox" format message file. Parsing such files can be a bit tricky. I would look for a module to parse the message file. While I haven't used it, you might consider Mail::Box-Overview. Otherwise I suggest a search of CPAN for mailbox.

    If you are getting offsets to build an index to solve performance problems accessing large "mbox" style mail files, then I suggest you find an existing module or application for doing so or change the format entirely. It might be best to move the messages into a database, for example. This isn't a new problem and there are working solutions available (e.g. existing MUAs and MTAs). You don't need to re-invent the wheel.
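
    If the offsets are meant as an index, the point of having them is that a single message can later be fetched with seek and read instead of rescanning the archive. A rough sketch, assuming start and end (exclusive) offsets have already been collected somehow; read_message is a hypothetical name, not part of any module mentioned here. With a 200GB file the offsets exceed 32 bits, so this also assumes a perl built with large-file support (perl -V:uselargefiles reports it).

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Hypothetical helper: return the bytes in [$start, $end) of $path.
sub read_message {
    my ($path, $start, $end) = @_;
    open(my $fh, '<:raw', $path) or die "$path: $!";
    seek($fh, $start, SEEK_SET) or die "seek to $start: $!";

    my $len = $end - $start;
    my $buf = '';
    my $got = read($fh, $buf, $len);
    die "read: $!" unless defined $got;
    die "short read: wanted $len bytes, got $got" unless $got == $len;

    close($fh);
    return $buf;
}

# Usage, with offsets taken from a previously built index:
#   print read_message('20090101.arch', $start, $end);
```

    For repeated lookups it would make sense to keep the file handle open rather than reopening the archive each time; the sketch reopens it only to stay short.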

Re: Fastest way to get byte offsets of a string using tell
by bellaire (Hermit) on Mar 12, 2009 at 13:20 UTC
    It looks like you already have a script that does most of what you want. You're already getting the offset of the line.

    For the first question regarding offsets, what have you tried, and why didn't it work? Your script stops short of making any effort to actually do this. This isn't a place where you can write half a script and then expect us to finish it, so what specifically is stopping you from accomplishing this?

    As to optimization, never do it before you've done profiling. If your script is not running acceptably fast, find out where the slow parts are and address them. You need to do profiling to identify the slow parts. Do you need direction as to how to perform profiling?
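
    As a starting point short of a full profiler, the core Benchmark module can compare candidate approaches on a small sample of the archive. Note that this is a timing comparison, not profiling. The sketch below is only illustrative: the sub names and the sample file name are made up, and the two scanners merely count messages (by line, and by "\n\nFrom " records as in the other reply), so that the read strategy is the only thing being timed.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Count messages by reading one line at a time.
sub count_by_line {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "$path: $!";
    my $n = 0;
    while (my $line = <$fh>) {
        $n++ if $line =~ /^From /;
    }
    close($fh);
    return $n;
}

# Count messages by reading one "\n\nFrom "-terminated record at a time.
# Assumes every message except the first is preceded by a blank line.
sub count_by_record {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "$path: $!";
    local $/ = "\n\nFrom ";
    my $n = 0;
    while (defined(my $block = <$fh>)) {
        $n++;
    }
    close($fh);
    return $n;
}

# Hypothetical driver: run each scanner for at least $seconds CPU seconds
# and print a comparison table.
sub compare_scanners {
    my ($path, $seconds) = @_;
    cmpthese(-$seconds, {
        by_line   => sub { count_by_line($path) },
        by_record => sub { count_by_record($path) },
    });
}

# Usage, where sample.arch is a hypothetical small excerpt of the archive:
#   compare_scanners('sample.arch', 3);
```

    With a 200GB input the run is likely to be dominated by disk reads, so timings taken on a small, cached sample may not carry over; it is worth checking on a realistically sized excerpt as well.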
Re: Fastest way to get byte offsets of a string using tell
by Anonymous Monk on Mar 12, 2009 at 12:53 UTC
