Perl wins hands down (IMO) for processing fixed length data.

The following interactive session shows me slurping a 100MB file into memory and scanning its 6.5 million 128-bit chunks for the bit pattern 0xffffffffffffffffffffffffffffffff. The pattern is never found, but the entire search took just 49 seconds on my 233MHz machine.

open F, '<', 'e:\hugefile' or die $^E;
binmode F;
$file = do{ local $/=\(100*1024*1024); <F> };
print length $file;
104857600
print scalar localtime;
for( my $i=0; $i< (100*1024*1024); $i+=16 ) {
    print $i if substr($file, $i, 16) eq ("\xff" x 16)
};
print scalar localtime;
Sun Oct 19 16:00:32 2003
Sun Oct 19 16:01:21 2003

The whole 'program' took maybe 3 or 4 minutes to write. Try doing that with C :-).

The total memory usage for the program was a little over 110MB.

The secret when handling large lumps of fixed-length data is not to break them up into an array of 6.5 million little strings. Instead, leaving the data in a single large string and just indexing into it using substr allows very memory- and CPU-efficient processing.
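A minimal sketch of that substr approach on a toy buffer, assuming 16-byte records. The find_records helper and the four-record test string are made up for illustration and are not part of the session above:

```perl
use strict;
use warnings;

# Hypothetical helper: return the byte offset of every $len-byte record
# in $$buf_ref that equals $pattern, stepping one record at a time.
# The buffer is passed by reference so it is never copied.
sub find_records {
    my ( $buf_ref, $pattern, $len ) = @_;
    my @hits;
    for ( my $i = 0; $i + $len <= length $$buf_ref; $i += $len ) {
        push @hits, $i if substr( $$buf_ref, $i, $len ) eq $pattern;
    }
    return @hits;
}

# Small stand-in for the 100MB file: four 16-byte records.
my $data = ( "\x00" x 16 ) . ( "\xff" x 16 ) . ( "\x01" x 16 ) . ( "\xff" x 16 );

print "match at offset $_\n" for find_records( \$data, "\xff" x 16, 16 );
```

The data stays in one scalar throughout; only the matching offsets are collected, not the records themselves.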

You can then use index to search very quickly for fixed patterns, and even apply regexes to the 100MB as a whole, or to individual fields by using substr as an lvalue.
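Roughly what those three techniques look like, sketched on a made-up 24-byte buffer of three 8-byte records. The record length, field contents, and variable names here are assumptions chosen for the example:

```perl
use strict;
use warnings;

my $rec_len = 8;                             # assumed record length for this toy example
my $data    = 'AAAAAAAAfoo  barBBBBBBBB';    # three 8-byte records in one string

# index: fast search for a fixed pattern anywhere in the buffer.
my $pos = index( $data, 'bar' );
printf "found 'bar' at offset %d, in record %d\n", $pos, int( $pos / $rec_len )
    if $pos >= 0;

# substr as an lvalue: the substitution only sees (and only changes) record 1.
# The replacement has the same length, so later records keep their offsets.
substr( $data, 1 * $rec_len, $rec_len ) =~ s/foo/FOO/;
print "$data\n";

# A regex over the whole buffer; $-[0] holds the offset where the match starts.
if ( $data =~ /B+/ ) {
    print "run of B starts at offset $-[0]\n";
}
```

Keeping replacements the same length as what they replace matters with fixed-length data, otherwise every following record shifts.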

Your C program may ultimately be a tad more efficient, but I bet it takes you longer to write.

If you don't have 100MB of RAM to spare, processing the file in one pass in a while loop, by setting $/ = \(16);, is also very fast, and if you need random access, seek and tell make that easy too.
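A sketch of the record-at-a-time variant, assuming 16-byte records. To keep it self-contained it writes a tiny temporary file with File::Temp as a stand-in for the real data; the file contents and variable names are invented for the example:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small stand-in file of four 16-byte records.
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
binmode $tmp;
print {$tmp} "\x00" x 16, "\xff" x 16, "\x01" x 16, "\x02" x 16;
close $tmp or die $!;

open my $fh, '<', $path or die $!;
binmode $fh;

# One pass: with $/ set to a reference to 16, each read returns 16 bytes.
my @hits;
{
    local $/ = \16;
    while ( my $rec = <$fh> ) {
        # tell() is now just past this record, so subtract its length.
        push @hits, tell($fh) - length($rec) if $rec eq "\xff" x 16;
    }
}
print "match at offset $_\n" for @hits;

# Random access: jump straight to record 2 (third record) and read it.
seek $fh, 2 * 16, 0 or die $!;               # whence 0 = from start of file
read( $fh, my $third, 16 ) == 16 or die "short read";
print 'record 2: ', unpack( 'H*', $third ), "\n";

close $fh;
```

Only one record is held in memory at a time, so the same loop should cope with files much larger than available RAM.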


Update: Whilst the code above works well, a few minor tweaks make it run substantially quicker.

First.

my $file = do{ local $/=\(100*1024*1024); <FILE> };

causes a peak memory usage substantially greater than is required. I can only assume that this is because an intermediate buffer is used somewhere. Whilst the extra memory is quickly returned to the OS (under Win anyway), this can be avoided by recoding it as

my $file; { local $/ = \(100*1024*1024); $file = <FILE>; }

This has the additional benefit that the load time for the 100MB, from a compressed file on a so-so speed disk, is cut from 15 seconds to 5 seconds.

Additionally, whilst looping over the string with substr in 16-byte chunks was pretty quick, using index to search the whole string in a single pass is substantially quicker. An order of magnitude quicker, in fact, at 4 seconds!

This does mean that if a match is found, you would have to test the position returned by index modulo 16 to ensure the match starts on a 16-byte boundary rather than spanning one, but the performance gain makes the housekeeping worthwhile.

#! perl -slw
use strict;

open F, '<', 'e:\100MB' or die $!;
binmode F;

print 'Slurping... ', scalar localtime;
my $file;
{
    local $/ = \(100*1024*1024);    # read the whole 100MB as one record
    $file = <F>;
}

print 'Slurped. Searching...', scalar localtime;
# index returns -1 (a true value) when there is no match, so compare explicitly.
# The run below was made with a plain 'unless index(...)' test, which stays silent on a miss.
print 'Not found' if index( $file, "\xff" x 16 ) == -1;
print 'Searched. ', scalar localtime;

__END__
P:\test>junk2
Slurping... Mon Oct 20 00:29:21 2003
Slurped. Searching...Mon Oct 20 00:29:26 2003
Searched. Mon Oct 20 00:29:30 2003
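The modulo housekeeping mentioned above could look something like this. It is only a sketch: index_aligned is a hypothetical helper, not something from the timed run, and the test buffer is contrived so that the first hit straddles a record boundary:

```perl
use strict;
use warnings;

# Hypothetical helper: offset of the first occurrence of $pattern in
# $$buf_ref that starts on a multiple of $len, or -1 if there is none.
sub index_aligned {
    my ( $buf_ref, $pattern, $len ) = @_;
    my $pos = 0;
    while ( ( $pos = index( $$buf_ref, $pattern, $pos ) ) >= 0 ) {
        return $pos unless $pos % $len;    # starts on a record boundary
        # Misaligned hit: no aligned match can start before the next
        # boundary, so resume the search from there.
        $pos += $len - $pos % $len;
    }
    return -1;
}

# Records 0 and 1 each hold eight \xff bytes that together form a
# 16-byte run across the boundary; record 2 is a genuine match.
my $data = ( "\x00" x 8 ) . ( "\xff" x 16 ) . ( "\x00" x 8 ) . ( "\xff" x 16 );

print 'plain index:   ', index( $data, "\xff" x 16 ), "\n";
print 'aligned index: ', index_aligned( \$data, "\xff" x 16, 16 ), "\n";
```

Plain index reports the straddling run first; the helper skips it and keeps searching with index, so the bulk of the work is still done by the fast built-in.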

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Hooray!


In reply to Re: is perl the best tool for this job?(emphatically Yes!) by BrowserUk
in thread is perl the best tool for this job ? by spurperl
