PerlMonks  

is perl the best tool for this job ?

by spurperl (Priest)
on Oct 19, 2003 at 13:49 UTC

spurperl has asked for the wisdom of the Perl Monks concerning the following question:

Dear fellow monks,

I got a new programming task at work now, and wonder what is the best language/platform to code it in.

The task consists mostly of reading huge binary files (~100 MB), divided into 128-bit chunks, and digging through this heap of bits looking for certain patterns and validating some information. An additional need is doing some ASCII file work: comparing things, reporting things and reading configurations. There also has to be some GUI.

My main concern with Perl is speed. I fear that processing binary data in Perl is not fast enough to accomplish the tasks in a reasonable time. Therefore, I am also considering C. But the argument against C is that I'll have to code a GUI (Perl/Tk) and do some text file processing, in which Perl is definitely superior.

If I choose Perl eventually, is there any good advice for me? Any useful CPAN modules? How is it best to handle 128-bit binary frames in Perl? How is it best to read such a file? All of it into memory? But how?

Hoping for insights and advice :-)

Replies are listed 'Best First'.
Re: is perl the best tool for this job? (emphatically Yes!)
by BrowserUk (Patriarch) on Oct 19, 2003 at 16:08 UTC

    Perl wins hands down (IMO) for processing fixed length data.

    The following interactive session shows me slurping a 100 MB file into memory and scanning its 6.5 million 128-bit chunks for the bit pattern 0xffffffffffffffffffffffffffffffff. The pattern is never found, but the entire search took just 49 seconds on my 233MHz machine.

    open F, '<', 'e:\hugefile' or die $^E;
    binmode F;
    $file = do{ local $/=\(100*1024*1024); <F> };
    print length $file;
    104857600
    print scalar localtime;
    for( my $i=0; $i< (100*1024*1024); $i+=16 ) {
        print $i if substr($file, $i, 16) eq ("\xff" x 16)
    };
    print scalar localtime;
    Sun Oct 19 16:00:32 2003
    Sun Oct 19 16:01:21 2003

    The whole 'program' took maybe 3 or 4 minutes to write. Try doing that with C :-).

    The total memory usage for the program was a little over 110MB.

    The secret when handling large lumps of fixed length data, is not to break them up into an array of 6.5 million little strings. Instead, leaving the data in a single large string and just indexing into it using substr allows very memory and cpu efficient processing.

    You can then use index to search very quickly for fixed patterns, and you can even apply regexes to the 100MB as a whole, or to the individual fields using substr as an lvalue.

    Your C program may ultimately be a tad more efficient, but I bet it takes you longer to write.

    If you don't have 100MB of ram to spare, processing the file in one pass in a while loop, by setting $/= \(16); is also very fast, and if you need random access, seek and tell make this easy also.
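The one-pass approach with $/ set to a record size can be sketched as follows. This is only an illustration, not BrowserUk's code: find_records is a made-up helper name, and the demo runs on an in-memory filehandle so that it is self-contained. With a real file you would open it by path and call binmode in the same way.

```perl
use strict;
use warnings;

# Scan a filehandle in 16-byte records and return the byte offsets of
# every record equal to $pattern. Only one record is held in memory.
# (find_records is a hypothetical helper name.)
sub find_records {
    my ( $fh, $pattern ) = @_;
    my @offsets;
    local $/ = \16;    # readline now returns up to 16 bytes per call
    my $offset = 0;
    while ( defined( my $rec = <$fh> ) ) {
        push @offsets, $offset if $rec eq $pattern;
        $offset += length $rec;
    }
    return @offsets;
}

# Demo on an in-memory "file": three records, the middle one all 0xff.
my $data = ( "\x00" x 16 ) . ( "\xff" x 16 ) . ( "\x01" x 16 );
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
binmode $fh;
print join( ',', find_records( $fh, "\xff" x 16 ) ), "\n";    # prints 16
close $fh;
```

Because the records are read on chunk boundaries, a match can never straddle two chunks, so no alignment check is needed here.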


    Update Whilst the code above works well, a few minor tweaks make it run substantially quicker.

    First.

    my $file = do{ local $/=\(100*1024*1024); <FILE> };

    causes a peak memory usage substantially greater than is required. I can only assume that this is because an intermediate buffer is used somewhere. Whilst the extra memory is quickly returned to the OS (under Win anyway), this can be avoided by recoding it as

    my $file; { local $/ = \(100*1024*1024); $file = <FILE>; }

    This has the additional benefit that the load time for the 100MB, from a compressed file on a so-so speed disk, is cut from 15 seconds to 5 seconds.

    Additionally, whilst looping over the string with substr in 16-byte chunks was pretty quick, using index to search the whole string in a single pass is substantially quicker. An order of magnitude quicker in fact, at 4 seconds!

    This does mean that if a match is found, you would have to test the position returned by index modulo 16 to ensure the match didn't span a 16-byte boundary, but the performance gain makes the housekeeping worthwhile.

    #! perl -slw
    use strict;

    open F, '<', 'e:\100MB' or die $!;
    binmode F;
    print 'Slurping... ', scalar localtime;
    my $file;
    {
        local $/ = \(100*1024*1024);
        $file = <F>;
    }
    print 'Slurped. Searching...', scalar localtime;
    print 'Not found' unless index( $file, "\xff" x 16 );
    print 'Searched. ', scalar localtime;

    __END__
    P:\test>junk2
    Slurping... Mon Oct 20 00:29:21 2003
    Slurped. Searching...Mon Oct 20 00:29:26 2003
    Searched. Mon Oct 20 00:29:30 2003

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    Hooray!

      Thanks for your very insightful answer! PerlMonks is such a great place because people like you take the time to research questions and give great replies like this :-)

      A few points:

      My machine is Win2k on a P4 1800 MHz with 256 MB of memory.
      The file loading takes 22 seconds. The first search method (loop with substr) takes 2 minutes, the second takes 1 second. What concerns me though, is that neither reported that the sequence "wasn't found" in a completely random binary file (the probability of 0xff x 16 appearing in any given chunk is 2^-128, quite unlikely).
      I wonder about the speed differences... what makes my program run slower on a far stronger PC? A bad Perl implementation (ActivePerl 5.6)?

      Also, could you please elaborate on the use of the following line:

      $/ = \(100*1024*1024)

      I only used $/ as in "undef $/" to read a file wholly, or to set a record separator. What does your line do ?

      ----------- Update ------------

      I think I figured out the performance problem. My PC usually runs at 190 MB memory usage, which means that the 100MB slurped threw it into the "virtual realm", which naturally downgrades performance.
      Now I read as follows:

      until (eof(FH)) ... read(FH, $val, 128); ...
      And then just compare to the wanted value. The whole process (which now combines reading & testing) takes 28 seconds.

      By the way, I was wrong about another thing. Your first method (with substr) works correctly: it doesn't find the string. The second method (with index) does seem to find a string, or at least it thinks so, which is probably wrong (look at your test output, it's obvious there too).
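One likely explanation for the index result is the return value: index returns -1 when nothing is found, and -1 is a true value in Perl, so a bare `unless index(...)` test only prints when the match is at position 0. A small sketch on made-up data, comparing against -1 explicitly:

```perl
use strict;
use warnings;

# Made-up test data: two zero chunks followed by one all-0xff chunk.
my $haystack = ( "\x00" x 32 ) . ( "\xff" x 16 );
my $needle   = "\xff" x 16;

# index returns the 0-based position of the first match, or -1 if none.
# -1 is true and 0 is false, so compare against -1 explicitly.
my $pos = index( $haystack, $needle );
print $pos == -1 ? "Not found\n" : "Found at byte $pos\n";      # Found at byte 32

my $miss = index( $haystack, "\xaa" x 16 );
print $miss == -1 ? "Not found\n" : "Found at byte $miss\n";    # Not found
```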

        Sorry, I should have explained that.

        When slurping a complete file, I've found that it pays huge dividends to tell perl how big the file is by setting $/ = \(filesize);. This allows perl to pre-allocate the required memory in a single request to the OS, and then read the file in a single call to the OS.

        If you just set $/ = undef;, perl doesn't know how big a file it is about to read. It therefore allocates a pre-determined lump of memory (it appears to be about 16k on my system) and then reads to fill it. It then checks to see if there is more, extends the buffer by 16k and reads the next 16k chunk. With a 100MB file, that requires 6400 allocations/reallocations (with copying) and 6400 reads.

        Needless to say, this is extremely slow with very large files. I'd like to give a comparative figure here, but I've never had the patience to wait long enough for it to complete on my system. I set it going about 5 minutes before I started typing this reply and it still hasn't finished. A crude measure going by the task manager memory for the process suggests that after 8 minutes it has read around 25%. However, as the amount of memory required at each reallocation grows by 16k each time, so the amount of memory copied each time also grows (slowly). The effect is that the next 25% will take considerably longer than the first, and the next considerably longer again. And the last 25% longer still.

        This is one of those situations where giving perl a helping hand by supplying a little extra information has a huge performance bonus.
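A minimal sketch of the sized slurp, taking the size from the file itself with -s rather than hard-coding it. slurp_sized is a hypothetical helper name, and the demo file is a throwaway created with File::Temp. The timings in this thread date from 2003, so whether the gain over `undef $/` still holds on a current perl is something to measure, not assume.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Slurp a whole file as one fixed-size record, so perl knows the length
# up front. (slurp_sized is a hypothetical helper name.)
sub slurp_sized {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    my $size = -s $fh;
    return '' unless $size;    # a zero record length is not valid for $/
    local $/ = \$size;         # one "record" of exactly $size bytes
    my $data = <$fh>;
    close $fh;
    return $data;
}

# Demo: write 4096 bytes to a temporary file and read them back.
my ( $tmp, $name ) = tempfile( UNLINK => 1 );
binmode $tmp;
print {$tmp} "\xff" x 4096;
close $tmp;
print length( slurp_sized($name) ), "\n";    # prints 4096
```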


        With respect to using index. Remember that it is possible that it will find a pattern that doesn't actually exist in any of your 16-byte chunks. If the last N bytes of one chunk combine with the first (16-N) bytes of the next chunk to produce the pattern you are looking for, then index will find that combination and report success.

        It therefore becomes necessary to verify that a 16-byte chunk boundary isn't being straddled. This is as easy as

        my $p = -1;
        do {
            # resume the search one byte past the previous (unaligned) hit
            $p = index( $data, $bit_pattern, $p + 1 );
        } until( $p == -1 or $p % 16 == 0 );
        print "Chunk with bit_pattern: $bit_pattern found at: $p" unless $p == -1;

        That could be done better, but it shows what I mean.
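One possible way to tidy the straddling check described above is to wrap it in a function and, after an unaligned hit, jump straight to the next chunk boundary, since an aligned match cannot start before it. This is a sketch on made-up data. find_aligned is a hypothetical name, and it takes a reference so that a 100MB string is not copied on the call.

```perl
use strict;
use warnings;

# Return the offset of the first occurrence of $pattern in $$data_ref
# that starts on a $size-byte boundary, or -1 if there is none.
# (find_aligned is a hypothetical helper name.)
sub find_aligned {
    my ( $data_ref, $pattern, $size ) = @_;
    my $p = 0;
    while ( ( $p = index( $$data_ref, $pattern, $p ) ) != -1 ) {
        return $p if $p % $size == 0;
        $p += $size - $p % $size;    # skip ahead to the next boundary
    }
    return -1;
}

my $ff = "\xff" x 16;
# Pattern present only across a boundary (bytes 8..23): not a real hit.
my $straddle = ( "\x00" x 8 ) . $ff . ( "\x00" x 8 );
# Pattern fills the second chunk (bytes 16..31).
my $aligned = ( "\x00" x 16 ) . $ff;

print find_aligned( \$straddle, $ff, 16 ), "\n";    # prints -1
print find_aligned( \$aligned,  $ff, 16 ), "\n";    # prints 16
```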



        read(FH, $val, 128);

        Please keep in mind that read FILEHANDLE,SCALAR,LENGTH

        'Attempts to read LENGTH bytes of data ...' (emphasis mine - from perldoc -f read)

        You mentioned that you're working with 128 bit 'frames', so you'll want to do 16 byte reads in order to get a 'frame'.

        --
        3dan
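To illustrate the point about LENGTH being in bytes, here is a hedged sketch of a read loop that asks for 16 bytes per frame and checks what read actually returned. each_frame and FRAME_BYTES are made-up names, and the demo uses an in-memory filehandle in place of the real data file.

```perl
use strict;
use warnings;

use constant FRAME_BYTES => 16;    # 128 bits

# Call $callback->($frame, $index) for each complete 16-byte frame.
# Returns the number of trailing bytes that did not fill a frame.
# (each_frame is a hypothetical helper name.)
sub each_frame {
    my ( $fh, $callback ) = @_;
    my $index = 0;
    while (1) {
        my $got = read( $fh, my $frame, FRAME_BYTES );
        die "Read error: $!" unless defined $got;
        return 0    if $got == 0;             # clean end of file
        return $got if $got < FRAME_BYTES;    # short final read
        $callback->( $frame, $index++ );
    }
}

# Demo: two full frames followed by three stray bytes.
my $data = ( "\xff" x 16 ) . ( "\x00" x 16 ) . "abc";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $frames = 0;
my $left = each_frame( $fh, sub { $frames++ } );
print "$frames frames, $left leftover bytes\n";    # prints "2 frames, 3 leftover bytes"
close $fh;
```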

        $/ = \$number; (note the backslash: a reference to a number) tells perl to read that number of bytes at a time instead of lines, so 100*1024*1024 = 100 megs.
Re: is perl the best tool for this job ?
by edan (Curate) on Oct 19, 2003 at 14:22 UTC

    I would think you could use read to read in 128-bit records, and use unpack to 'process' them, and you'll probably find that it's fast enough. As for whether to read it record-by-record, or all-at-once, that probably depends on your needs, and your system's capabilities. 100MB is really not so huge, but then again, for all I know you're running on a Commodore-64, so you might have problems slurping 100MB into core...

    --
    3dan
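As a rough illustration of the unpack suggestion, the sketch below splits one 16-byte record into four 32-bit big-endian words. The field layout is an assumption made for the example, since the thread never says how the 128 bits are structured, and frame_words is a hypothetical name.

```perl
use strict;
use warnings;

# Split one 16-byte frame into four unsigned 32-bit big-endian words.
# The layout is a made-up example, not the poster's real format.
sub frame_words {
    my ($frame) = @_;
    die "Frame must be 16 bytes" unless length($frame) == 16;
    return unpack 'N4', $frame;
}

my @w = frame_words(
    "\x00\x00\x00\x01" . "\xff\xff\xff\xff" . "\x12\x34\x56\x78" . "\x00\x00\x01\x00" );
printf "%08x %08x %08x %08x\n", @w;    # prints 00000001 ffffffff 12345678 00000100
```

Other templates cover other layouts, for example 'V4' for little-endian words or 'C16' for individual bytes.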

Re: is perl the best tool for this job ?
by pg (Canon) on Oct 19, 2003 at 16:05 UTC

    I don't know how big your project is. If it is considerably big, it could be a wise decision to use multiple different languages. Of course, your company needs good programmers for each language, to be able to deliver this kind of design, and you need good designers knowing which language can deliver which section the best, and how to glue them together.

    You have to analyze how many components your system will have, and how tightly they are related to each other. Those seriously affect whether multi-language is a good choice, and how each component can be put together, so they deliver the best.

    For what you want to do, most of the main languages deliver all that you require. It is not a difficult decision to make, and it is not a bad decision to use Perl.

    I have used Perl to process huge log files from our production systems, and never saw a problem with it, either in memory usage or speed (seriously, it is amazing that there was absolutely no speed problem).

    It is not very clear whether you need to get into those bits. It is not a problem for Perl to handle bits, but in this sense, C would be a better choice (only as regards looking into bits). Looking into bits is quite different from looking into bytes.
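For what it is worth, looking at individual bits of a 128-bit chunk in Perl can be sketched with vec or unpack. The MSB-first bit numbering below is an assumption chosen for the example, and bit_at is a hypothetical helper name.

```perl
use strict;
use warnings;

# Return bit $n (0 = most significant bit of the first byte) of $chunk.
# vec numbers bits from the low end of each byte, so flip the bit index
# within the byte to get MSB-first numbering.
sub bit_at {
    my ( $chunk, $n ) = @_;
    return vec( $chunk, ( $n & ~7 ) | ( 7 - ( $n & 7 ) ), 1 );
}

# Made-up chunk: top bit of the first byte and low bit of the last byte set.
my $chunk = "\x80" . ( "\x00" x 14 ) . "\x01";
print bit_at( $chunk, 0 ),   "\n";     # prints 1
print bit_at( $chunk, 1 ),   "\n";     # prints 0
print bit_at( $chunk, 127 ), "\n";     # prints 1
print unpack( 'B8', $chunk ), "\n";    # prints 10000000
```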

    As for the GUI, it depends on how complex it is. Tk has no problem delivering, but be careful: Tk code is extremely difficult to lay out and maintain. I tried Tk in one of my projects, and to be frank I was not very impressed. Java is one of the good choices for GUI.

    Last but not least (absolutely not least), if you need to do it quickly, go with Perl, for sure.

      As for the GUI, it depends on how complex it is. Tk has no problem delivering, but be careful: Tk code is extremely difficult to lay out and maintain. I tried Tk in one of my projects, and to be frank I was not very impressed. Java is one of the good choices for GUI.
      How do you figure? Swing and AWT are a mess. You need lines and lines of code to do what you can do in just a few lines in Tk. I'd much rather work with Tk's geometry managers than write all sorts of code to lay out a Swing form. Maintenance of Tk isn't "extremely difficult." If you break down your windows into classes, maintenance becomes very simple.

      The computer can't tell you the emotional story. It can give you the exact mathematical design, but what's missing is the eyebrows. - Frank Zappa
Re: is perl the best tool for this job ?
by PodMaster (Abbot) on Oct 19, 2003 at 14:37 UTC
    I'm not sure if it's of interest to you, but check out PDL:
    PDL ("Perl Data Language") gives standard Perl the ability to compactly store and speedily manipulate the large N-dimensional data arrays which are the bread and butter of scientific computing.

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

Re: is perl the best tool for this job ?
by jdtoronto (Prior) on Oct 19, 2003 at 17:56 UTC
    I'm with BrowserUk on this one. Unless you need the sophistication of PDL on this, and it wouldn't seem that you do, then the read-it-all-in approach is really a good one.

    One of my clients collects huge amounts of data. After it is about 6 months old, the chance of any single record being required again is about 1 in 10^6 per year. So we archive the data in pipe-delimited files (more storage efficient than CSV) and store it on an archive server. Files will be from 20 to 800 MB each.

    When I took over managing their system about 4 years ago they were using "find text in a file" from the Windows Explorer to look into the archives. I asked how long it takes to search and they said, oh, between 16 and 48 hours! Oops! So I did a very similar thing to the suggestion from BrowserUk and found we could search the archive drive in 2 to 3 hours. The beauty is that I can run about 20 searches concurrently before we see serious degradation (it is a quad Xeon box now running Linux, which is vastly faster than the W2K server it was on until about 12 months ago) and the operators have a little Tk window which sends a request to the server, which does the search and sends them back an email when it is done. They love it! A total of about a day's work.

    IMHO - this is a wonderful task for Perl.

    jdtoronto

Re: is perl the best tool for this job ?
by jonadab (Parson) on Oct 19, 2003 at 21:43 UTC

    Perl _excels_ at looking for patterns in large amounts of data. There's nothing Perl's better at, and no language that's better at it.

    The one thing you mention that's moderately hard in Perl is the GUI. However, your description of the problem makes it sound as if the data manipulation is going to be the bulk of the app and the GUI just has to let the user tell the program what data to manipulate and what to do with it, and may be fairly simple. If that is the case, Perl is an excellent choice for this project.


    $;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/
Re: is perl the best tool for this job ?
by etj (Chaplain) on Jun 26, 2022 at 01:35 UTC
    Thought from THE FUTURE: MetaCPAN (the best and now, only, way to look up and/or search for things on CPAN) uses ElasticSearch (written in Java) for text-searching - in other words, use the best tool for the job.

    Regarding PDL, I recently cribbed some code in order to add PDL support for STL ("Stereolithography", a lingua franca format for 3D printing in 2022). When I switched from a Perl loop reading a chunk of bytes then using unpack, to making an ndarray where it sliced the relevant parts of each chunk out and treated those as single-precision data, the data-processing more than doubled in speed. In other words, PDL's data-processing is (not surprisingly) faster than an interpreted language's processing, even using the optimised unpack functionality.

Node Type: perlquestion [id://300379]
Approved by Corion