
performance of File Parsing

by Anonymous Monk
on Jul 07, 2011 at 08:14 UTC ( #913148=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have a terabyte file to parse, and I want to know which of the two methods below gives better performance.

Method 1:
Open the file, read it line by line, do some calculation on each line, and write the result to another file; once all lines are processed, close the file.
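Method 1 might look like the following minimal sketch. The file names and the per-line calculation are placeholders (a tiny sample input is created first so the sketch runs standalone):

```perl
#!/usr/bin/perl
# Method 1, sketched: read line by line, compute, write as you go.
use strict;
use warnings;

# Create a tiny sample input so the sketch runs standalone.
open my $T, '>', 'input.dat' or die $!;
print $T "alpha;1\nbeta;22\n";
close $T;

open my $IN,  '<', 'input.dat'  or die "input: $!";
open my $OUT, '>', 'output.dat' or die "output: $!";
while (my $line = <$IN>) {          # only one line in memory at a time
    chomp $line;
    my $result = length $line;      # stand-in for the real calculation
    print $OUT "$result\n";
}
close $IN;
close $OUT;
```

Only the current line is held in memory, so memory use stays constant regardless of file size.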

Method 2:
Open the file, read the entire contents into one array, and close the file; then read each line from the array, do the calculation, and write the result to the output file.

Please tell me which is the best way to parse the file, or whether there is some other way that performs better.

Sorry for my bad English.

Replies are listed 'Best First'.
Re: performance of File Parsing
by BrowserUk (Patriarch) on Jul 07, 2011 at 08:45 UTC
    I have the tera byte file ...

    Method 1: Will work.

    Method 2: Won't work. (Or will be horribly slow.)

    Although a few of the latest 64-bit processors can theoretically address 1 TB of memory, most motherboards are limited to much less. Even top-end SMP and NUMA boards and cards max out at 64/128/256 GB of physical memory.

    Whilst it is possible to use swap files to extend the virtual memory available to a process into the TB range, the effect on performance is dire. Instead of reading once, processing, and writing the result, you (minimally) end up with: read from file, write to swap, read from swap, process, write result to disk. I.e. you must do four I/O operations instead of two, which will at least double your processing time, and usually much worse.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: performance of File Parsing
by roboticus (Chancellor) on Jul 07, 2011 at 10:55 UTC

    Based on your question, I'd propose a different method:

    Method 3:

    Parse the data as it's available: You can use File::Tail to open the file and read the data (even while another program is generating the file). This allows you to continuously read / parse / write. Thus, you can begin processing your data before you have the full terabyte.

    For example, suppose we use the following to generate a stream of data:

    #!/usr/bin/perl
    # - Slowly generate data
    use strict;
    use warnings;

    open my $OFH, '>', 'the_stream.dat' or die $!;
    binmode($OFH, ":unix");
    my $cnt = 0;
    while ($cnt < 100) {
        ++$cnt;
        my $cur_time = time;
        print $OFH "$cnt, $cur_time\n";
        sleep 5*rand;
    }
    close $OFH;

    Then we can use something like this to read and parse the data while the original is running:

    #!/usr/bin/perl
    # - Read, parse & print data as it's available
    use strict;
    use warnings;
    use File::Tail;

    my $IFH = File::Tail->new(
        name => "the_stream.dat",
        tail => -1,             # Start at the beginning
    );
    while (defined(my $line = $IFH->read)) {
        chomp $line;
        my $cur_time = time;
        my ($old_time, $cnt) = split /,\s*/, $line;
        print "$cur_time data: $old_time, $cnt\n";
    }

    Then, when I ran them, the output was:

    $ perl
    1310035999 data: 1, 1310035990
    1310035999 data: 2, 1310035992
    1310035999 data: 3, 1310035992
    1310035999 data: 4, 1310035994
    1310035999 data: 5, 1310035994
    1310035999 data: 6, 1310035998
    1310035999 data: 7, 1310035999
    1310035999 data: 8, 1310035999
    1310035999 data: 9, 1310035999
    1310036001 data: 10, 1310036000
    1310036001 data: 11, 1310036000
    1310036008 data: 12, 1310036004
    1310036014 data: 13, 1310036008
    1310036014 data: 14, 1310036010
    .....


    When your only tool is a hammer, all problems look like your thumb.

      Thanks for the reply.

      But I did not mean parsing a continuously updating file.

      My concern is this: the file has lakhs (hundreds of thousands) of records, each with fields separated by semicolons. I need to parse each record, separate the fields, do some calculation, and, where a condition is satisfied, save the result into different files.

      I also need to aggregate certain fields across the records that satisfy a condition. For this I build a hash while reading, and at the end of the file I do the aggregation and write it out.

      This process takes 3 hours for 10 lakh (one million) records, so I need to optimize it. I cannot tell which part is taking the time: reading the terabyte file line by line, or holding the content in memory (the hash) and writing it out at the end.
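The workflow described above might be sketched like this; the field layout, the filter condition, and the aggregation key are all assumptions for illustration:

```perl
#!/usr/bin/perl
# Sketch of the described workflow: split each semicolon-separated
# record, filter on a condition, and aggregate into a hash that is
# written out at the end.
use strict;
use warnings;

# Sample records standing in for lines read from the real file.
my @records = ("A;10;x", "B;5;y", "A;7;x");

my %total;
for my $line (@records) {
    my ($key, $value, $flag) = split /;/, $line;   # assumed field layout
    next unless $flag eq 'x';                      # assumed filter condition
    $total{$key} += $value;                        # per-key aggregation
}
for my $key (sort keys %total) {
    print "$key;$total{$key}\n";
}
```

Note that %total grows with the number of distinct keys, not the number of records; if the key set is itself huge, the hash becomes the memory bottleneck discussed below.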

        What’s killing you, then, is “that enormous hash.”   You need to replace that logic.

        If you were to plot the throughput of this program, it would describe a nice, exponential curve.   When it reaches the “thrash point,” it smashes into the wall and drops dead.   That’s my blindfolded prediction, but I’ll bet I’m right on the money.

        I suggest stuffing the whole thing into an SQLite database (flat-file), and using queries (within transactions).
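That approach might be sketched as follows with DBI and DBD::SQLite; the table and column names are invented for illustration, and an in-memory database stands in for the flat file:

```perl
#!/usr/bin/perl
# Sketch: load records into SQLite, then aggregate with a SQL GROUP BY
# instead of an in-memory Perl hash.  Table/column names are illustrative.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do('CREATE TABLE rec (k TEXT, v INTEGER)');

my $ins = $dbh->prepare('INSERT INTO rec (k, v) VALUES (?, ?)');
for my $line ("A;10", "B;5", "A;7") {      # stand-in for the file-reading loop
    my ($k, $v) = split /;/, $line;
    $ins->execute($k, $v);
}
$dbh->commit;                              # one transaction for the batch

my $rows = $dbh->selectall_arrayref(
    'SELECT k, SUM(v) FROM rec GROUP BY k ORDER BY k');
print "$_->[0];$_->[1]\n" for @$rows;
$dbh->disconnect;
```

For a real terabyte load you would point dbname at a file on disk and commit in batches (say, every 10,000 inserts) rather than once at the end.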

        “Don’t ‘diddle’ the code to make it faster ... find a better algorithm.”
        – Kernighan & Plauger; The Elements of Programming Style.

        Maybe you should show us the key component of your code as a small stand-alone script, with a very small sample of data that is just sufficient to demonstrate what your code does. We can help you much more if we know exactly what you are trying to do than when we have to toss up straw men to pitch at.

        True laziness is hard work
Re: performance of File Parsing
by pklausner (Scribe) on Jul 07, 2011 at 14:21 UTC
Re: performance of File Parsing
by tweetiepooh (Hermit) on Jul 07, 2011 at 11:23 UTC
    Or combine methods and read the file in chunks and process the chunks.
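One way to sketch that combination: read fixed-size chunks, carry any partial trailing line over to the next chunk, and process only complete lines. The chunk size here is tiny purely for demonstration:

```perl
#!/usr/bin/perl
# Sketch: read the file in fixed-size chunks, carrying an incomplete
# trailing line over to the next chunk so only whole lines are processed.
use strict;
use warnings;

# Tiny sample file so the sketch runs standalone.
open my $T, '>', 'chunked.dat' or die $!;
print $T "one\ntwo\nthree\n";
close $T;

open my $IN, '<', 'chunked.dat' or die $!;
my $tail = '';
my @lines;
while (read($IN, my $chunk, 8)) {     # 8 bytes per read, for demonstration
    $chunk = $tail . $chunk;
    my @parts = split /\n/, $chunk, -1;
    $tail = pop @parts;               # possibly incomplete last line
    push @lines, @parts;              # process complete lines here
}
push @lines, $tail if length $tail;   # flush a final unterminated line
close $IN;
print "$_\n" for @lines;
```

In practice the chunk size would be something like 64 KB or more, trading a little memory for far fewer read calls than line-by-line I/O.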
Re: performance of File Parsing
by NiJo (Friar) on Jul 11, 2011 at 18:46 UTC
    Method 3:

    With simple, line-based calculations like filtering, counting, etc., you could become I/O-bound rather than CPU-bound. Assuming an output file of similar TB size, much time is wasted on disk seeks. Just look at and listen to the disk drive: noisy sounds and flashing lights are clear indicators of seek problems.

    Two separate physical disks avoid that. For less than an hour's salary your problem could be gone. But do your benchmarks first!

Node Type: perlquestion [id://913148]
Approved by BrowserUk
Front-paged by Corion