PerlMonks
Re: performance of File Parsing

by roboticus (Chancellor)
on Jul 07, 2011 at 10:55 UTC ( [id://913159] )


in reply to performance of File Parsing

Based on your question, I'd propose a different method:

Method 3:

Parse the data as it's available: You can use File::Tail to open the file and read the data (even while another program is generating the file). This allows you to continuously read / parse / write. Thus, you can begin processing your data before you have the full terabyte.

For example, suppose we use the following to generate a stream of data:

    #!/usr/bin/perl
    # stream_write.pl - Slowly generate data
    use strict;
    use warnings;

    open my $OFH, '>', 'the_stream.dat' or die $!;
    binmode($OFH, ":unix");
    my $cnt = 0;
    while ($cnt < 100) {
        ++$cnt;
        my $cur_time = time;
        print $OFH "$cnt, $cur_time\n";
        sleep 5*rand;
    }
    close $OFH;

Then we can use something like this to read and parse the data while the original is running:

    #!/usr/bin/perl
    # stream_read.pl - Read, parse & print data as it's available
    use strict;
    use warnings;
    use File::Tail;

    my $IFH = File::Tail->new(
        name => "the_stream.dat",
        tail => -1,             # Start at the beginning
    );
    while (defined(my $line = $IFH->read)) {
        chomp $line;
        my $cur_time = time;
        my ($cnt, $old_time) = split /,\s*/, $line;
        print "$cur_time data: $cnt, $old_time\n";
    }

Then, when I ran them, the output of stream_read.pl was:

    $ perl stream_read.pl
    1310035999 data: 1, 1310035990
    1310035999 data: 2, 1310035992
    1310035999 data: 3, 1310035992
    1310035999 data: 4, 1310035994
    1310035999 data: 5, 1310035994
    1310035999 data: 6, 1310035998
    1310035999 data: 7, 1310035999
    1310035999 data: 8, 1310035999
    1310035999 data: 9, 1310035999
    1310036001 data: 10, 1310036000
    1310036001 data: 11, 1310036000
    1310036008 data: 12, 1310036004
    1310036014 data: 13, 1310036008
    1310036014 data: 14, 1310036010
    .....

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Re^2: performance of File Parsing
by Anonymous Monk on Jul 07, 2011 at 11:26 UTC

    Thanks for the reply.

    But I don't mean parsing a continuously updating file.

    My concern is a file with lakhs of records, with fields separated by semicolons. I need to parse each record, separate the fields, do some calculations, and, for records that satisfy a condition, save the results into different files.

    I also need to aggregate certain fields across different records that satisfy some condition; for this I build a hash and, at the end of the file, perform the aggregation and write it to a file.

    This process takes 3 hours for 10 lakh records, so I need to optimize it. What I can't tell is whether the time goes into reading the terabyte file line by line, or into holding the content in memory (the hash) and writing it out at the end.
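    A minimal one-pass sketch of the workflow just described: split each semicolon-separated line, write out matching records as they are seen, and keep only running per-category sums in the hash instead of whole records. The field layout (id;category;value), the threshold, and the inline sample data are all invented for illustration:

```perl
#!/usr/bin/perl
# One-pass parse / filter / aggregate sketch.
# Hypothetical layout: id;category;value (sample data inlined below).
use strict;
use warnings;

my $sample = <<'END';
1;a;50
2;b;150
3;a;75
4;b;200
END

my %total_by_category;   # running sums only, not whole records
my @matches;             # records satisfying the condition

open my $IN, '<', \$sample or die $!;   # in real use: open the data file
while (my $line = <$IN>) {
    chomp $line;
    my ($id, $category, $value) = split /;/, $line;
    next unless defined $value;

    # Example condition: record the match as soon as it is seen ...
    push @matches, "$id;$value" if $value > 100;

    # ... and fold the field into a running aggregate immediately,
    # so memory stays proportional to the number of categories.
    $total_by_category{$category} += $value;
}
close $IN;

print "matches: @matches\n";
print "$_ total: $total_by_category{$_}\n" for sort keys %total_by_category;
```

    In real use the matches and totals would go to output files, but the point is the shape: nothing except the running sums has to survive to the end of the terabyte.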

      What’s killing you, then, is “that enormous hash.”   You need to replace that logic.

      If you were to plot the throughput of this program, it would describe a nice, exponential curve.   When it reaches the “thrash point,” it smashes into the wall and drops dead.   That’s my blindfolded prediction, but I’ll bet I’m right on the money.

      I suggest stuffing the whole thing into an SQLite database (flat-file), and using queries (within transactions).
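      A minimal sketch of that approach using DBI, assuming DBD::SQLite is available; the table layout, field names, and sample rows are invented for illustration. The essential points are batching inserts inside a transaction and letting SQL do the grouping instead of an in-memory hash:

```perl
#!/usr/bin/perl
# Sketch: load semicolon-separated records into SQLite, aggregate via SQL.
# Assumes DBD::SQLite; schema and data are hypothetical.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });
$dbh->do('CREATE TABLE records (id INTEGER, category TEXT, value REAL)');

my $ins = $dbh->prepare('INSERT INTO records VALUES (?, ?, ?)');
$dbh->begin_work;    # one transaction per batch keeps inserts fast
for my $line ("1;a;50", "2;b;150", "3;a;75", "4;b;200") {
    $ins->execute(split /;/, $line);
}
$dbh->commit;

# The end-of-file aggregation hash becomes a single query:
my $totals = $dbh->selectall_arrayref(
    'SELECT category, SUM(value) FROM records
     GROUP BY category ORDER BY category');
printf "%s total: %s\n", @$_ for @$totals;
```

      For a terabyte of input one would use an on-disk database file rather than `:memory:`, and commit every few thousand rows.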

      “Don’t ‘diddle’ the code to make it faster ... find a better algorithm.”
      – Kernighan & Plauger; The Elements of Programming Style.

      Maybe you should show us the key component of your code as a small stand-alone script, along with a very small sample of data, just sufficient to demonstrate what your code does. We can help you much more when we know exactly what you are trying to do than when we have to toss up straw men to pitch at.

      True laziness is hard work
