in reply to How to improve speed of reading big files

First off, figure out how fast you could possibly go, given the amount of data you have:
sub checkReadTime {    # call this with just your list of files
    my $linecount = 0;
    my $starttime = time;
    for my $file ( @_ ) {
        my $fh = &openLogFile($file) or next;
        while ( <$fh> ) { $linecount++; }
        close $fh;
    }
    my $endtime = time;
    warn sprintf( "read %d lines from %d files in %d sec\n",
        $linecount, scalar @_, $endtime - $starttime );
}
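For example, call it with the same list of file names your real app processes (the glob pattern below is just a placeholder for however you currently build that list):

# time a bare read-only pass over the same files the real app will see
my @files = glob('logs/*.log');   # placeholder -- build the list however you do now
checkReadTime(@files);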
The difference between the duration reported there and the duration of your current app is the upper bound on how much better you might be able to do.

Apart from that, if you have a serious problem with how long it's taking, maybe you should be doing more with the standard compiled unix tools that do things like sorting. For instance, you could put your filtering step into a separate script, pipe its output to "sort", and pipe the output from sort into whatever script is doing the rest of the work on the sorted lines.

If your app isn't geared to a pipeline operation like this:

filterLogFiles file.list | sort | your_main_app
then just put the first part of that into an open statement inside your main app:
open( my $logsort, "-|", "filterLogFiles @files | sort" ) or die $!;
while ( <$logsort> ) {
    ...
}
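(With the "-|" open mode, Perl forks the command for you and hands your script a read handle on its output, so the read loop itself stays the same as when it read the files directly.)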
Of course, you can include option args for your filterLogFiles process so that it can skip the lines that you don't need depending on dates or whatever, and have it output the data in a form that will be easiest to sort (and easy for your main app to digest).
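Purely for illustration, here is a minimal sketch of what such a filterLogFiles script could look like. The --after option, the date-first line layout, and the tab-separated output are assumptions standing in for whatever your real data needs:

#!/usr/bin/perl
# filterLogFiles -- keep only the log lines we care about and print them
# in a sort-friendly form (sort key first, tab-separated).
use strict;
use warnings;
use Getopt::Long;

my $after = '';    # hypothetical option: drop lines dated before this (YYYY-MM-DD)
GetOptions( 'after=s' => \$after )
    or die "usage: $0 [--after YYYY-MM-DD] logfile ...\n";

for my $file (@ARGV) {
    my $fh;
    unless ( open $fh, '<', $file ) {
        warn "$file: $!\n";
        next;
    }
    while (<$fh>) {
        # assumed layout: "YYYY-MM-DD HH:MM:SS rest of record"
        my ( $date, $time, $rest ) = split ' ', $_, 3;
        next unless defined $rest;
        next if $after && $date lt $after;    # string compare works for ISO dates
        print join( "\t", $date, $time, $rest );    # $rest still ends with "\n"
    }
    close $fh;
}

Because the sort key comes first on every output line, a plain "sort" on the filtered stream does the right thing with no extra options.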

Re^2: How to improve speed of reading big files
by djp (Hermit) on Sep 18, 2009 at 06:03 UTC
    I can second the idea of using standard Unix tools. In particular, awk(1) can deliver huge performance improvements over Perl. We had a text-file processing app that went from 20 minutes to 20 seconds when we recoded it in awk. YMMV of course.
      This is unbelievable! Most likely your awk guys didn't know how to write Perl very well. They coded awk more efficiently because that is what they knew how to do.
        No, it's not unbelievable at all. It was a Perl expert who recoded the app in awk. We don't have 'awk guys'; does anyone?
Re^2: How to improve speed of reading big files
by Marshall (Canon) on Sep 18, 2009 at 05:19 UTC
    I like graff's benchmark idea. I'd like to point out something that will hopefully happen: if you run the benchmark a second time, it may speed up!

    I don't know what OS the OP has or how big total file size is, but there is a level of caching that the OS does that Perl doesn't even see.

    As an example, I have one app that works with a dataset of 700 text files. The user can use a search window to find various things in this dataset. The first search takes about 7 seconds. The second search is so fast that it is barely perceptible to the user (>10x performance). The reason things get faster is that my machine has lots of memory that the OS can use for file cache. The OS knows that these files haven't been modified, so there is very little disk access going on. I am running on WinXP, which isn't exactly famous for performance or "smarts".

    The things that can affect I/O performance are legion, and I have very little information to go on in the OP's case. Anyway, in this particular app I quit optimizing because by user query #8, they have already forgotten that query #1 took a lot longer!