scottstef has asked for the wisdom of the Perl Monks concerning the following question:

When writing code, which is the preferred way of handling data? For example, if I were going to parse through log files and then format the data into a report, which is more efficient on the system:

1.) Grab all the data I need and place it in a temp file and then go back and do the formatting?
2.) Grab the data into a memory buffer and process each bit individually?
3.) Grab all of the data into memory and then batch process it all to the format I want.

Which is going to place the least stress on the system? Or what is the preferred manner of doing something such as this? I would tend to think that processing each line of data would be the least stressful on the system. Just wanted to know what everyone else thought.

Replies are listed 'Best First'.
Re: Questions about efficiency
by grinder (Bishop) on Mar 13, 2001 at 20:11 UTC

    If the information is record oriented, and if you don't need to see all the data before starting to process it (hint: think sorting), then the least stressful way on the system is to read a line, deal with it, read a line, deal with it.

    This has the smallest memory footprint.
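    A minimal sketch of that line-at-a-time pattern; the "host status bytes" field layout is made up for illustration:

```perl
use strict;
use warnings;

# Sum the bytes column per host, holding one record in memory at a time.
# Assumes whitespace-separated "host status bytes" lines (made-up format).
sub bytes_per_host {
    my ($fh) = @_;
    my %total;
    while (my $line = <$fh>) {
        chomp $line;
        my ($host, $status, $bytes) = split ' ', $line;
        next unless defined $bytes;    # skip malformed lines
        $total{$host} += $bytes;
    }
    return \%total;
}

# Typical use: open the log and stream it through the sub.
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "can't open $ARGV[0]: $!";
    my $total = bytes_per_host($fh);
    printf "%-20s %12d\n", $_, $total->{$_} for sort keys %$total;
}
```

    However big the log gets, the working set stays one line plus the running totals.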

    If you need to see all of the data before processing any of it (e.g., comparing a value to the arithmetic mean) then try to save only what you need. Rather than checking if a record meets your requirements and then saving it out, reformat it so that you only save what you need, in the format most efficient for subsequent treatment (hint: think epoch seconds).
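    For instance, a timestamp stored as epoch seconds can be compared and sorted with plain numeric operators in later passes. A sketch using Time::Local, assuming a "YYYY-MM-DD HH:MM:SS" stamp in UTC (the format is an assumption):

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Convert a "YYYY-MM-DD HH:MM:SS" stamp (assumed UTC) to epoch seconds.
sub stamp_to_epoch {
    my ($stamp) = @_;
    my ($y, $mo, $d, $h, $mi, $s) =
        $stamp =~ /^(\d{4})-(\d\d)-(\d\d) (\d\d):(\d\d):(\d\d)\z/
        or die "bad timestamp: $stamp";
    return timegm($s, $mi, $h, $d, $mo - 1, $y);    # month is 0-based
}
```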

    If you have little data then you may be able to get away with stashing it in a hash, but sooner or later you will hit a big file, you will eat all your swap and your system will die a horrible lingering death.

    You are better off writing it to another file, and then rereading it in again. On a lightly loaded machine with a modern operating system, most of the file will remain floating around in RAM anyway, so it won't be all that slow to read it back in again.
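    A two-pass sketch of that idea using File::Temp; again the column layout is made up:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Pass 1: reduce each input line to the two fields we need (the column
# layout is made up) and spool them to a temporary file.
sub spool_reduced {
    my ($in_fh) = @_;
    my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
    while (my $line = <$in_fh>) {
        my ($host, $bytes) = (split ' ', $line)[0, 2];
        print {$tmp_fh} "$host $bytes\n" if defined $bytes;
    }
    close $tmp_fh or die "close: $!";
    return $tmp_name;
}

# Pass 2: reread the (much smaller) spool file to build the report.
sub report_from_spool {
    my ($tmp_name) = @_;
    open my $in, '<', $tmp_name or die "reopen $tmp_name: $!";
    my %total;
    while (my $rec = <$in>) {
        my ($host, $bytes) = split ' ', $rec;
        $total{$host} += $bytes;
    }
    return \%total;
}
```

    Since the spool file was written moments earlier, the second pass is usually served straight from the OS buffer cache.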

    update: Code Smarter is on the same wavelength. In fact, the points made there are pretty well the same points I made in my lightning talk @ YAPC::Europe 2000.

Re: Questions about efficiency
by arhuman (Vicar) on Mar 13, 2001 at 18:45 UTC
    It heavily depends on your constraints...
    For example:
    For small files, 3) would be the best for me.
    For huge files, or when you're not sure about the file size, 2) is the safest...
    To be honest I think that your method 1) is only a subcase of method 2)... (but I may be missing something)
Re: Questions about efficiency
by tune (Curate) on Mar 13, 2001 at 20:08 UTC
    I recommend using a fast database to store your data temporarily. MySQL is perfect for this purpose: it is very fast, you can run very nice queries on the collected data, and you don't have to use a lot of memory in your script.
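    A rough DBI sketch of that approach; the connection details and the `hits` table are placeholders you would set up beforehand (it assumes DBD::mysql is installed):

```perl
use strict;
use warnings;
use DBI;

# Placeholder connection details; adjust database/user/password for your setup.
my $dbh = DBI->connect('DBI:mysql:database=logs', 'user', 'pass',
                       { RaiseError => 1 });

# Load the reduced records into a pre-created "hits" table.
my $ins = $dbh->prepare(
    'INSERT INTO hits (host, status, bytes) VALUES (?, ?, ?)');
while (my $line = <STDIN>) {
    my ($host, $status, $bytes) = split ' ', $line;
    $ins->execute($host, $status, $bytes) if defined $bytes;
}

# Let the database do the aggregation instead of your script.
my $rows = $dbh->selectall_arrayref(
    'SELECT host, SUM(bytes) FROM hits GROUP BY host');
printf "%-20s %12d\n", @$_ for @$rows;
$dbh->disconnect;
```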

    -- tune

      If you don't already have a DBMS installed on your system, or if you don't want to learn DBI, you'll be able to get your code up and running quicker by using one of Perl's DBM modules.

      AnyDBM_File should work on any machine that runs Perl.
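      A minimal sketch of the tied-hash pattern: the hash lives in an on-disk DBM file rather than in memory, so it survives big inputs (the log format is made up):

```perl
use strict;
use warnings;
use Fcntl;          # for O_RDWR, O_CREAT
use AnyDBM_File;

# Count hits per host in a hash tied to an on-disk DBM file, so the
# counts live on disk instead of in the process's memory.
sub count_hosts {
    my ($in_fh, $db_file) = @_;
    tie my %count, 'AnyDBM_File', $db_file, O_RDWR | O_CREAT, 0644
        or die "can't tie $db_file: $!";
    while (my $line = <$in_fh>) {
        my ($host) = split ' ', $line;
        $count{$host}++ if defined $host;
    }
    untie %count;    # flush everything to disk
}
```

      Tie the same file again later to read the counts back when you format the report.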