in reply to Most efficient way to build output file?

It takes 3 seconds to read 8 MB of binary data in 1034-byte chunks, unpack it to ASCII, and write it out as a CSV file.
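For reference, a minimal sketch of such a read/unpack/write loop. The file names and the pack template ('v*', little-endian 16-bit integers) are assumptions; the OP's actual record format isn't shown here.

    use strict;
    use warnings;

    # Hypothetical layout: fixed 1034-byte records. Adjust the unpack
    # template to match the real binary format.
    open my $in,  '<:raw', 'input.bin'  or die "input.bin: $!";
    open my $out, '>',     'output.csv' or die "output.csv: $!";

    while (read $in, my $chunk, 1034) {
        my @fields = unpack 'v*', $chunk;      # little-endian uint16s
        print {$out} join(',', @fields), "\n";
    }

    close $out;
    close $in;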

Do you have a particular reason for needing to do this more quickly?


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon

Re^2: Most efficient way to build output file?
by thor (Priest) on Jul 27, 2004 at 03:24 UTC
    It takes 3 seconds...
    ...on your system. He might be running this on an old Ultra 2 with 16 KB of RAM. Who knows? Plus, this will hopefully not be the OP's last foray into Perl, so it's best to know what the "best practices" are in the general case.

    thor

      Point taken. Although, with 16 KB of RAM it's doubtful that perl would even run, never mind have enough room for extra buffering :)

      I wasn't trying to be dismissive. If he genuinely has a problem with the performance he is seeing (say, he's dealing with a slow NFS mount, a network share, or similar), then that information might prompt a better alternative. Hence my question.

      I could have made that clearer.


Need for speed (sometimes)
by bronto (Priest) on Jul 27, 2004 at 18:10 UTC

    While I'd suggest that iKnowNothing try his luck by testing a couple of different approaches with a benchmarking module (Benchmark, for example, which gets some coverage in the Camel Book), I would like to tell a small story.
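    A minimal harness along those lines, as a sketch: the two subs below are toy stand-ins (counting records line-by-line versus record-at-a-time over fake data shaped like the dump described next), not anyone's real code.

        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        # Fake data: KEY= line, indented attribute, blank-line separator.
        my $data = "KEY=foo\n    attr=1\n\n" x 1_000;

        # Compare two toy strategies; swap in real implementations
        # to get meaningful numbers.
        cmpthese(-2, {    # -2 = run each for at least ~2 CPU seconds
            per_line   => sub { my $n = () = $data =~ /^KEY=/mg },
            per_record => sub { my $n = grep /^KEY=/, split /\n\n/, $data },
        });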

    We had a closed-source mail server that could output a dump of its internal database in plain text, with each record beginning with a KEY=something string at the very start of a line, all of the record's remaining lines indented, and an empty line as the record separator. I had to read two such dumps taken on different days and output the changes in a format that could be fed back to a preproduction server. The dumps are many gigabytes in size.

    I wrote and evolved a script that read the two files line by line, doing some comparisons and pattern matching, and I tried to speed it up; I used all the best practices I knew and tried to keep the code clear and clean. Running it on an old Sun server, I could not get it to run in less than 40 minutes.

    I then passed it to a colleague, who changed it here and there, eliminating a couple of subroutines and modules; his coding style looked rather old to me (it reminded me of the old days of Perl 4) and was a bit less clear, but it ran in 32 minutes! He tested it on his Linux box, where it took about 11 minutes (mine took 15-16 minutes).

    Now, 40 minutes was fast enough for us, and so was 32; but since there was a version of the script that was 20% faster than mine, it meant that mine could still be improved a lot.

    So, the following day I remembered that I could change the input record separator to "\n\n" and read one record at a time. Moreover, with the whole record in a single string, I could just match what was interesting, instead of trying different patterns on each line read. I did some benchmarking, changed a couple of subs, and reran the script: 16 minutes on the Sun server (yes: a 100% speedup!).
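    In outline, the rewritten loop looked something like this; the file name and the matched field are made up, the $/ trick is the point:

        use strict;
        use warnings;

        open my $dump, '<', 'db.dump' or die "db.dump: $!";

        local $/ = "\n\n";    # empty line = record separator
        while (my $record = <$dump>) {
            my ($key) = $record =~ /^KEY=(\S+)/m or next;
            # ... compare this record against the other day's dump ...
        }
        close $dump;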

    This is to say that sometimes you don't really need to make your programs faster, but trying to do so teaches you things you never cared about before: I knew about the input record separator, but I had never realized before how to use it to make my job easier and my scripts faster.

    My 2 Eurocents

    PS: Oooops! Incidentally, I wrote a meditation!

    Ciao!
    --bronto


    The very nature of Perl to be like natural language--inconsistent and full of dwim and special cases--makes it impossible to know it all without simply memorizing the documentation (which is not complete or totally correct anyway).
    --John M. Dlugosz
Re^2: Most efficient way to build output file?
by Anonymous Monk on Jul 27, 2004 at 16:02 UTC
    This might have to be adapted to work on a constant stream of data in a real-time environment, so I would like to make it as fast as possible.

      There are simply too many unstated possibilities to give a good answer.

      Will the adapted code be Perl? If not, any conclusions drawn from Perl code are useless.

      If it is Perl, how will perl be built for that environment?

      Will it use stdio or PerlIO?

      If stdio, will the C runtime library be an ANSI-C compliant library?

      Even the version of Perl makes a difference. In early 5.8.x builds, Perl's IO seemed to lose some of its 5.6.x performance. In 5.8.4 and 5.8.5, much of that has been regained.

      I'm not sure how reading from a stream (a socket?) will affect things. Logically, if the source is not on disk, that may reduce disk-head thrash; but if the stream is slower than the disk, then the stream becomes the limiting factor and buffering the output will have no beneficial effect.
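      If one does want to measure that, output buffering can at least be toggled per handle; a quick sketch (file names and row shape invented) for timing the difference on a given system:

          use strict;
          use warnings;
          use IO::Handle;                   # autoflush() on older perls
          use Benchmark qw(timethese);

          my $row = join(',', 1 .. 100) . "\n";

          # Time identical writes with buffering on and off, to see
          # whether output buffering matters here at all.
          timethese(5, {
              buffered   => sub { write_rows('buf.csv',   0) },
              unbuffered => sub { write_rows('unbuf.csv', 1) },
          });

          sub write_rows {
              my ($file, $flush) = @_;
              open my $fh, '>', $file or die "$file: $!";
              $fh->autoflush($flush);       # 1 = write through, 0 = buffered
              print {$fh} $row for 1 .. 10_000;
              close $fh;
          }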

      Finally, many realtime and realtime-capable OSes have asynchronous IO capabilities. Some will even queue reads and writes and dynamically re-order the queues to minimise head movement. Defining best practices without fairly detailed knowledge of the underlying systems and mechanisms simply isn't possible.

