Re: Most efficient way to build output file?
by Zaxo (Archbishop) on Jul 26, 2004 at 23:50 UTC
Writing your new file line by line is just fine. I'd arrange to set $\, the output record separator, to "\n" just to avoid having to print it explicitly. You place the separator at the front of each print, which I would avoid. You'll probably also want to set $, , the output field separator.
You can get buffered input of your fixed-length chunks by setting local $/ = \1034; that avoids the fussiness of working with read.
It'll look like this:
{
    # $\ is the output record separator, $, the output field separator;
    # $/ = \1034 makes <$in> return fixed 1034-byte records.
    # $_ is localized so the loop doesn't clobber a caller's $_.
    local ($\ , $, , $/, $_) = ("\n", "\t", \1034);
    my $somefmt = 'ccc v V '; # add the rest of the format

    open my $in, '<', '/path/to/original.dat' or die $!;
    binmode $in;
    open my $out, '>', '/path/to/outfile.txt' or die $!;

    while (<$in>) {
        my @data = unpack $somefmt, $_;
        # ... adjust @data
        print $out @data;    # $, joins the fields, $\ appends the newline
    }

    close $out or die $!;
    close $in  or die $!;
}
Re: Most efficient way to build output file?
by derby (Abbot) on Jul 26, 2004 at 23:41 UTC
Nope. Unless you have some very weird perl build (or you've opened FOutputFile in some odd way), print is built on top of stdio, and that should do enough buffering for you.
Re: Most efficient way to build output file?
by thor (Priest) on Jul 26, 2004 at 23:43 UTC
I don't know about Windows, but I'm fairly sure that *nix systems buffer output for you. You can observe this for yourself by writing a small program that doesn't do much output and monitoring the output file. You'll see that it stays at zero size for a while, and then jumps to some multiple of the local notion of block size. So, I guess the short answer is "Yes, printing often is fine. The right thing happens behind the scenes".
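For example, a minimal sketch of such a test (the file name, line count, sleep interval, and report interval are all made up; the point is just to watch the reported on-disk size lag behind what has been printed):

    use strict;
    use warnings;
    use Time::HiRes qw(sleep);

    # Write a little output at a time; the on-disk size stays at 0 until the
    # stdio/PerlIO buffer fills (or the handle is closed), then jumps by
    # roughly a buffer's worth at a time.
    open my $out, '>', 'buffer_test.txt' or die $!;
    for my $i (1 .. 10_000) {
        print $out "line $i\n";
        sleep 0.01;                      # slow it down so you can watch
        if ($i % 500 == 0) {
            printf "on-disk size after %d lines: %d bytes\n",
                $i, (-s 'buffer_test.txt') || 0;
        }
    }
    close $out or die $!;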
Re: Most efficient way to build output file?
by BrowserUk (Patriarch) on Jul 27, 2004 at 00:43 UTC
It takes 3 seconds to read 8 MB of binary data in 1034-byte chunks, unpack it to ASCII, and write it as a CSV file.
Do you have a particular reason for needing to do this more quickly?
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
It takes 3 seconds...
...on your system. He might be running this on an old Ultra 2 with 16 KB of RAM. Who knows? Plus, this will hopefully not be the OP's last foray into Perl. It's best to know what "best practices" are in the general case.
Point taken. Although, with 16 KB of RAM it's doubtful that perl would run at all, never mind have enough room for extra buffering :)
I wasn't trying to be dismissive. If he genuinely has a problem with the performance he is seeing--say he's dealing with a slow NFS mount, a network share, or similar--then that information might prompt a better alternative. Hence my question.
I could have made that clearer.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
While I'd suggest that iKnowNothing try his luck testing a couple of different approaches with some benchmarking module (Benchmark, for example, which gets some coverage in the Camel Book; a sketch follows), I would like to tell a small story.
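Before the story, here is the sort of minimal Benchmark comparison I have in mind; the fake records, the temporary file names, and the two strategies being compared (print each line as you go versus build the whole output first) are only placeholders for whatever you actually want to measure:

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Fabricated records, just to have something to write.
    my @lines = map { "field1\tfield2\tfield3\n" } 1 .. 50_000;

    cmpthese(-2, {    # run each alternative for at least 2 CPU seconds
        print_each_line => sub {
            open my $fh, '>', 'bench_a.tmp' or die $!;
            print {$fh} $_ for @lines;
            close $fh or die $!;
        },
        build_then_print => sub {
            open my $fh, '>', 'bench_b.tmp' or die $!;
            print {$fh} join('', @lines);
            close $fh or die $!;
        },
    });

    unlink 'bench_a.tmp', 'bench_b.tmp';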
We had a closed-source mail server that could output a dump of its internal database in plain text, with each record beginning with a KEY=something string at the very start of a line, all subsequent lines of the record indented, and an empty line as the record separator. I had to read two such dumps taken on different days and output the changes in a format that could be fed back to a preproduction server. The dumps were many gigabytes in size.
I wrote and evolved a script that read the two files line by line, doing some comparisons and pattern matching, and I kept trying to speed it up; I used all the best practices I knew and tried to keep the code clear and clean. Running it on an old Sun server, I could not get it to run in less than 40 minutes.
I then passed it to a colleague who changed it here and there, eliminating a couple of subroutines and modules; his coding style looked rather old-fashioned to me (it recalled the old days of Perl 4) and was a bit less clear, but it ran in 32 minutes! He tested it on his Linux box, where it took about 11 minutes (mine took 15 to 16 minutes).
Now, 40 minutes was fast enough for us, and so was 32, but the fact that there was a version of the script 20% faster than mine meant that I could still improve it a lot.
So, the following day I remembered that I could change the input record separator to "\n\n" and read one record at a time. Moreover, with the whole record in a string I could match just what was interesting, instead of trying different patterns on every line read. I did some benchmarking, changed a couple of subs and reran the script: 16 minutes on the Sun server (yes, more than a 100% speedup!).
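In case the technique isn't familiar, here is a minimal sketch of that record-at-a-time approach; the file name and the Status: field are invented, and only the KEY= pattern and the blank-line separator come from the dump format described above:

    use strict;
    use warnings;

    open my $dump, '<', 'mailserver.dump' or die $!;
    {
        local $/ = "\n\n";      # one blank-line-delimited record per read
        while (my $record = <$dump>) {
            my ($key)    = $record =~ /^KEY=(\S+)/;            # key at the very start of the record
            my ($status) = $record =~ /^\s+Status:\s*(\S+)/m;  # some indented field of interest
            next unless defined $key;
            # ... compare this record with the one from the other dump, emit changes, etc.
        }
    }
    close $dump;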
This is to say that sometimes you don't really need to make your programs faster, but trying to do so teaches you things you never cared about before -- I knew about the input record separator, but I had never realized how to use it to make my job easier and my scripts faster.
My 2 Eurocents
PS: Oooops! Incidentally, I wrote a meditation!
Ciao! --bronto
The very nature of Perl to be like natural language--inconsistent and full of dwim and special cases--makes it impossible to know it all without simply memorizing the documentation (which is not complete or totally correct anyway).
--John M. Dlugosz
This might have to be adapted to work on a constant stream of data in a real-time environment, so I would like to make it as fast as possible.
There are simply too many unstated possibilities to give a good answer.
Will the adapted code be Perl? If not, any conclusions drawn based upon perl code are useless.
If it is perl, how will perl be built for that environment?
Will it use stdio or PerlIO?
If stdio, will the C runtime library be an ANSI-C compliant library?
Even the version of Perl makes a difference. In early 5.8.x builds, Perl's IO seemed to lose some of its 5.6.x performance. In 5.8.4 and 5.8.5, much of that has been regained.
I'm not sure how reading from a stream (socket?) will affect things. Logically, if the source is not on disk, that may reduce disk-head thrash; but if the stream is slower than the disk, the stream becomes the limiting factor and buffering the output will have no beneficial effect.
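For what it's worth, if the data really does arrive as fixed 1034-byte records over a socket or pipe, something like the following sketch reassembles them from partial reads; the handle name and the surrounding usage ($stream, $somefmt) are assumptions, and only the record length comes from the thread:

    use strict;
    use warnings;

    # Collect exactly $want bytes from a stream handle, coping with short reads.
    sub read_fixed_record {
        my ($fh, $want) = @_;
        my $buf = '';
        while (length $buf < $want) {
            my $got = sysread $fh, $buf, $want - length $buf, length $buf;
            die "read error: $!" unless defined $got;
            return undef if $got == 0;    # EOF (any partial record is dropped)
        }
        return $buf;
    }

    # while (defined(my $record = read_fixed_record($stream, 1034))) {
    #     my @data = unpack $somefmt, $record;
    #     # ...
    # }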
Finally, many realtime and realtime-capable OSs have asynchronous IO capability. Some will even queue reads and writes and dynamically re-order the queues to minimise head movement. It simply isn't possible to define best practices without fairly detailed knowledge of the underlying systems and mechanisms.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon