kepler has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I have several files which I want to concatenate into a single file. My problem is that they are big (1 gigabyte each). What would be the best, most accurate, and fastest way to do this? Thanks in advance.

Kepler

Replies are listed 'Best First'.
Re: Append big files
by BrowserUk (Patriarch) on Sep 14, 2016 at 23:01 UTC

    The fastest way under Windows is: copy file1/b + file2/b + file3/b allfiles/b

    If you can arrange for allfiles to be on a different device (drive) from the rest -- e.g. create it on an SSD -- then it will be quicker than if all the files are on the same device(*).

    * not necessarily true if the one device is actually a RAID set of multiple drives.
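
    If the join has to happen from inside a Perl program anyway, one option is simply to drive that same copy command from Perl. A minimal sketch, where the input names file1..file3 and the output name allfiles are hypothetical:

    use strict;
    use warnings;

    # Windows only: build and run  copy /b file1+file2+file3 allfiles
    # (system with a single command string is handed to the shell, so
    # the cmd.exe built-in copy is available)
    my @parts  = ('file1', 'file2', 'file3');   # hypothetical inputs
    my $target = 'allfiles';                    # hypothetical output

    my $cmd = 'copy /b ' . join('+', @parts) . " $target";
    system($cmd) == 0 or die "copy failed: $?";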


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Append big files
by kcott (Archbishop) on Sep 15, 2016 at 00:32 UTC

    G'day kepler,

    I ran some tests on this. I used a 1GB source file (text_G_1), consisting of 10,000,000 identical 100-byte records, which I concatenated 10 times to give a 10GB output file. I performed the concatenation in two ways:

    • SLURP MODE: read the entire source file into memory, then append it, in its entirety, to the output file. This averaged a little under 10s per file. If you have 1GB of memory free, this is probably the fastest.
    • RECORD MODE: read individual records from the source then append each to the output. This averaged a little over 10s per file. This could be faster if you're short on memory. I'd expect this to vary depending on the number of records in your source files.

    Which is better rather depends on what you mean by that. I've already addressed speed: is faster better? If slurp mode hogs memory, then record mode might be considered better.

    I don't see any issue with accuracy. What sort of inaccuracies did you envisage?

    Also note that paragraph mode, or some other type of block mode, may be a better fit for you. Without knowing what your data looks like, it's impossible to tell. See $/ in perlvar for more information about this.
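
    To illustrate that last point (separately from the benchmark code further below): setting $/ to a reference to an integer switches readline into fixed-size block reads, which avoids both slurping a whole file and depending on line lengths. A rough sketch, where the 1MB block size and the file names are just assumptions:

    use strict;
    use warnings;
    use autodie qw{:all};

    my $block_size = 1024 * 1024;            # 1MB per read; tune to taste

    open my $out_fh, '>>', 'allfiles.out';   # hypothetical output file
    binmode $out_fh;

    for my $source ('file1', 'file2') {      # hypothetical source files
        open my $in_fh, '<', $source;
        binmode $in_fh;
        local $/ = \$block_size;             # read fixed-size blocks, not lines
        print {$out_fh} $_ while <$in_fh>;
    }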

    The code I used, along with a representative output, is below. I suggest you run similar tests and then, based on your data, system, and any other local considerations, determine what optimally suits your situation.

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use autodie qw{:all};
    use Time::HiRes qw{time};

    my $source_file = "$ENV{HOME}/local/dev/test_data/text_G_1";
    my $concat_file = 'pm_1171789_concat_files.out';
    my $files_to_concat = 10;

    print "Source file:\n";
    system "ls -l $source_file";

    {
        print "*** SLURP MODE ***\n";

        my $t0 = time;
        my ($t1, $t2);

        open my $out_fh, '>', $concat_file;

        for my $file_number (1 .. $files_to_concat) {
            local $/;
            $t1 = time;
            open my $in_fh, '<', $source_file;
            print {$out_fh} <$in_fh>;
            $t2 = time;
            print "Duration (file $file_number): ", $t2 - $t1, " seconds\n";
        }

        print 'Total Duration: ', $t2 - $t0, " seconds\n";
        print 'Average Duration: ', ($t2 - $t0) / $files_to_concat, " seconds\n";
        print "Concat file:\n";
        system "ls -l $concat_file";
        unlink $concat_file;
    }

    {
        print "*** RECORD MODE ***\n";

        my $t0 = time;
        my ($t1, $t2);

        open my $out_fh, '>', $concat_file;

        for my $file_number (1 .. $files_to_concat) {
            $t1 = time;
            open my $in_fh, '<', $source_file;
            print {$out_fh} $_ while <$in_fh>;
            $t2 = time;
            print "Duration (file $file_number): ", $t2 - $t1, " seconds\n";
        }

        print 'Total Duration: ', $t2 - $t0, " seconds\n";
        print 'Average Duration: ', ($t2 - $t0) / $files_to_concat, " seconds\n";
        print "Concat file:\n";
        system "ls -l $concat_file";
        unlink $concat_file;
    }

    I ran this a few times. Here's a representative output:

    $ pm_1171789_concat_files.pl
    Source file:
    -rw-r--r--  1 ken  staff  1000000000  8 Feb  2013 /Users/ken/local/dev/test_data/text_G_1
    *** SLURP MODE ***
    Duration (file 1): 10.8527989387512 seconds
    Duration (file 2): 10.3008499145508 seconds
    Duration (file 3): 9.28949189186096 seconds
    Duration (file 4): 9.32191705703735 seconds
    Duration (file 5): 9.52065896987915 seconds
    Duration (file 6): 8.99451112747192 seconds
    Duration (file 7): 8.9400429725647 seconds
    Duration (file 8): 11.1913599967957 seconds
    Duration (file 9): 8.92997598648071 seconds
    Duration (file 10): 9.06094002723694 seconds
    Total Duration: 96.4215109348297 seconds
    Average Duration: 9.64215109348297 seconds
    Concat file:
    -rw-r--r--  1 ken  staff  10000000000 15 Sep 09:25 pm_1171789_concat_files.out
    *** RECORD MODE ***
    Duration (file 1): 8.15429711341858 seconds
    Duration (file 2): 8.53961801528931 seconds
    Duration (file 3): 8.15934777259827 seconds
    Duration (file 4): 9.53512001037598 seconds
    Duration (file 5): 11.1551628112793 seconds
    Duration (file 6): 12.2594971656799 seconds
    Duration (file 7): 10.3394339084625 seconds
    Duration (file 8): 10.799164056778 seconds
    Duration (file 9): 11.6248168945312 seconds
    Duration (file 10): 12.0355160236359 seconds
    Total Duration: 102.602620124817 seconds
    Average Duration: 10.2602620124817 seconds
    Concat file:
    -rw-r--r--  1 ken  staff  10000000000 15 Sep 09:27 pm_1171789_concat_files.out

    — Ken

Re: Append big files
by Marshall (Canon) on Sep 14, 2016 at 22:17 UTC
    What's wrong with command line?
    cat file1 file2 > all
    cat file1 >> file2

      Hi

      I'm working in Windows. I'm getting some misconfiguration problems with my MS-DOS (command prompt) setup, so I'm using Perl in the Windows 7 environment (much quicker in almost all the tasks...).

      Kepler

        On Windows, use copy: copy file1+file2 Newfile. The file1+file2 syntax is weird, but that's the way it works. A wildcard would also work: copy file* Newfile. At the command prompt, type "help copy" for the details.

        Update: I just saw the post by BrowserUk. It's fine to put in the explicit /b switch, although I believe binary is the default in the first place. I looked for an exact quote from Microsoft to that effect, but couldn't find it. "copy /B file1+file2 result" also sets binary for all of the files without having to /B each one. But again, I don't think you have to /B any of them. I have never used the /A option.

        Update: I did find some Microsoft documentation about /b and copy (see the copy command reference). Yes, /b (binary) is the default. /a is a pretty much worthless critter that will append an extra EOF character (Ctrl-Z, 0x1A) to the end of the file after the copy. This is certainly not necessary on a Windows file system: it will supply something that Perl recognizes as EOF when the file runs out of bytes, which is the normal way.

        I'm working in windows.

        You have my sympathy. However, there's still the TYPE command available to you there should you choose to use it. Otherwise, and more portably, just use the PPT version of cat.
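
        For reference, a cat-like concatenator is only a handful of lines of Perl. This is just an illustrative sketch (not the PPT implementation itself), copying each named file to STDOUT in binary-safe 1MB chunks:

        use strict;
        use warnings;

        binmode STDOUT;                            # pass bytes through untouched
        for my $file (@ARGV) {
            open my $in_fh, '<', $file or die "Can't open '$file': $!";
            binmode $in_fh;
            while (read($in_fh, my $buf, 1024 * 1024)) {
                print $buf;
            }
            close $in_fh;
        }

        Used as, say, perl mycat.pl file1 file2 > allfiles (the script and file names are made up).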

Re: Append big files -- oneliner
by Discipulus (Canon) on Sep 15, 2016 at 10:30 UTC
    I do not think 1GB is a problem for a Perl one-liner:

    perl -ple "BEGIN{open $to,'>',shift @ARGV;select $to}" destination.txt source1.txt source2.txt

    The -p prints each line and -l handles line endings automatically; the BEGIN block shifts the first argument off @ARGV, opens it as the destination file, and select makes every subsequent print go to that destination.

    PS: if you want something that can print to a destination file or to STDOUT you can modify the above in:

    perl -ple "BEGIN{open $to,'>',shift @ARGV and select $to if $ARGV[0]}"

    # to STDOUT
    perl -ple "BEGIN{open $to,'>',shift @ARGV and select $to if $ARGV[0]}" 0 src1 src2

    # to file
    perl -ple "BEGIN{open $to,'>',shift @ARGV and select $to if $ARGV[0]}" dst src1 src2

    PPS: maybe this is more intelligible:

    perl -ple "BEGIN{open $to,'>',shift @ARGV and select $to unless $ARGV[0] eq 'STDOUT'}"

    # to STDOUT
    perl -ple "BEGIN{open $to,'>',shift @ARGV and select $to unless $ARGV[0] eq 'STDOUT'}" STDOUT src1 src2

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

      Your one-liners can be simplified to just: perl -pe1 file1 file2 file3 > allfiles

      (Or just perl -pe1 file* > allfiles under *nix),

      but it won't be as fast as your local system utility; and if one of the files doesn't contain any newlines, it will get very slow indeed, as it will slurp the entire file before writing it back to disk.
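
      If line semantics are the problem, one workaround (a sketch under assumptions, not a drop-in tool) is to ignore $/ entirely and copy fixed-size chunks with sysread/syswrite; the 8MB chunk size and the "last argument is the target" convention are inventions of this example:

      use strict;
      use warnings;

      my $out = pop @ARGV;                      # assumed: target is the last argument
      open my $out_fh, '>', $out or die "Can't write '$out': $!";
      binmode $out_fh;

      for my $file (@ARGV) {
          open my $in_fh, '<', $file or die "Can't read '$file': $!";
          binmode $in_fh;
          while (my $n = sysread($in_fh, my $buf, 8 * 1024 * 1024)) {
              my $written = syswrite($out_fh, $buf);
              die "Write to '$out' failed: $!"
                  unless defined $written && $written == $n;
          }
      }

      Invoked as something like perl concat.pl file1 file2 file3 allfiles (all names hypothetical).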

