G'day kepler,

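I ran some tests on this. I used a 1GB source file (text_G_1), consisting of 10,000,000 identical 100-byte records, which I concatenated 10 times to give a 10GB output file. I performed the concatenation two ways: in slurp mode (each file read in one go, with $/ undefined), which took about 96 seconds in the representative run below; and in record mode (each file read line by line), which took about 103 seconds.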

Which is better rather depends on what you mean by "better". I've already addressed speed, but is faster necessarily better? If slurp mode hogs memory (it reads each 1GB file into memory in one go), then record mode might be considered better.

I don't see any issue with accuracy. What sort of inaccuracies did you envisage?

Also note that paragraph mode, or some other type of block mode, may be a better fit for you. Without knowing what your data looks like, it's impossible to tell. See $/ in perlvar for more information about this.
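Here's a minimal sketch of one such block mode, separate from the tests in this post: setting $/ to a reference to an integer makes readline return fixed-size chunks (see perlvar). The append_file() helper, the tiny 4-byte block size and the temporary files are my assumptions, purely for illustration; in practice you'd pick a much larger block, such as 65536.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw{tempfile};

# Hypothetical helper: append $source to the already-open handle $out_fh,
# reading with $separator as the input record separator ($/).
# Returns the number of records (here: blocks) copied.
sub append_file {
    my ($source, $out_fh, $separator) = @_;
    local $/ = $separator;    # e.g. \65536 reads 64 KiB blocks
    open my $in_fh, '<', $source or die "Can't read '$source': $!";
    my $count = 0;
    while (my $record = <$in_fh>) {
        print {$out_fh} $record or die "Write failed: $!";
        ++$count;
    }
    close $in_fh;
    return $count;
}

# Demo data: a 10-byte temporary source file.
my ($src_fh, $src_name) = tempfile(UNLINK => 1);
print {$src_fh} 'abcdefghij';
close $src_fh;

# Concatenate the source three times, reading 4 bytes at a time.
my ($out_fh, $out_name) = tempfile(UNLINK => 1);
my $blocks = 0;
$blocks += append_file($src_name, $out_fh, \4) for 1 .. 3;
close $out_fh;

print "Copied $blocks blocks; output is ", (-s $out_name), " bytes\n";
```

Passing '' as the separator would give paragraph mode instead. Be aware that perlvar says paragraph mode treats two or more consecutive empty lines as a single empty line, so a copy made that way may not be byte-for-byte identical to the input.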

The code I used, along with a representative output, is below. I suggest you run similar tests and then, based on your data, system, and any other local considerations, determine what optimally suits your situation.

#!/usr/bin/env perl

use strict;
use warnings;
use autodie qw{:all};

use Time::HiRes qw{time};

my $source_file = "$ENV{HOME}/local/dev/test_data/text_G_1";
my $concat_file = 'pm_1171789_concat_files.out';
my $files_to_concat = 10;

print "Source file:\n";
system "ls -l $source_file";

{
    print "*** SLURP MODE ***\n";
    my $t0 = time;
    my ($t1, $t2);

    open my $out_fh, '>', $concat_file;

    for my $file_number (1 .. $files_to_concat) {
        local $/;
        $t1 = time;
        open my $in_fh, '<', $source_file;
        print {$out_fh} <$in_fh>;
        $t2 = time;
        print "Duration (file $file_number): ", $t2 - $t1, " seconds\n";
    }

    print 'Total Duration: ', $t2 - $t0, " seconds\n";
    print 'Average Duration: ', ($t2 - $t0) / $files_to_concat, " seconds\n";
    print "Concat file:\n";
    system "ls -l $concat_file";
    unlink $concat_file;
}

{
    print "*** RECORD MODE ***\n";
    my $t0 = time;
    my ($t1, $t2);

    open my $out_fh, '>', $concat_file;

    for my $file_number (1 .. $files_to_concat) {
        $t1 = time;
        open my $in_fh, '<', $source_file;
        print {$out_fh} $_ while <$in_fh>;
        $t2 = time;
        print "Duration (file $file_number): ", $t2 - $t1, " seconds\n";
    }

    print 'Total Duration: ', $t2 - $t0, " seconds\n";
    print 'Average Duration: ', ($t2 - $t0) / $files_to_concat, " seconds\n";
    print "Concat file:\n";
    system "ls -l $concat_file";
    unlink $concat_file;
}

I ran this a few times. Here's a representative output:

$ pm_1171789_concat_files.pl
Source file:
-rw-r--r-- 1 ken staff 1000000000 8 Feb 2013 /Users/ken/local/dev/test_data/text_G_1
*** SLURP MODE ***
Duration (file 1): 10.8527989387512 seconds
Duration (file 2): 10.3008499145508 seconds
Duration (file 3): 9.28949189186096 seconds
Duration (file 4): 9.32191705703735 seconds
Duration (file 5): 9.52065896987915 seconds
Duration (file 6): 8.99451112747192 seconds
Duration (file 7): 8.9400429725647 seconds
Duration (file 8): 11.1913599967957 seconds
Duration (file 9): 8.92997598648071 seconds
Duration (file 10): 9.06094002723694 seconds
Total Duration: 96.4215109348297 seconds
Average Duration: 9.64215109348297 seconds
Concat file:
-rw-r--r-- 1 ken staff 10000000000 15 Sep 09:25 pm_1171789_concat_files.out
*** RECORD MODE ***
Duration (file 1): 8.15429711341858 seconds
Duration (file 2): 8.53961801528931 seconds
Duration (file 3): 8.15934777259827 seconds
Duration (file 4): 9.53512001037598 seconds
Duration (file 5): 11.1551628112793 seconds
Duration (file 6): 12.2594971656799 seconds
Duration (file 7): 10.3394339084625 seconds
Duration (file 8): 10.799164056778 seconds
Duration (file 9): 11.6248168945312 seconds
Duration (file 10): 12.0355160236359 seconds
Total Duration: 102.602620124817 seconds
Average Duration: 10.2602620124817 seconds
Concat file:
-rw-r--r-- 1 ken staff 10000000000 15 Sep 09:27 pm_1171789_concat_files.out
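If you want a source file of the same shape for your own tests, something along these lines should do. This is only a sketch: make_test_file() is a hypothetical helper, the record content (a run of 'x' characters plus a newline) is made up, and the demo writes a small temporary file rather than the full 1GB.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw{tempfile};

# Hypothetical helper: write $count identical records to $path,
# each exactly $length bytes long (trailing newline included).
sub make_test_file {
    my ($path, $count, $length) = @_;
    my $record = ('x' x ($length - 1)) . "\n";    # made-up record content
    open my $fh, '>', $path or die "Can't write '$path': $!";
    binmode $fh;    # no newline translation, so sizes are exact everywhere
    print {$fh} $record for 1 .. $count;
    close $fh or die "Can't close '$path': $!";
    return;
}

# Small demo: 1,000 records of 100 bytes each.
# 10_000_000 records of 100 bytes would give a 1,000,000,000-byte file.
my (undef, $path) = tempfile(UNLINK => 1);
make_test_file($path, 1_000, 100);
print 'Size: ', (-s $path), " bytes\n";
```

Real data with varying record lengths may well behave differently, which is another reason to test with your own files.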

— Ken


In reply to Re: Append big files by kcott
in thread Append big files by kepler
