kepler has asked for the wisdom of the Perl Monks concerning the following question:
Hi
I have several files which I want to concatenate into a single file. My problem is that they are big (1 gigabyte each). What would be the most accurate and fastest way to do this? Thanks in advance.
Kepler
Re: Append big files
by BrowserUk (Patriarch) on Sep 14, 2016 at 23:01 UTC
The fastest way under Windows is:

    copy file1/b + file2/b + file3/b allfiles/b

If you can arrange for allfiles to be on a different device (drive) to the rest -- e.g. create it on an SSD -- then it will be quicker than if all the files are on the same device(*).

(*) Not necessarily true if the one device is actually a set of multiple RAIDed drives.
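For example, assuming the sources sit on one drive and D: is the SSD (the drive letter is illustrative), a single leading /b applies binary mode to all the files that follow:

    copy /b file1 + file2 + file3 D:\allfiles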
Re: Append big files
by kcott (Archbishop) on Sep 15, 2016 at 00:32 UTC
G'day kepler,

I ran some tests on this. I used a 1GB source file (text_G_1), consisting of 10,000,000 identical 100-byte records, which I concatenated 10 times to give a 10GB output file. I performed the concatenation two ways: in record mode (one record at a time) and in slurp mode (the whole file at once).

Which is better rather depends on what you mean by that. I've already addressed speed: is faster better? If slurp mode hogs memory, then record mode might be considered better. I don't see any issue with accuracy. What sort of inaccuracies did you envisage?

Also note that paragraph mode, or some other type of block mode, may be a better fit for you. Without knowing what your data looks like, it's impossible to tell. See $/ in perlvar for more information about this.

The code I used is along the lines of the sketch below. I suggest you run similar tests and then, based on your data, system, and any other local considerations, determine what optimally suits your situation.
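A rough sketch of the two modes described (an assumed reconstruction, not the original benchmark code; the file names follow the post, and the mode is picked by a command-line argument):

    #!/usr/bin/env perl
    # Sketch: append text_G_1 to text_G_10 ten times, in 'record' or 'slurp' mode.
    use strict;
    use warnings;
    use Time::HiRes qw(time);

    my $mode = shift // 'record';   # 'record' or 'slurp'
    my $src  = 'text_G_1';          # 1GB file of 100-byte records
    my $dst  = 'text_G_10';         # 10GB output file

    open my $out, '>', $dst or die "Can't write '$dst': $!";

    my $t0 = time;
    for my $pass (1 .. 10) {
        open my $in, '<', $src or die "Can't read '$src': $!";
        if ($mode eq 'slurp') {
            local $/;                       # undef $/ : read the whole file at once
            print {$out} scalar <$in>;
        }
        else {
            print {$out} $_ while <$in>;    # one 100-byte record at a time
        }
        close $in;
    }
    printf "%s mode: %.2f seconds\n", $mode, time - $t0;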
I ran this a few times.
— Ken
Re: Append big files
by Marshall (Canon) on Sep 14, 2016 at 22:17 UTC
by kepler (Scribe) on Sep 14, 2016 at 22:31 UTC
Hi, I'm working in Windows. I'm getting some misconfigurations with my MS-DOS system, so I'm using Perl in the Windows 7 environment (much quicker in almost all the tasks...). Kepler
by Marshall (Canon) on Sep 15, 2016 at 01:42 UTC
Update: I just saw the post by BrowserUk. It's fine to put in the explicit /b switch, although I believe the default is binary in the first place. I looked for an exact quote from Microsoft to that effect, but couldn't find it. "copy /B file1+file2 result" also sets binary for all of the files without having to /B each one. But again, I don't think you have to /B any of them. I have never used the /A option.

Update: I did find some Microsoft stuff about /b and copy: see copy command. Yes, /b (binary) is the default. /a is a pretty much worthless critter that will append an extra EOF character (Ctrl-Z, 0x1A) to the end of the file after the copy. This is certainly not necessary on a Windows file system -- it will supply something that Perl recognizes as EOF when the file runs out of bytes; that is the normal way.
by hippo (Archbishop) on Sep 14, 2016 at 22:55 UTC
"I'm working in windows."

You have my sympathy. However, there's still the TYPE command available to you there, should you choose to use it. Otherwise, and more portably, just use the PPT version of cat.
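For example (the file names are placeholders), with the built-in:

    type file1 file2 file3 > allfiles

or, assuming the PPT utilities are installed and on your PATH, the portable equivalent:

    cat file1 file2 file3 > allfiles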
by Discipulus (Canon) on Sep 15, 2016 at 10:21 UTC
Re: Append big files -- oneliner
by Discipulus (Canon) on Sep 15, 2016 at 10:30 UTC
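A minimal sketch of such a one-liner (a reconstruction; allfiles and the source names are placeholders, and the quoting is for a Unix-style shell -- use double quotes under cmd.exe) might be:

    perl -ple 'BEGIN{ open my $out, ">", shift or die $!; select $out }' allfiles file1 file2 file3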
The -p prints each line; -l does automatic line-ending handling; the BEGIN block shifts @ARGV, using that file as the destination; select makes print send everything to the destination.

PS: if you want something that can print either to a destination file or to STDOUT, you can modify the above as in the sketch below:
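One possible variant (an assumption, not the original: here a destination of - means print to STDOUT):

    perl -ple 'BEGIN{ my $dst = shift; if ($dst ne "-") { open my $out, ">", $dst or die $!; select $out } }' - file1 file2 file3

Replace - with a file name (e.g. allfiles) to write to that file instead.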
PPS: maybe this is more intelligible:
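A spelled-out version of the same idea (an assumed reconstruction, saved as e.g. append.pl):

    use strict;
    use warnings;

    die "usage: perl append.pl DESTINATION FILE...\n" unless @ARGV >= 2;

    my $dest = shift @ARGV;        # first argument is the output file
    open my $out, '>', $dest or die "Can't open '$dest': $!";
    print {$out} $_ while <>;      # <> reads the remaining @ARGV files
    close $out or die "Error writing '$dest': $!";

Run as: perl append.pl allfiles file1 file2 file3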
L*
by BrowserUk (Patriarch) on Sep 15, 2016 at 11:58 UTC
Your one-liners can be simplified to just:

    perl -pe1 file1 file2 file3 > allfiles

(Or just perl -pe1 file* > allfiles under *nix.) But it won't be as fast as your local system utility; and if one of the files doesn't contain any newlines, it will get very slow indeed, as it will slurp the entire file before writing it back to disk.