iamravikanth has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, seeking your wisdom again. My requirement is as follows: I have the following files
1_ABC.txt 2_ABC.txt 3_ABC.txt 1_XYZ.txt 2.XYZ.txt 5.XYZ.txt
I have to append all files starting with 1 into a single file (1_AppendFile.txt), all files starting with 2 into a single file (2_AppendFile.txt), and so on. I could loop through all the files, check the starting value of each name, open a filehandle accordingly, and then print the data into that file. But when the file count gets close to 10,000 it really gets very slow. Is there any other alternative that we can use here? Regards, Ravi.

Replies are listed 'Best First'.
Re: Appending multiple files into one or more files
by Ratazong (Monsignor) on Jul 19, 2010 at 12:30 UTC

    You may use some simple shell commands:

    cat 1*.txt >> AppendFile_1.txt
    cat 2*.txt >> AppendFile_2.txt

    Of course you can write a Perl script for generating and executing the cat commands...
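    A minimal sketch of that idea, assuming the files sit in the current directory and follow the naming from the question (the `[._]` class covers both the `1_XYZ.txt` and `2.XYZ.txt` spellings); each generated string could then be run with system():

```perl
use strict;
use warnings;

# Build one "cat" shell command per numeric prefix found in the
# given file names.
sub cat_commands {
    my %prefixes;
    for my $name (@_) {
        $prefixes{$1} = 1 if $name =~ /^(\d+)[._]/;
    }
    return map { sprintf('cat %s[._]*.txt >> AppendFile_%s.txt', $_, $_) }
           sort { $a <=> $b } keys %prefixes;
}
```

    Running the commands is then just `system($_) == 0 or warn "cat failed: $?"` over the returned list.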

    HTH, Rata
Re: Appending multiple files into one or more files
by roboticus (Chancellor) on Jul 19, 2010 at 13:41 UTC

    iamravikanth:

    Hmmm ... you're complaining that it gets very slow. But a straightforward implementation shouldn't take much time at all beyond the actual creation of the appended files. Since you're not presenting any code, I can't tell whether you've got an error in your program that causes the low performance.

    Try commenting out the part of your program that actually creates the appended data files and measure how long it takes to run with 10000 file numbers. If it runs quickly, then the performance issue is due to the amount of data you're reading and writing. In that case, you may want to use tricks like putting your output files on a different disk drive than your input files to reduce your I/O time.

    If, on the other hand, it takes a long time to run, then I'd expect either:

    • Your algorithm to split your list of filenames into groups to append together has a problem (like nested for loops or some other O(n²) issue), or
    • You're using a filesystem that doesn't handle many files in a single directory well, and the act of scanning through the directory is taking a long time.

    You can differentiate between these two cases with a script that simply reads all the filenames from the directory and does nothing with them. If that script runs quickly, then you have a problem with your algorithm that splits up your filenames. If it runs very slowly, then you may want to change the filesystem you're using or perhaps partition files up into different subdirectories.
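    One way to run that check, sketched in Perl (the directory path and the leading-digits pattern are assumptions from the question):

```perl
use strict;
use warnings;

# Scan a directory and count the names starting with digits, doing
# nothing else with them - if even this is slow, the filesystem or
# directory size is the suspect, not your grouping algorithm.
sub count_numbered {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Can't open $dir: $!";
    my $count = grep { /^\d+/ } readdir $dh;
    closedir $dh;
    return $count;
}
```

    Wrap the call in Time::HiRes timings (or just `time perl scan.pl`) to see where the seconds go.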


    I did a quick test: I generated roughly 50,000 files, grouped them by their numeric prefix, and then deleted all of the files, like so:

    roboticus@Boink:~/funkytest$ ls
    genfiles.pl  groupfiles.pl
    roboticus@Boink:~/funkytest$ time ./genfiles.pl

    real    0m4.937s
    user    0m1.644s
    sys     0m3.292s
    roboticus@Boink:~/funkytest$ ls | wc -l
    49993
    roboticus@Boink:~/funkytest$ time ./groupfiles.pl
    6712: 6712_WRW, 6712_DIK, 6712_FRB
    8563: 8563_FHL, 8563_AAE, 8563_TSL, 8563_LCU
    5006: 5006_SLA, 5006_PZK, 5006_PUB, 5006_FCK
    8434: 8434_HNX, 8434_HPB, 8434_YED, 8434_SCB, 8434_KBS, 8434_CEH, 8434_JCH, 8434_NVN, 8434_VPN, 8434_GFM, 8434_BNJ
    3509: 3509_EAY, 3509_WNU, 3509_MUI, 3509_NPX, 3509_LHX
    7652: 7652_IMC, 7652_GMN
    4863: 4863_MTN, 4863_RGD, 4863_BFT, 4863_LSF, 4863_KNJ, 4863_JGE
    Files: 49991, Groups: 9922

    real    0m0.736s
    user    0m0.664s
    sys     0m0.080s
    roboticus@Boink:~/funkytest$ time rm {0,1,2,3,4,5,6,7,8,9}*

    real    0m6.451s
    user    0m3.044s
    sys     0m3.212s
    roboticus@Boink:~/funkytest$

    As you can see, it takes little time to split the files into groups (on my machine, anyway). If the files had data in them and I did the concatenation you mention, then the runtime would be totally dominated by the act of making the concatenated files.

    ...roboticus

      The code I used (if anyone cares) is:

      genfiles.pl

      #!/usr/bin/perl -w
      use strict;
      use warnings;

      for (1 .. 50000) {
          my $t = int rand 10000;
          my $u = join('', ('A'..'Z')[rand 26, rand 26, rand 26]);
          open my $OUF, '>', $t . '_' . $u or die;
          close $OUF;
      }

      groupfiles.pl

      #!/usr/bin/perl -w
      use strict;
      use warnings;

      my %filegroups;
      opendir(my $DIRH, '.') || die "Can't open dir: $!\n";
      my $cnt = 0;
      while (my $filename = readdir($DIRH)) {
          next unless $filename =~ /^(\d+)/;
          push @{$filegroups{$1}}, $filename;
          ++$cnt;
      }
      my $cnt2 = 0;
      for my $grp (keys %filegroups) {
          print "$grp: ", join(", ", @{$filegroups{$grp}}), "\n";
          last if $cnt2++ > 5;
      }
      print "Files: $cnt, Groups: ", scalar(keys %filegroups), "\n";

      ...roboticus

        Hi Roboticus, thanks for your suggestions; they really helped.
        Regards, Ravi.
Re: Appending multiple files into one or more files
by Corion (Patriarch) on Jul 19, 2010 at 12:25 UTC

    I guess you will have to show us what code you've written. My code would mostly use opendir, readdir.

Re: Appending multiple files into one or more files
by Marshall (Canon) on Jul 19, 2010 at 13:27 UTC
    First, get a list of all the files. Below, the list comes from the __DATA__ segment, but for your app, use opendir and readdir.

    Make a hash of arrays keyed upon the "number" of the file (the digits before the '.' or '_').

    Then for each new numbered "append" file, open a file handle and copy the files to the new file.

    #!/usr/bin/perl -w
    use strict;

    my %new_files;
    while (my $filename = <DATA>) {
        chomp $filename;
        my $num = ($filename =~ /^(\d+)/)[0];   # just the first digits
        push(@{$new_files{$num}}, $filename);
    }

    foreach my $new_file (keys %new_files) {
        print "open file for ....$new_file" . "_AppendFile.txt\n"; # your code
        print "put these files in there: \n";  # you have to make a copy loop
        print "@{$new_files{$new_file}}", "\n";
        print "\n";
    }

    =prints:
    open file for ....1_AppendFile.txt
    put these files in there:
    1_ABC.txt 1_XYZ.txt

    open file for ....3_AppendFile.txt
    put these files in there:
    3_ABC.txt

    open file for ....2_AppendFile.txt
    put these files in there:
    2_ABC.txt 2.XYZ.txt

    open file for ....5_AppendFile.txt
    put these files in there:
    5.XYZ.txt

    =cut

    __DATA__
    1_ABC.txt
    2_ABC.txt
    3_ABC.txt
    1_XYZ.txt
    2.XYZ.txt
    5.XYZ.txt
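    A sketch of the copy loop that the comments above leave as an exercise (output naming follows the print statements; error handling is kept minimal):

```perl
use strict;
use warnings;

# Append the contents of each source file to one output file,
# opening the output handle only once per group.
sub append_group {
    my ($out_name, @sources) = @_;
    open my $out, '>>', $out_name or die "Can't open $out_name: $!";
    for my $src (@sources) {
        open my $in, '<', $src or die "Can't read $src: $!";
        print {$out} $_ while <$in>;
        close $in;
    }
    close $out or die "Can't close $out_name: $!";
}
```

    Inside the foreach loop it would be called as append_group($new_file . '_AppendFile.txt', @{$new_files{$new_file}}).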
Re: Appending multiple files into one or more files
by cdarke (Prior) on Jul 19, 2010 at 16:28 UTC
    An alternative method is to use the diamond operator (the ARGV filehandle). Set the input filenames in @ARGV, then the opens are taken care of. For example:
    open(my $fh, '>', '1_AppendFile.txt') or die "etc: $!";
    @ARGV = grep { !/AppendFile/ } glob('1*.txt');  # don't read the output file back in
    while (<>) { print $fh $_ }
    close($fh);
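    Extended to every group, under the assumption that the prefixes of interest are known (hard-coded below from the question's file names); the grep keeps each output file out of its own @ARGV:

```perl
use strict;
use warnings;

# One diamond-operator pass per prefix: glob the group's files into
# @ARGV, then let <> handle all the opens for us.
sub append_by_prefix {
    my @prefixes = @_;
    for my $p (@prefixes) {
        my @files = grep { !/AppendFile/ } glob($p . '[._]*.txt');
        next unless @files;
        local @ARGV = @files;
        open my $out, '>>', $p . '_AppendFile.txt'
            or die "Can't open ${p}_AppendFile.txt: $!";
        print {$out} $_ while <>;
        close $out;
    }
}
```

    For the question's data that would be append_by_prefix(1, 2, 3, 5).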
Re: Appending multiple files into one or more files
by Sinistral (Monsignor) on Jul 19, 2010 at 13:34 UTC

    And, if you want, you can always use IO::Cat or File::Cat along with glob to separate the 1_* files from the 2_* files.