Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,
I have a file containing multiple filenames.

Sample format:
    This/is/the/full/path/file.abc.part-1.txt
    This/is/the/full/path/file.abc.part-2.txt
    This/is/the/full/path/file.abc.part-3.txt
    This/is/the/full/path/file.def.part-1.txt
    This/is/the/full/path/file.def.part-2.txt
    This/is/the/full/path/file.ghi.part-1.txt
    This/is/the/full/path/file.jkl.part-2.txt
    This/is/the/full/path/file.mno.part-5.txt

I want to concatenate the files that share the same name. For example:

    This/is/the/full/path/file.abc.part-1.txt
    This/is/the/full/path/file.abc.part-2.txt
    This/is/the/full/path/file.abc.part-3.txt

These 3 files should be appended together into a file named: This/is/the/full/path/file.abc.MERGED.txt
I think I should split at the '.', but how do I compare abc and def so that they go into different files?
Thank you all in advance.

Re: compare and merge files
by bobf (Monsignor) on Feb 05, 2006 at 03:08 UTC

    Based on the wording of your question, I assume you already know how to open files and how to read from and write to them. It sounds like determining which files to merge is the holdup, so that's what I focused on.

    I hacked this together based on your example data. If you don't want to use a regex, a combination of File::Basename, File::Spec, and split would accomplish the same thing.

    This approach simply pulls the filenames apart and uses a hash to keep track of the path, 'group by' field, and 'part' number. After all of the filenames have been read, files to be merged are identified by looking for names that contain more than one 'part' in the array. It's quick and dirty, but gets the job done.

        use strict;
        use warnings;
        use Data::Dumper;

        my @paths = qw(
            This/is/the/full/path/file.abc.part-1.txt
            This/is/the/full/path/file.abc.part-2.txt
            This/is/the/full/path/file.abc.part-3.txt
            This/is/the/full/path/file.def.part-1.txt
            This/is/the/full/path/file.def.part-2.txt
            This/is/the/full/path/file.ghi.part-1.txt
            This/is/the/full/path/file.jkl.part-2.txt
            This/is/the/full/path/file.mno.part-5.txt
        );

        my %combo;
        foreach my $pathfile ( @paths ) {
            if( $pathfile =~ m/^(.+file\.)(\w+?)\.(part-\d+)\.txt$/ ) {
                push( @{ $combo{$1}{$2} }, $3 );
            }
            else {
                warn "$pathfile does not match expected format";
            }
        }

        foreach my $path ( keys %combo ) {
            foreach my $type ( keys %{ $combo{$path} } ) {
                if( scalar @{ $combo{$path}{$type} } > 1 ) {
                    my $newfile = join( '', $path, $type, '.MERGED.txt' );
                    print "These files go into $newfile:\n";
                    print ' ', join( ', ', @{ $combo{$path}{$type} } ), "\n";
                }
            }
        }

        print Dumper( \%combo );

    Using the example data from the OP, this outputs:

        These files go into This/is/the/full/path/file.abc.MERGED.txt:
         part-1, part-2, part-3
        These files go into This/is/the/full/path/file.def.MERGED.txt:
         part-1, part-2
        $VAR1 = {
          'This/is/the/full/path/file.' => {
            'jkl' => [
              'part-2'
            ],
            'abc' => [
              'part-1',
              'part-2',
              'part-3'
            ],
            'mno' => [
              'part-5'
            ],
            'def' => [
              'part-1',
              'part-2'
            ],
            'ghi' => [
              'part-1'
            ]
          }
        };
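
    For anyone who would rather avoid the regex, here is a minimal sketch of the File::Basename plus split route mentioned above (File::Spec is left out for brevity). The helper name merge_target is made up for illustration, and the sketch assumes every file name has exactly the form NAME.GROUP.part-N.txt.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical helper: map one "part" path to the merged-file name it
# belongs to. Returns undef (in scalar context) when the file name does
# not look like NAME.GROUP.part-N.txt.
sub merge_target {
    my ($pathfile) = @_;
    my ( $name, $dir ) = fileparse($pathfile);    # file name and directory part
    my @fields = split /\./, $name;               # e.g. file, abc, part-1, txt
    return
        unless @fields == 4
        && $fields[2] =~ /^part-\d+$/
        && $fields[3] eq 'txt';
    return $dir . join( '.', @fields[ 0, 1 ], 'MERGED', 'txt' );
}

my @paths = qw(
    This/is/the/full/path/file.abc.part-1.txt
    This/is/the/full/path/file.abc.part-2.txt
    This/is/the/full/path/file.def.part-1.txt
);

# Group the part files under the name of the merged file they feed.
my %group;
for my $pathfile (@paths) {
    my $target = merge_target($pathfile);
    if ( defined $target ) {
        push @{ $group{$target} }, $pathfile;
    }
    else {
        warn "$pathfile does not match expected format\n";
    }
}

for my $target ( sort keys %group ) {
    print "$target <- @{ $group{$target} }\n";
}
```

    Keying the hash on the full merged-file name keeps the path and the 'group by' field together, so a single level of hash is enough.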

      Maybe this is helpful:

          use strict;
          use warnings;
          use Data::Dumper;

          my @paths = qw(
              This/is/the/full/path/file.abc.part-1.txt
              This/is/the/full/path/file.abc.part-2.txt
              This/is/the/full/path/file.abc.part-3.txt
              This/is/the/full/path/file.def.part-1.txt
              This/is/the/full/path/file.def.part-2.txt
              This/is/the/full/path/file.ghi.part-1.txt
              This/is/the/full/path/file.jkl.part-2.txt
              This/is/the/full/path/file.mno.part-5.txt
          );

          my %combo;
          foreach my $pathfile ( @paths ){
              if( $pathfile =~ m/^.+file\.\w+?\.part-\d+\.txt$/ ){
                  (my $key = $pathfile) =~ s/part-\d+\.txt$/MERGED.txt/;
                  push( @{ $combo{$key} }, $pathfile );
              }
              else{
                  warn "$pathfile does not match expected format";
              }
          }

          print Dumper(\%combo);

Re: compare and merge files
by graff (Chancellor) on Feb 05, 2006 at 15:34 UTC
    The actual file joining would be easier with the (unix) shell (or the windows equivalent of same, available from cygwin, ATT Research Labs, etc):

        cat /some/full/path/file.abc.part*.txt > /some/full/path/file.abc.MERGED.txt

    Although that could go haywire if the sequencing is important and the file names don't sort "naturally" into the correct order -- e.g. if you have names like:

        file.abc.part-1.txt
        file.abc.part-10.txt
        file.abc.part-11.txt
        file.abc.part-2.txt
        file.abc.part-3.txt
        ...

    If that's how the file names go, you'll want to use Perl to sort things properly. How about:

        #!/usr/bin/perl
        use strict;

        die "Usage: $0 /full/path/to/data/dir\n"
            unless ( @ARGV == 1 and -d $ARGV[0] );

        my $basedir = shift;
        chdir $basedir or die "chdir $basedir failed: $!";

        my @parts = <*.part-*.txt>;
        # or:
        # opendir D, ".";
        # my @parts = grep /\.part-\d+\./, readdir D;
        # closedir D;

        my %merged;
        # sort numerically on the part number so that part-10 follows part-9
        for my $part ( sort { my ($x) = ($a=~/part-(\d+)/);
                              my ($y) = ($b=~/part-(\d+)/);
                              $x <=> $y } @parts )
        {
            my ( $merge_key ) = ( $part =~ /^(.*)\.part-\d/ );
            push @{$merged{$merge_key}}, $part;
        }

        for my $mrgfile ( sort keys %merged ) {
            open( OUT, ">", "$mrgfile.MERGED.txt" )
                or die "$mrgfile.MERGED.txt: $!";
            local $/;    # slurp mode: read each part file in one go
            for my $partfile ( @{$merged{$mrgfile}} ) {
                open( I, "<", $partfile ) or die "$partfile: $!";
                $_ = <I>;
                close I;
                print OUT;
            }
            close OUT;
        }

    (not tested, but should be pretty close to what you want)

      Really, even then you can still stick to the shell; it's just a little trickier . . .

          $ /bin/ls foo-*.txt | sort -n -t - -k 2,2
          foo-1.txt
          foo-2.txt
          foo-3.txt
          foo-4.txt
          foo-5.txt
          foo-6.txt
          foo-7.txt
          foo-8.txt
          foo-9.txt
          foo-10.txt
          foo-11.txt

Re: compare and merge files
by Adrade (Pilgrim) on Feb 06, 2006 at 02:54 UTC
    perl -e 'for(@ARGV){open(I,"<$_");s/\.[^\.]*\-[^\-]*$/.MERGED.txt/;open(O,">>$_");print O $_ while <I>;close(O);close(I)}' file.abc.part-1.txt file.abc.part-2.txt file.abc.part-3.txt file.def.part-1.txt file.def.part-2.txt

    --
    By a scallop's forelocks!

      I'm all for one-liners and golfing, etc, but... Did you know that you can actually include line-feeds when typing in a command-line script like this? Your post would work the same if entered as:

          perl -e 'for(@ARGV){
             open(I,"<$_");
             s/\.[^\.]*\-[^\-]*$/.MERGED.txt/;
             open(O,">>$_");print O $_ while <I>;close(O);
             close(I)
           }' file.abc.part-1.txt file.abc.part-2.txt file.abc.part-3.txt file.def.part-1.txt file.def.part-2.txt

      The line still gets a little long when all those file names need to be filled in too. (Why didn't you use "*.part-*" for that?)

      My point is, I've been seeing a lot of this one-liner stuff being offered recently, to people who probably can only scratch their heads and say "That should have been posted in the Obfu wing...".

      You'd get more appreciation (and more people would actually be more likely to learn your nifty tricks) if you made it at least a little more legible.

        I guess I sort of just saw the question and wanted to see if I could quickly pound out an answer. I really should have explained things more, probably, though I didn't really think my result was that obfu. I definitely didn't mean to confuse anyone.

        Also, I didn't use *.part-* since it didn't seem like the OP was looking to merge all the files in the dir, since the existence of file.mno.part-5.txt seems to imply the existence of file.mno.part-1.txt, file.mno.part-2.txt, etc., but only the former was listed. Incidentally, I just noticed that the OP wanted to use a file as input, in which case she or he could just replace for(@ARGV){ with while(<>){chomp; and use the file with the file locations as the only argument. For those that don't know, the diamond thingie is the magical input dingdong that pulls in input from the files listed in the arguments. When used in a while context like this, it's the same as saying while (defined($_ = <>)) { - which is sort of magical too.
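
        A rough, spelled-out sketch of that while(<>) variant might look like the following. The sub name merge_listed_parts is hypothetical, and it assumes the list file holds whitespace-separated paths (one per line or several per line). Unlike the one-liner, it uses lexical filehandles, checks open(), and skips any name the substitution leaves unchanged so that a file is never appended to itself.

```perl
use strict;
use warnings;

# Hypothetical helper: read part-file names from the list file(s) given as
# arguments and append each part to its *.MERGED.txt target.
# Returns the number of parts appended.
sub merge_listed_parts {
    local @ARGV = @_;    # the diamond operator reads the files named in @ARGV
    my $count = 0;
    while ( my $line = <> ) {
        for my $part ( split ' ', $line ) {
            # Same substitution as the one-liner: ".part-N.txt" -> ".MERGED.txt"
            ( my $target = $part ) =~ s/\.[^.]*-[^-]*$/.MERGED.txt/;
            if ( $target eq $part ) {
                warn "$part does not match expected format\n";
                next;
            }
            open( my $in,  '<',  $part )   or die "$part: $!";
            open( my $out, '>>', $target ) or die "$target: $!";
            while ( my $chunk = <$in> ) {
                print {$out} $chunk;
            }
            close $out or die "$target: $!";
            close $in;
            $count++;
        }
    }
    return $count;
}

# Only act when a list file was actually supplied on the command line.
print merge_listed_parts(@ARGV), " part(s) appended\n" if @ARGV;
```

        Saved under some name such as merge.pl (made up here), it would be run as: perl merge.pl filelist.txt. Parts are appended in the order they appear in the list, so the list itself has to be in the right sequence.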

        Anyway, Graff, thanks for pulling me back down to earth a little to sort of remember that we're all here to share and learn. Of course, apologies to any who may have been confused.

        Best,
          -Adam

        P.S. To anyone interested, you can also remove the close()s, since open() close()s filehandles of the same name should they be open, and any filehandles remaining open are closed at the script's end.
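
        A small sketch of the first point, assuming a throwaway temp directory (the file names first.txt and second.txt are made up): re-opening O flushes and closes the file it was attached to. One caveat from the perlfunc entry for close: the implicit close does not reset the line counter $., and it gives you no return value to check, so an explicit close is still worth having when write errors matter.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir( CLEANUP => 1 );    # scratch directory, removed at exit

open( O, '>', "$dir/first.txt" ) or die "first.txt: $!";
print O "hello\n";                    # small write, normally still buffered

# No close(O) here: re-opening the same handle closes first.txt implicitly.
open( O, '>', "$dir/second.txt" ) or die "second.txt: $!";

# Non-zero once the buffered data has been flushed to first.txt.
my $size_first = -s "$dir/first.txt";
print "first.txt holds $size_first bytes\n";

close O;
```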
