compare and merge files

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: compare and merge files by bobf (Monsignor) on Feb 05, 2006 at 03:08 UTC
Based on the wording of your question, I assume you already know how to open files and how to read from and write to them. It sounds like determining which files to merge is the holdup, so that's what I focused on.

I hacked this together based on your example data. If you don't want to use a regex, a combination of File::Basename, File::Spec, and split would accomplish the same thing.

This approach simply pulls the filenames apart and uses a hash to keep track of the path, 'group by' field, and 'part' number. After all of the filenames have been read, files to be merged are identified by looking for names that have more than one 'part' in the array. It's quick and dirty, but gets the job done.

    use strict;
    use warnings;
    use Data::Dumper;

    my @paths = qw(
        This/is/the/full/path/file.abc.part-1.txt
        This/is/the/full/path/file.abc.part-2.txt
        This/is/the/full/path/file.abc.part-3.txt
        This/is/the/full/path/file.def.part-1.txt
        This/is/the/full/path/file.def.part-2.txt
        This/is/the/full/path/file.ghi.part-1.txt
        This/is/the/full/path/file.jkl.part-2.txt
        This/is/the/full/path/file.mno.part-5.txt
    );

    my %combo;
    foreach my $pathfile ( @paths )
    {
        # $1 = path plus "file.", $2 = 'group by' field, $3 = part
        if( $pathfile =~ m/^(.+file\.)(\w+?)\.(part-\d+)\.txt$/ )
        {
            push( @{ $combo{$1}{$2} }, $3 );
        }
        else
        {
            warn "$pathfile does not match expected format";
        }
    }

    foreach my $path ( keys %combo )
    {
        foreach my $type ( keys %{ $combo{$path} } )
        {
            # only groups with more than one part need merging
            if( scalar @{ $combo{$path}{$type} } > 1 )
            {
                my $newfile = join( '', $path, $type, '.MERGED.txt' );
                print "These files go into $newfile:\n";
                print ' ', join( ', ', @{ $combo{$path}{$type} } ), "\n";
            }
        }
    }

    print Dumper( \%combo );

Using the example data from the OP, this outputs:

    These files go into This/is/the/full/path/file.abc.MERGED.txt:
     part-1, part-2, part-3
    These files go into This/is/the/full/path/file.def.MERGED.txt:
     part-1, part-2
    $VAR1 = {
      'This/is/the/full/path/file.' => {
        'jkl' => [ 'part-2' ],
        'abc' => [ 'part-1', 'part-2', 'part-3' ],
        'mno' => [ 'part-5' ],
        'def' => [ 'part-1', 'part-2' ],
        'ghi' => [ 'part-1' ]
      }
    };
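
As a rough illustration of the non-regex route mentioned above, here is a minimal sketch using File::Basename's fileparse plus split. The helper name split_part_name is made up for this example, and it assumes every name has exactly the shape file.GROUP.part-N.txt. File::Spec is left out for brevity.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical helper (not from the post above): pull a path such as
# "dir/file.abc.part-1.txt" apart into (directory, group, part) without
# running a regex over the whole path. Returns an empty list when the
# name does not have the assumed four dot-separated fields.
sub split_part_name {
    my ($pathfile) = @_;
    my ( $name, $dir ) = fileparse($pathfile);   # "file.abc.part-1.txt", "dir/"
    my @fields = split /\./, $name;              # file, abc, part-1, txt
    return unless @fields == 4
        and $fields[0] eq 'file'
        and $fields[3] eq 'txt'
        and index( $fields[2], 'part-' ) == 0;
    return ( $dir, $fields[1], $fields[2] );
}

# Build the same kind of nested hash as above: $combo{dir}{group} = [ parts ]
my %combo;
for my $pathfile (qw(
    This/is/the/full/path/file.abc.part-1.txt
    This/is/the/full/path/file.abc.part-2.txt
    This/is/the/full/path/file.ghi.part-1.txt
)) {
    my ( $dir, $group, $part ) = split_part_name($pathfile)
        or do { warn "$pathfile does not match expected format\n"; next };
    push @{ $combo{$dir}{$group} }, $part;
}

for my $dir ( sort keys %combo ) {
    for my $group ( sort keys %{ $combo{$dir} } ) {
        print "$dir $group: @{ $combo{$dir}{$group} }\n";
    }
}
```

Note that the hash key here is the directory alone (with its trailing slash), not the path plus "file." as in the regex version.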
Re^2: compare and merge files by reneeb (Chaplain) on Feb 05, 2006 at 07:50 UTC
Maybe this is helpful:

    use strict;
    use warnings;
    use Data::Dumper;

    my @paths = qw(
        This/is/the/full/path/file.abc.part-1.txt
        This/is/the/full/path/file.abc.part-2.txt
        This/is/the/full/path/file.abc.part-3.txt
        This/is/the/full/path/file.def.part-1.txt
        This/is/the/full/path/file.def.part-2.txt
        This/is/the/full/path/file.ghi.part-1.txt
        This/is/the/full/path/file.jkl.part-2.txt
        This/is/the/full/path/file.mno.part-5.txt
    );

    my %combo;
    foreach my $pathfile ( @paths ){
        if( $pathfile =~ m/^.+file\.\w+?\.part-\d+\.txt$/ ){
            # key is the name of the merged file this part belongs to
            (my $key = $pathfile) =~ s/part-\d+\.txt$/MERGED.txt/;
            push( @{ $combo{$key} }, $pathfile );
        }
        else{
            warn "$pathfile does not match expected format";
        }
    }

    print Dumper(\%combo);
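
Neither listing so far writes any merged file, so here is a minimal sketch of that last step. It assumes a hash shaped like the %combo built above (merged file name => list of part files). The sub names part_number and merge_groups are invented for this example, and sorting the parts numerically is an assumption about what the OP wants.

```perl
use strict;
use warnings;

# Hypothetical helper: the number after "part-" in a name such as
# "file.abc.part-10.txt", or 0 if the name has no such number.
sub part_number {
    my ($name) = @_;
    return $name =~ /part-(\d+)\.txt$/ ? $1 : 0;
}

# Hypothetical helper: given a hash ref of merged-file name => [ part files ],
# write each group's parts, in numeric part order, into the merged file.
# An existing merged file of the same name is overwritten.
sub merge_groups {
    my ($combo) = @_;
    for my $merged ( sort keys %$combo ) {
        # Numeric sort, so that part-10 comes after part-9 rather than part-1.
        my @parts = sort { part_number($a) <=> part_number($b) }
                    @{ $combo->{$merged} };

        open my $out, '>', $merged or die "Cannot write $merged: $!";
        for my $part (@parts) {
            open my $in, '<', $part or die "Cannot read $part: $!";
            print {$out} $_ while <$in>;
            close $in;
        }
        close $out or die "Cannot close $merged: $!";
    }
    return;
}

# Usage, with the hash from the listing above:
#   merge_groups( \%combo );
```

Unlike bobf's version, this merges groups that contain only one part as well; skip those first if that is not wanted.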
Re: compare and merge files by graff (Chancellor) on Feb 05, 2006 at 15:34 UTC
The actual file joining would be easier with the (unix) shell (or the windows equivalent of same, available from cygwin, ATT Research Labs, etc):

    cat /some/full/path/file.abc.part*.txt > /some/full/path/file.abc.MERGED.txt

Although that could go haywire if the sequencing is important and the file names don't sort "naturally" into the correct order -- e.g. if you have names like:

    file.abc.part-1.txt
    file.abc.part-10.txt
    file.abc.part-11.txt
    file.abc.part-2.txt
    file.abc.part-3.txt
    ...

If that's how the file names go, you'll want to use Perl to sort things properly. How about:

    #!/usr/bin/perl
    use strict;

    die "Usage: $0 /full/path/to/data/dir\n"
        unless ( @ARGV == 1 and -d $ARGV[0] );

    my $basedir = shift;
    chdir $basedir or die "chdir $basedir failed: $!";

    my @parts = <*.part-*.txt>;
    # or:
    # opendir D, ".";
    # my @parts = grep /\.part-\d+\./, readdir D;
    # closedir D;

    my %merged;
    # sort numerically by part number, then group by everything before ".part-"
    for my $part ( sort { my ($x) = ($a=~/part-(\d+)/);
                          my ($y) = ($b=~/part-(\d+)/);
                          $x <=> $y } @parts )
    {
        my ( $merge_key ) = ( $part =~ /^(.*)\.part-\d/ );
        push @{$merged{$merge_key}}, $part;
    }

    for my $mrgfile ( sort keys %merged ) {
        open( OUT, ">", "$mrgfile.MERGED.txt" )
            or die "$mrgfile.MERGED.txt: $!";
        local $/;    # slurp mode: read each part in one go
        for my $partfile ( @{$merged{$mrgfile}} ) {
            open( I, "<", $partfile ) or die "$partfile: $!";
            $_ = <I>;
            close I;
            print OUT;
        }
        close OUT;
    }

(not tested, but should be pretty close to what you want)
Re^2: compare and merge files by Fletch (Bishop) on Feb 05, 2006 at 18:59 UTC
Really, even then you can still stick to the shell; it's just a little trickier . . .

    $ /bin/ls foo-*.txt | sort -n -t - +1 -2
    foo-1.txt
    foo-2.txt
    foo-3.txt
    foo-4.txt
    foo-5.txt
    foo-6.txt
    foo-7.txt
    foo-8.txt
    foo-9.txt
    foo-10.txt
    foo-11.txt
Re: compare and merge files by Adrade (Pilgrim) on Feb 06, 2006 at 02:54 UTC
    perl -e 'for(@ARGV){open(I,"<$_");s/\.[^\.]*\-[^\-]*$/.MERGED.txt/;open(O,">>$_");print O $_ while <I>;close(O);close(I)}' file.abc.part-1.txt file.abc.part-2.txt file.abc.part-3.txt file.def.part-1.txt file.def.part-2.txt

-- By a scallop's forelocks!
Re^2: compare and merge files by graff (Chancellor) on Feb 06, 2006 at 06:12 UTC
I'm all for one-liners and golfing, etc, but... Did you know that you can actually include line-feeds when typing in a command-line script like this? Your post would work the same if entered as:

    perl -e 'for(@ARGV){
      open(I,"<$_");
      s/\.[^\.]*\-[^\-]*$/.MERGED.txt/;
      open(O,">>$_");print O $_ while <I>;close(O);
      close(I)
    }' file.abc.part-1.txt file.abc.part-2.txt file.abc.part-3.txt file.def.part-1.txt file.def.part-2.txt

The line still gets a little long when all those file names need to be filled in too. (Why didn't you use "*.part-*" for that?)

My point is, I've been seeing a lot of this one-liner stuff being offered recently, to people who probably can only scratch their heads and say "That should have been posted in the Obfu wing...". You'd get more appreciation (and more people would actually be likely to learn your nifty tricks) if you made it at least a little more legible.
Re^3: compare and merge files by Adrade (Pilgrim) on Feb 06, 2006 at 12:05 UTC
I guess I sort of just saw the question and wanted to see if I could quickly pound out an answer. I really should have explained things more, probably, though I didn't really think my result was that obfu. I definitely didn't mean to confuse anyone.

Also, I didn't use `*.part-*` since it didn't seem like the OP was looking to merge all the files in the dir: the existence of file.mno.part-5.txt seems to imply the existence of file.mno.part-1.txt, file.mno.part-2.txt, etc., but only the former was listed.

Incidentally, I just noticed that the OP wanted to use a file as input, in which case she or he could just replace `for(@ARGV){` with `while(<>){chomp;` and pass the file containing the file locations as the only argument. For those that don't know, the diamond thingie is the magical input dingdong that pulls in input from the files listed in the arguments. When used as a while condition like this, it's the same as saying `while (defined($_ = <>)) {` - which is sort of magical too.

Anyway, Graff, thanks for pulling me back down to earth a little to sort of remember that we're all here to share and learn. Of course, apologies to any who may have been confused.

Best,
-Adam

P.S. To anyone interested, you can also remove the `close()`s, since `open()` `close()`s a filehandle of the same name should it be open, and any filehandles remaining open are closed at the script's end.

-- By a scallop's forelocks!
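
For anyone who would rather read the `while(<>)` variant as a script than as a one-liner, here is a rough sketch of the same idea. The sub names merged_name and append_parts are invented for this example, the regex assumes names ending in .part-N.txt, and the guard on @ARGV (so that nothing is read from standard input by accident) is an addition that the one-liner does not have.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: map "file.abc.part-1.txt" to "file.abc.MERGED.txt".
# Returns undef if the name does not end in ".part-N.txt".
sub merged_name {
    my ($part) = @_;
    ( my $merged = $part ) =~ s/\.part-\d+\.txt$/.MERGED.txt/ or return;
    return $merged;
}

# Hypothetical helper: append each listed part to its merged file,
# in the order the parts are listed.
sub append_parts {
    my (@parts) = @_;
    for my $part (@parts) {
        my $merged = merged_name($part);
        unless ( defined $merged ) {
            warn "Skipping $part: unexpected name\n";
            next;
        }
        open my $in,  '<',  $part   or die "Cannot read $part: $!";
        open my $out, '>>', $merged or die "Cannot append to $merged: $!";
        print {$out} $_ while <$in>;
        close $in;
        close $out or die "Cannot close $merged: $!";
    }
    return;
}

# Read the part-file names, one per line, from the list file(s) named on
# the command line, skipping blank lines. Does nothing without arguments.
if (@ARGV) {
    my @list;
    while ( my $line = <> ) {
        chomp $line;
        push @list, $line if length $line;
    }
    append_parts(@list);
}
```

Because the merged files are opened in append mode, as in the one-liner, running this twice on the same list doubles their contents, and the parts land in whatever order the list gives them.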