combining 2 files with a comon field

jjohhn has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: combining 2 files with a comon field by jmcnamara (Monsignor) on May 18, 2005 at 10:34 UTC
The Unix utility `join` will do this (assuming that file1 and file2 are sorted on the key field):

    $ join -t\| file1 file2
    A1|dog||Fido|
    A2|cat||Fluffy|
    A3|bird||Tweety|

From there you could filter out the extra empty field, or extend the command line options to select specific fields:

    $ join -t\| -o 1.1 1.2 2.2 2.3 file1 file2
    A1|dog|Fido|
    A2|cat|Fluffy|
    A3|bird|Tweety|

-- John.
Re: combining 2 files with a comon field by bart (Canon) on May 18, 2005 at 10:55 UTC
Use a hash, for example a hash of arrays, and fill it from the two files. Finally, print it out.

    #! perl -w
    my %data;
    open IN, "file1.txt" or die "Ugh! $!";
    while(<IN>) {
        chomp;
        my($key, $value) = split /\|/ or next;
        $data{$key}[0] = $value;
    }
    open IN, "file2.txt" or die "Ugh! $!";
    while(<IN>) {
        chomp;
        my($key, $value) = split /\|/ or next;
        $data{$key}[1] = $value;
    }
    {
        open OUT, ">file3.txt" or die "Ugh! $!";
        local($\, $,) = ("|\n", "|");
        local $^W; # avoid "use of uninitialized value"
        foreach (sort keys %data) {
            print OUT $_, @{$data{$_}}[0, 1];
        }
    }

The "`or next`" skips any empty lines in the input files. Warnings are disabled during the printout so that partly incomplete records (keys present in only one file) do not trigger them.
Re^2: combining 2 files with a comon field by jjohhn (Scribe) on May 18, 2005 at 11:47 UTC
When I tried this (before seeing your post), I received the error:

    syntax error at C:\scripts\combineCols.pl line 10, near ") {"
    Can't use global $_ in "my" at C:\scripts\combineCols.pl line 12, near "= $_

Do I have to hard-code the names of the files?

    use strict;
    my %hash;
    while(<>){
        (my $first, my $second) = split("|",$_);
        $hash{$first} = $second;
    }
    my $second
    while(<>) {
        my $line = $_
        (my $first, $second) = split("|", $line);
    }
    foreach my $key (keys %hash){
        my @list = ($hash{$key}, $second);
        $joined = "@list";
        $hash{$key} = $joined;
    }
Re^3: combining 2 files with a comon field by bart (Canon) on May 18, 2005 at 11:56 UTC
You forgot a semicolon on the line `my $second`, oh, and on the line `my $line = $_` too. That would solve your immediate syntax problem.

But it doesn't solve the semantic problem: the code doesn't do what you want. For example, there is no connection between your value for `$second` and your hash key. That connection is in the value for `$first`. It'd work somewhat better if you incorporated your final loop body (but without the loop) into the one reading the second file.

And you're making the classic newbie error of not backwhacking the "|" for split. It's also easier to use a regex for split; otherwise you'd even have to double the backslash.

    while(<>) {
        my $line = $_;
        my($first, $second) = split(/\|/, $line);  # or: "\\|"
        my @list = ($hash{$first}, $second);
        $joined = "@list";    # joins with space by default (see $" )
        $hash{$first} = $joined;
    }
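A small sketch of the split pitfall described above, not taken from the thread. The sample line and the variable names are assumptions chosen for illustration. It compares an unescaped pipe with an escaped one:

```perl
use strict;
use warnings;

my $line = "A1|dog|";

# A string pattern is still compiled as a regex, so "|" is an empty
# alternation that matches between every pair of characters.
my @wrong = split "|", $line;

# Escaping the pipe splits on the literal character; trailing empty
# fields are dropped by default.
my @right = split /\|/, $line;

print scalar(@wrong), " fields: @wrong\n";
print scalar(@right), " fields: @right\n";
```

With the unescaped pattern every character ends up in its own field, which is why the hash keys in the question's code would never have been the `A1`-style keys.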
Re^3: combining 2 files with a comon field by TheStudent (Scribe) on May 18, 2005 at 12:03 UTC
Missing semicolon after `my $second` and after `my $line = $_`. Why not just: `(my $first, $second) = split /\|/;`
Re: combining 2 files with a comon field by ZlR (Chaplain) on May 18, 2005 at 10:58 UTC
Hello jjohhn,

This might be a little 'quick and dirty' but it does what you want, using a hash:

    use strict ;
    use warnings ;
    my @file1 = ( 'A1|dog|', 'A2|cat|','A3|bird|' ) ;
    my @file2 = ( 'A1|Fido|','A2|Fluffy|','A3|Tweety|') ;
    my %rez ;
    foreach my $it (@file1) {
        my @input = split /\|/ , $it ;
        $rez{$input[0]} = join "|" ,"", @input[1,] ;
    }
    foreach my $it (@file2) {
        my @input = split /\|/ , $it ;
        $rez{$input[0]} .= join "|" ,"", @input[1,] ;
    }
    print $_, $rez{$_}, "\n" foreach ( sort keys %rez ) ;

Hope this helps :)

zlr_
Re: combining 2 files with a comon field by ghenry (Vicar) on May 18, 2005 at 10:58 UTC
I think you are looking for split, or -F on the command line (see perlrun), and then you can write out the things you want to a new file. You can split on |.

If you need an example after reading the split page and the open tutorial, just come back here with what you've tried ;-)

HTH.

Walking the road to enlightenment... I found a penguin and a camel on the way..... Fancy a yourname@perl.me.uk? Just ask!!!
Re: combining 2 files with a comon field by anotherstevew (Initiate) on May 18, 2005 at 14:05 UTC
<perl newbie puts head above parapet for first time - cautiously - to humbly offer an approach that doesn't use arrays (so it can handle BIG files) and ignores keys only present in one of the input files>

    #!/usr/bin/perl -w
    use strict;
    open ONE, "1.txt" or die "Cannot open 1.txt to read\n $!";
    open TWO, "2.txt" or die "Cannot open 2.txt to read\n $!";
    open TRE, ">3.txt" or die "Cannot open 3.txt to write\n $!";
    my ($twoa, $twob);    # current record from TWO, kept between iterations
    while (<ONE>) {
        chomp;
        (my $onea, my $oneb) = split(/\|/);
        # advance TWO only while its key is still behind the key from ONE
        while ((!defined $twoa or $twoa lt $onea) and !eof(TWO)) {
            my $two = <TWO>;
            chomp $two;
            ($twoa, $twob) = split(/\|/, $two);
        }
        print TRE "$onea|$oneb|$twob|\n" if (defined $twoa and $onea eq $twoa);
    }
Re^2: combining 2 files with a comon field by merzy (Scribe) on May 19, 2005 at 01:35 UTC
Yeah, I'm a big fan of this method. Memory-gentle (constant memory, one pass over each file), easy to understand. We use it a lot at work for files with tens of millions of lines. Note that it assumes sorted input files, but that's what sort(1) is for. :-)
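For reference, a minimal sketch of the sorted-merge idea discussed here. It makes several assumptions: both inputs are sorted in plain string order (the order `lt` and `gt` use), keys are unique within each file, and each record looks like `key|value|`. The helper names `next_record` and `merge_join` and the sample data are hypothetical. In-memory filehandles stand in for the real files so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: read the next non-empty record from a handle and
# return (key, value), or an empty list at end of input.
sub next_record {
    my ($fh) = @_;
    while (defined(my $line = <$fh>)) {
        chomp $line;
        next unless length $line;
        my ($key, $value) = split /\|/, $line;
        return ($key, $value);
    }
    return;
}

# Walk both sorted inputs in step; only keys present in both are written.
sub merge_join {
    my ($one, $two, $out) = @_;
    my ($k1, $v1) = next_record($one);
    my ($k2, $v2) = next_record($two);
    while (defined $k1 and defined $k2) {
        if    ($k1 lt $k2) { ($k1, $v1) = next_record($one); }
        elsif ($k1 gt $k2) { ($k2, $v2) = next_record($two); }
        else {
            print {$out} "$k1|$v1|$v2|\n";
            ($k1, $v1) = next_record($one);
            ($k2, $v2) = next_record($two);
        }
    }
}

# In-memory stand-ins for the two sorted input files.
my $file1  = "A1|dog|\nA2|cat|\nA4|fish|\n";
my $file2  = "A2|Fluffy|\nA3|Tweety|\nA4|Bubbles|\n";
my $merged = '';

open my $one, '<', \$file1  or die "Cannot read first input: $!";
open my $two, '<', \$file2  or die "Cannot read second input: $!";
open my $out, '>', \$merged or die "Cannot open output: $!";
merge_join($one, $two, $out);
close $out;

print $merged;
```

Only one record per input is held at a time, so memory use does not grow with file size. If the files were sorted under a different collation than Perl's string comparison, the merge would silently miss matches, so sorting with `LC_ALL=C sort` is the safer pairing.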
Re: combining 2 files with a comon field by mattk (Pilgrim) on May 18, 2005 at 13:03 UTC
This is assuming the files have the same number of lines:

    use IO::File;
    $f1 = new IO::File "< file1";
    $f2 = new IO::File "< file2";
    $f3 = new IO::File "> file3";
    while (my ($c1, $c2) = map { m/^.*?\|(.*)\|$/ } ($f1->getline, $f2->getline)) {
        print $f3 "A$.|$c1|$c2|\n";
        last if eof;
    }

Reads in one line from each file, extracts the column values using map and a regex, and then prints out a new line containing both column values, plus a header made from `$.`
Re^2: combining 2 files with a comon field by jhourcle (Prior) on May 18, 2005 at 13:57 UTC
If we're going with that assumption (same number of lines, and the keys are in the same order), then it's a one-liner in Unix shell: `paste -d\| file1 file2 | cut -d\| -f1,2,5- > file3`

I think I prefer jmcnamara's solution with join, though, as it's more forgiving of bad input.
Re: combining 2 files with a comon field by radiantmatrix (Parson) on May 18, 2005 at 14:12 UTC
Hm, how about a hash per file and combine them on write?

    use strict;
    use warnings;
    open FILE1, '<', 'file1.txt' or die ($!);
    open FILE2, '<', 'file2.txt' or die ($!);
    #- my %file1 = map { split '\|', $_ } <FILE1>;
    my %file1 = map { chomp && s/\|$//g && split '\|', $_, 2 } <FILE1>;
    #- my %file2 = map { split '\|', $_ } <FILE2>;
    my %file2 = map { chomp && s/\|$//g && split '\|', $_, 2 } <FILE2>;
    ## we now have A1=>'dog' in one hash, and A1=>'Fido' in the other
    close FILE1;
    close FILE2;
    open FILE3, '>', 'file3.txt' or die ($!);
    for (sort keys %file1) {
        print FILE3 join('|',$_,$file1{$_},$file2{$_}),'|',"\n";
    }
    close FILE3;

untested

Simply put, create a hash "map" of each file, then find where the keys intersect and print out the result. Caveats:

- If there is no key in %file1, you won't get a result.
- If there is no key in %file2, you'll get a warning about printing an undefined value.
- This makes assumptions about file formats.

Update: Files that are not well-formed will cause problems; for this and other reasons, there needs to be better error-checking.

None of those are unconquerable, but they are some things to consider if you're taking the idea for production code.

Update: modified the code based on thread below. Comments '#-' are old lines. A better thing to do than cheat with the file slurp might be something like:

    my %file1;
    while (<FILE1>) {
        chomp;
        s/\|[\s]*$//;
        my ($key, $val) = split '\|', $_, 2;
        $file1{$key} = $val;
    }

Of course, that's not nearly as fun...

The Eightfold Path: 'use warnings;', 'use strict;', 'use diagnostics;', perltidy, CGI or CGI::Simple, try the CPAN first, big modules and small scripts, test first.
Re^2: combining 2 files with a comon field by jjohhn (Scribe) on May 18, 2005 at 16:18 UTC
I tried a variation of your suggestion, with an added print debug line:

    use strict;
    use warnings;
    open FILE1, '<', 'file1' or die ($!);
    my %file1 = map { split '\|', $_ } <FILE1>;
    ## we now have A1=>'dog' in hash
    close FILE1;
    for(sort keys %file1){
        print "$_\n";
        print join('|',$_,$file1{$_}),'|',"\n";
    }

and got:

    Odd number of elements in hash assignment at combine2.pl line 6, <FILE1> line 3.
    Use of uninitialized value in join or string at combine2.pl line 13.
    ||
    A1
    A1|dog|
    A3
    A3|bird|
    cat
    cat|
    |

file1 is:

    A1|dog|
    A2|cat|
    A3|bird|
Re^3: combining 2 files with a comon field by radiantmatrix (Parson) on May 18, 2005 at 20:42 UTC
Well, I did say it was untested. ;-)

    # my %file1 = map { split '\|', $_ } <FILE1>;
    my %file1 = map { chomp && s/\|$//g && split '\|', $_, 2 } <FILE1>;

Do make the same changes to the %file2 hash statement, too, in the original code. This clears the | and newline at the end of each file line before splitting, and limits the split to two parts. Hope that helps!

BTW, this is a good example of how using warnings and strict points out where the bugs are. I knew what the issue was as soon as I saw those two warning statements. ;-)

Also, make sure that your file is well-formed: that is, it ends with a newline, or you might get interesting results. This should not be done in production without some better error control...
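To make the cause of the "Odd number of elements" warning concrete, here is a small sketch that is not from the thread. The sample line is an assumption, and the clean-up uses the `\s*` form of the substitution rather than the `chomp` chain. It counts the fields that split returns before and after the trailing pipe and newline are removed:

```perl
use strict;
use warnings;

my $line = "A1|dog|\n";

# Without chomp the newline survives as a third field, so three input
# lines hand the hash assignment nine elements instead of six.
my @raw = split /\|/, $line;

# Drop the trailing pipe and newline, then split into at most two parts.
(my $clean = $line) =~ s/\|\s*$//;
my @fixed = split /\|/, $clean, 2;

print scalar(@raw),   " fields before clean-up\n";
print scalar(@fixed), " fields after clean-up\n";
```

Each line contributing three list elements also explains why `cat` turned up as a key in the debug output: the pairs drift out of alignment after the first line.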
Re^3: combining 2 files with a comon field by BerniBoy (Acolyte) on May 18, 2005 at 19:43 UTC
Hmm, it seems to me that the "map" statement also picks up the newline character after the last "|" character and tries to insert it into the hash somehow. Just a guess from what I see here...
Re: combining 2 files with a comon field by Anonymous Monk on May 18, 2005 at 23:17 UTC
Why are you guys replying to school homework? Did something change? :-)
Re^2: combining 2 files with a comon field by jjohhn (Scribe) on May 19, 2005 at 02:43 UTC
Whose homework? Don't assume. I have about 10 gigabytes of text I am trying to get my brain around, hampered only by my inexperience with Perl.

