PerlMonks  

combining 2 files with a comon field

by jjohhn (Scribe)
on May 18, 2005 at 10:20 UTC ( [id://458135]=perlquestion )

jjohhn has asked for the wisdom of the Perl Monks concerning the following question:

I would like to take two files, each with 2 fields:
File1:
A1|dog|
A2|cat|
A3|bird|

File2:
A1|Fido|
A2|Fluffy|
A3|Tweety|

I want to have a single file:
File3
A1|dog|Fido|
A2|cat|Fluffy|
A3|bird|Tweety|

I can't figure out how to iterate through the files "vertically" to make a "horizontal" merge.

Replies are listed 'Best First'.
Re: combining 2 files with a comon field
by jmcnamara (Monsignor) on May 18, 2005 at 10:34 UTC

    The Unix utility join will do this (assuming that file1 and file2 are sorted):
    $ join -t\| file1 file2
    A1|dog||Fido|
    A2|cat||Fluffy|
    A3|bird||Tweety|
    From which you could filter out the extra empty field, or extend the command-line options to select specific fields:
    $ join -t\| -o 1.1 1.2 2.2 2.3 file1 file2
    A1|dog|Fido|
    A2|cat|Fluffy|
    A3|bird|Tweety|

    --
    John.

Re: combining 2 files with a comon field
by bart (Canon) on May 18, 2005 at 10:55 UTC
    Use a hash, for example a hash of arrays, and fill it with the two files. Finally, print it out.
    #! perl -w
    my %data;
    open IN, "file1.txt" or die "Ugh! $!";
    while(<IN>) {
        chomp;
        my($key, $value) = split /\|/ or next;
        $data{$key}[0] = $value;
    }
    open IN, "file2.txt" or die "Ugh! $!";
    while(<IN>) {
        chomp;
        my($key, $value) = split /\|/ or next;
        $data{$key}[1] = $value;
    }
    {
        open OUT, ">file3.txt" or die "Ugh! $!";
        local($\, $,) = ("|\n", "|");
        local $^W;   # avoid "use of uninitialized value"
        foreach (sort keys %data) {
            print OUT $_, @{$data{$_}}[0, 1];
        }
    }

    The "or next" is to skip any empty lines in the input files. Disabling warnings in the printout is done to ignore warnings on any partly incomplete records.

      When I tried this (before seeing your post), I received the error:
      syntax error at C:\scripts\combineCols.pl line 10, near ") {"
      Can't use global $_ in "my" at C:\scripts\combineCols.pl line 12, near "= $_
      Do I have to hard-code the names of the files?
      use strict;
      my %hash;
      while(<>){
          (my $first, my $second) = split("|",$_);
          $hash{$first} = $second;
      }
      my $second
      while(<>) {
          my $line = $_
          (my $first, $second) = split("|", $line);
      }
      foreach my $key (keys %hash){
          my @list = ($hash{$key}, $second);
          $joined = "@list";
          $hash{$key} = $joined;
      }
        You forgot a semicolon on the line
        my $second
        oh, and on the line
        my $line = $_
        too.

        That would solve your immediate syntax problem. But it doesn't solve the semantic problem: that it doesn't do what you want. For example, there is no connection between your value for $second and your hash key. That connection is in the value for $first. It'd work somewhat better if you incorporated your final loop body (but without the loop) into the one reading the second file. And you're making the classic newbie error of not backwhacking the "|" for split — and it's easier to use a regex for split, otherwise you'd even have to double the backslash.

        while(<>) {
            my $line = $_;
            my($first, $second) = split(/\|/, $line);   # or: "\\|"
            my @list = ($hash{$first}, $second);
            $joined = "@list";   # joins with space by default (see $" )
            $hash{$first} = $joined;
        }
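        The point about escaping the pipe can be seen in a minimal sketch. The sample line is taken from the question; the variable names @wrong and @right are only illustrative. A string passed to split is still treated as a pattern, so an unescaped "|" is an alternation of two empty patterns and matches between every character:

```perl
use strict;
use warnings;

my $line = "A1|dog|";

# "|" as a pattern means "empty OR empty", so it matches between
# every character and the line is split into single characters.
my @wrong = split("|", $line);

# Escaping the pipe makes it a literal delimiter.
my @right = split(/\|/, $line);

print scalar(@wrong), " pieces: @wrong\n";   # 7 pieces: A 1 | d o g |
print scalar(@right), " pieces: @right\n";   # 2 pieces: A1 dog
```

        The trailing empty field after the last pipe is dropped in the second case because split discards trailing empty fields when no limit is given.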
        Missing semicolon after:

        my $second

        and after

        my $line = $_

        Why not just:

        (my $first, $second) = split /\|/;
Re: combining 2 files with a comon field
by ZlR (Chaplain) on May 18, 2005 at 10:58 UTC
    Hello jjohhn ,

    This might be a little 'quick and dirty' but it does what you want, using a hash :

    use strict ;
    use warnings ;
    my @file1 = ( 'A1|dog|', 'A2|cat|', 'A3|bird|' ) ;
    my @file2 = ( 'A1|Fido|', 'A2|Fluffy|', 'A3|Tweety|' ) ;
    my %rez ;
    foreach my $it (@file1) {
        my @input = split /\|/ , $it ;
        $rez{$input[0]} = join "|" , "", @input[1,] ;
    }
    foreach my $it (@file2) {
        my @input = split /\|/ , $it ;
        $rez{$input[0]} .= join "|" , "", @input[1,] ;
    }
    print $_, $rez{$_}, "\n" foreach ( sort keys %rez ) ;
    Hope this helps :)
    zlr_
Re: combining 2 files with a comon field
by ghenry (Vicar) on May 18, 2005 at 10:58 UTC

    I think you are looking for split or -F on the command line (see perlrun) and then you can write out the things you want to a new file.

    You can split on |

    If you need an example after reading the split page and the open tutorial, just come back here with what you've tried ;-)

    HTH.

    Walking the road to enlightenment... I found a penguin and a camel on the way.....
    Fancy a yourname@perl.me.uk? Just ask!!!
Re: combining 2 files with a comon field
by anotherstevew (Initiate) on May 18, 2005 at 14:05 UTC
    <perl newbie puts head above parapet for first time - cautiously - to humbly offer an approach that doesn't use arrays (so it can handle BIG files) and ignores keys only present in one of the input files>
    #!/usr/bin/perl -w
    use strict;
    open ONE, "1.txt" or die "Cannot open 1.txt to read\n $!";
    open TWO, "2.txt" or die "Cannot open 2.txt to read\n $!";
    open TRE, ">3.txt" or die "Cannot open 3.txt to write\n $!";
    while (<ONE>) {
        chomp;
        (my $onea, my $oneb) = split(/\|/);
        my $twoa = undef;
        my $twob = undef;
        # advance through file 2 until its key catches up with file 1
        while (! eof(TWO)) {
            my $two = <TWO>;
            chomp $two;
            ($twoa, $twob) = split(/\|/, $two);
            last if ($twoa ge $onea);
        }
        if ($onea lt $twoa) {
            next;
        } else {
            print TRE "$onea\|$oneb\|$twob\|\n" if ($onea eq $twoa);
        }
    }
      Yeah, I'm a big fan of this method. Memory-gentle, a single linear pass, easy to understand. We use it a lot at work for files with tens of millions of lines. Note that it assumes sorted input files, but that's what sort(1) is for. :-)
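      One thing to watch in the loop above: a line read from the second file is dropped once the first file's key turns out to be smaller, so a key that is missing from the second file can cost the following match. A minimal sketch of the same sorted-merge idea that keeps one pending line per input is shown here. The helper name merge_sorted and the in-memory sample data are hypothetical, and the sketch assumes both inputs are sorted in plain string order (for example with LC_ALL=C sort) and contain no blank lines:

```perl
use strict;
use warnings;

# Hypothetical helper: merge two key-sorted "key|value|" streams,
# writing only the keys that appear in both.
sub merge_sorted {
    my ($in1, $in2, $out) = @_;
    my $l1 = <$in1>;
    my $l2 = <$in2>;
    while (defined $l1 && defined $l2) {
        chomp(my $r1 = $l1);
        chomp(my $r2 = $l2);
        my ($k1, $v1) = split /\|/, $r1;
        my ($k2, $v2) = split /\|/, $r2;
        if    ($k1 lt $k2) { $l1 = <$in1> }   # key only in input 1: advance input 1
        elsif ($k1 gt $k2) { $l2 = <$in2> }   # key only in input 2: advance input 2
        else {
            print {$out} "$k1|$v1|$v2|\n";    # keys match: emit and advance both
            $l1 = <$in1>;
            $l2 = <$in2>;
        }
    }
}

# Demo on in-memory data; A2 is deliberately missing from the second input.
my $data1 = "A1|dog|\nA2|cat|\nA3|bird|\n";
my $data2 = "A1|Fido|\nA3|Tweety|\n";
open my $in1, '<', \$data1 or die $!;
open my $in2, '<', \$data2 or die $!;
my $merged = '';
open my $out, '>', \$merged or die $!;
merge_sorted($in1, $in2, $out);
close $out;
print $merged;
```

      Only one line from each input is held at a time, so memory use stays flat regardless of file size.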
Re: combining 2 files with a comon field
by mattk (Pilgrim) on May 18, 2005 at 13:03 UTC
    This is assuming the files have the same number of lines:
    use IO::File;
    $f1 = new IO::File "< file1";
    $f2 = new IO::File "< file2";
    $f3 = new IO::File "> file3";
    while (my ($c1, $c2) = map { m/^.*?\|(.*)\|$/ } ($f1->getline, $f2->getline)) {
        print $f3 "A$.|$c1|$c2|\n";
        last if eof;
    }
    Reads in one line from each file, extracts the column values using map and a regex, and then prints out a new line containing both column values, plus a header made from $.

      If we're going with that assumption (same number of lines, and the keys are in the same order), then it's a one liner in unix shell:

      paste  -d\| file1 file2 | cut -d\| -f1,2,5- > file3

      I think I prefer jmcnamara's solution with join, though, as it's more forgiving of bad input.

Re: combining 2 files with a comon field
by radiantmatrix (Parson) on May 18, 2005 at 14:12 UTC

    Hm, how about a hash per file and combine them on write?

    use strict;
    use warnings;
    open FILE1, '<', 'file1.txt' or die ($!);
    open FILE2, '<', 'file2.txt' or die ($!);
    #- my %file1 = map { split '\|', $_ } <FILE1>;
    my %file1 = map { chomp && s/\|$//g && split '\|', $_, 2 } <FILE1>;
    #- my %file2 = map { split '\|', $_ } <FILE2>;
    my %file2 = map { chomp && s/\|$//g && split '\|', $_, 2 } <FILE2>;
    ## we now have A1=>'dog' in one hash, and A1=>'Fido' in the other
    close FILE1;
    close FILE2;
    open FILE3, '>', 'file3.txt' or die ($!);
    for (sort keys %file1) {
        print FILE3 join('|',$_,$file1{$_},$file2{$_}),'|',"\n";
    }
    close FILE3;
    untested

    Simply put, create a hash "map" of each file, then find where the keys intersect and print out the result.

    Caveats:

    • If there is no key in %file1, you won't get a result
    • If there is no key in %file2, you'll get a warning about printing an undefined value.
    • this makes assumptions about file formats
    • Update: Files that are not well-formed will cause problems -- for this and other reasons, there needs to be better error-checking.
    None of those are unconquerable, but they are some things to consider if you're taking the idea for production code.

    Update: modified the code based on the thread below. Comments '#-' are old lines. A better thing to do than cheat with the file slurp might be something like:
    my %file1;
    while (<FILE1>) {
        chomp;
        s/\|[\s]*$//;
        my ($key, $val) = split '\|', $_, 2;
        $file1{$key} = $val;
    }
    Of course, that's not nearly as fun...
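    For the first two caveats, one option is to walk the union of the keys and fill in an empty field wherever one side is missing. This is only a sketch under assumptions: the helper names load_pairs and join_pairs are hypothetical, the sample data is made up, and emitting an empty field for a missing key is a design choice, not something the original code does.

```perl
use strict;
use warnings;

# Hypothetical helper: parse "key|value|" lines into a hash, skipping blank lines.
sub load_pairs {
    my ($fh) = @_;
    my %pairs;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\|\s*$//;              # drop the trailing pipe
        next unless length $line;
        my ($key, $val) = split /\|/, $line, 2;
        $pairs{$key} = $val;
    }
    return \%pairs;
}

# Hypothetical helper: join two hashes on the union of their keys,
# leaving an empty field where a key is absent from one side.
sub join_pairs {
    my ($h1, $h2) = @_;
    my %seen = map { ($_ => 1) } keys %$h1, keys %$h2;
    my @rows;
    for my $key (sort keys %seen) {
        my $v1 = defined $h1->{$key} ? $h1->{$key} : '';
        my $v2 = defined $h2->{$key} ? $h2->{$key} : '';
        push @rows, join('|', $key, $v1, $v2) . "|\n";
    }
    return @rows;
}

# Demo on in-memory data: A2 is only in the first input, A3 only in the second.
my $d1 = "A1|dog|\nA2|cat|\n";
my $d2 = "A1|Fido|\nA3|Tweety|\n";
open my $fh1, '<', \$d1 or die $!;
open my $fh2, '<', \$d2 or die $!;
my @rows = join_pairs(load_pairs($fh1), load_pairs($fh2));
print @rows;
```

    Both files are still held in memory, so this suits inputs that fit comfortably in RAM.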

    The Eightfold Path: 'use warnings;', 'use strict;', 'use diagnostics;', perltidy, CGI or CGI::Simple, try the CPAN first, big modules and small scripts, test first.

      I tried a variation of your suggestion, with an added print debug line:
      use strict;
      use warnings;
      open FILE1, '<', 'file1' or die ($!);
      my %file1 = map { split '\|', $_ } <FILE1>;
      ## we now have A1=>'dog' in hash
      close FILE1;
      for(sort keys %file1){
          print "$_\n";
          print join('|',$_,$file1{$_}),'|',"\n";
      }
      and got:
      Odd number of elements in hash assignment at combine2.pl line 6, <FILE1> line 3.
      Use of uninitialized value in join or string at combine2.pl line 13.

      ||
      A1
      A1|dog|
      A3
      A3|bird|
      cat
      cat|
      |

      file1 is:
      A1|dog|
      A2|cat|
      A3|bird|

        Well, I did say it was untested. ;-)

        # my %file1 = map { split '\|', $_ } <FILE1>;
        my %file1 = map { chomp && s/\|$//g && split '\|', $_, 2 } <FILE1>;
        Do make the same changes to the %file2 hash statement, too, in the original code. This clears the | and newline at the end of each file line before splitting, and limits the split to two parts. Hope that helps!

        BTW, this is a good example of how using warnings and strict points out where the bugs are. I knew what the issue was as soon as I saw those two warning statements. ;-) Also, make sure that your file is well-formed: that is, it ends with a newline, or you might get interesting results.

        This should *not* be done in production without some better error control...


        The Eightfold Path: 'use warnings;', 'use strict;', 'use diagnostics;', perltidy, CGI or CGI::Simple, try the CPAN first, big modules and small scripts, test first.

        Hmm, it seems to me that the "map" statement also picks up the newline characters after the last "|" character and tries to insert them into the hash somehow.

        Just a guess from what I see here...
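        That guess can be checked with a small sketch, using one line of file1 as sample input (the variable names are only illustrative). Splitting the raw line leaves the newline as a third field, so three lines give nine list elements, which is what produces the "Odd number of elements" warning:

```perl
use strict;
use warnings;

my $raw = "A1|dog|\n";

# Without cleanup, the text after the last pipe (the newline) is a third field.
my @unchomped = split /\|/, $raw;

# Strip the trailing pipe and newline first, then split into key and value.
(my $clean = $raw) =~ s/\|\s*$//;
my @fields = split /\|/, $clean, 2;

print scalar(@unchomped), " fields before cleanup\n";   # 3 fields before cleanup
print scalar(@fields), " fields after cleanup\n";       # 2 fields after cleanup
```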
Re: combining 2 files with a comon field
by Anonymous Monk on May 18, 2005 at 23:17 UTC
    Why are you guys replying on school homework? Did something change? :-)
      Whose homework? Don't assume. I have about 10 gigabytes of text I am trying to get my brain around, hampered only by my inexperience with Perl.
