ellem has asked for the wisdom of the Perl Monks concerning the following question:

I have some files that contain similar data.
Something along the lines of:

File1.txt

Name,SSN,File
"Doe John",123456789,3447

File2.txt

Name,Salary,ZIP
"Doe, John",50000,10003

I would like to merge (join?) this data together. I have 7 or 8 files like this. The only constant (unfortunately) is the Name. Of course, as people get married and learn how to spell, the Name occasionally changes.

My goal is to write something that will merge the files and create two files: Matches and Exceptions. I have read some things (5.10, 14.4, 14.8) in the Perl Cookbook but I am having a hard time wrapping my head around some of this stuff.

any ideas?
--
lmoran@wtsgSPAM.com
print "\x{263a}"

Replies are listed 'Best First'.
Re: Merging files
by sevensven (Pilgrim) on Oct 26, 2001 at 20:14 UTC

    Here is my no-code suggestion, since you seem to be needing more ideas than working code ;-)

    You could make a script that accepts the two file names as command line parameters.

    Then proceed to parse the first file a line at a time (skipping the header line), putting the info into a hash with the person's name as the hash key.

    Next, parse the second file, and for each name, look it up in the hash. If it's there, add the second file's info to the hash under that key and put a flag in there marking it as "ok"; otherwise print the second file's parsed info into a mismatches file.

    The final step is to run through the hash, and for each key that doesn't have the ok flag, put its info into the mismatches file.

    Hope this helps, although it would not be terribly efficient for large files, because it builds a hash linearly as big as the first file.
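    The steps above might look something like the following minimal sketch. The file names, field layout, and sample records are made up for illustration, and it uses the core Text::ParseWords module (rather than Text::CSV_XS) just so it runs without installing anything; it works on in-memory lines where a real script would read the two files.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::ParseWords qw( parse_line );    # core module; handles "Doe, John"

# Stand-ins for the two input files (first line is the header).
my @file1 = (
    'Name,SSN,File',
    '"Doe, John",123456789,3447',
    '"Smith, Anne",111223333,9999',
);
my @file2 = (
    'Name,Salary,ZIP',
    '"Doe, John",50000,10003',
    '"Jones, Bob",40000,02134',
);

# Step 1: hash the first file, keyed on name, with an "ok" flag.
my %people;
for my $line ( @file1[ 1 .. $#file1 ] ) {       # skip header
    my ( $name, @rest ) = parse_line( ',', 0, $line );
    $people{$name} = { fields => [@rest], ok => 0 };
}

# Step 2: walk the second file; matches merge in, misses are exceptions.
my ( @matches, @exceptions );
for my $line ( @file2[ 1 .. $#file2 ] ) {
    my ( $name, @rest ) = parse_line( ',', 0, $line );
    if ( exists $people{$name} ) {
        push @{ $people{$name}{fields} }, @rest;
        $people{$name}{ok} = 1;                 # flag as matched
        push @matches, join ',', qq{"$name"}, @{ $people{$name}{fields} };
    }
    else {
        push @exceptions, $line;                # name not in first file
    }
}

# Step 3: anything from the first file that never matched is an exception too.
for my $name ( sort keys %people ) {
    push @exceptions, join ',', qq{"$name"}, @{ $people{$name}{fields} }
        unless $people{$name}{ok};
}

print "MATCH: $_\n"     for @matches;
print "EXCEPTION: $_\n" for @exceptions;
```

    A real version would open the mismatches file with open() and print the exception lines there instead of to STDOUT.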

Re: Merging files
by Fletch (Bishop) on Oct 26, 2001 at 23:40 UTC
    use Text::CSV_XS ();

    my %data;

    parse_ssn_file( "file1.txt", \%data );
    parse_salary_file( "file2.txt", \%data );

    my $csv = Text::CSV_XS->new();
    foreach my $cur ( sort keys %data ) {
        $csv->combine( $cur, @{ $data{$cur} }{qw( ssn file salary zip )} );
        print $csv->string, "\n";
    }

    exit 0;

    sub parse_ssn_file {
        my( $file, $data ) = @_;
        local( *CUR );
        open( CUR, $file ) or die "Can't open $file: $!\n";
        scalar <CUR>;    # discard header
        my $csv = Text::CSV_XS->new();
        while( <CUR> ) {
            $csv->parse( $_ );
            my @cur = $csv->fields();
            $data->{ $cur[0] }->{ssn}  = $cur[1];
            $data->{ $cur[0] }->{file} = $cur[2];
        }
        return;
    }

    sub parse_salary_file {
        my( $file, $data ) = @_;
        local( *CUR );
        open( CUR, $file ) or die "Can't open $file: $!\n";
        scalar <CUR>;    # discard header
        my $csv = Text::CSV_XS->new();
        while( <CUR> ) {
            $csv->parse( $_ );
            my @cur = $csv->fields();
            $data->{ $cur[0] }->{salary} = $cur[1];
            $data->{ $cur[0] }->{zip}    = $cur[2];
        }
        return;
    }

    __END__
    file1.txt:

        Name,SSN,File
        "Doe, John",123456789,3447
        "Flinstone, Fred",000000302,1234

    file2.txt:

        Name,Salary,ZIP
        "Flinstone, Fred",34500,90210
        "Doe, John",50000,10003

    Output:

        "Doe, John",123456789,3447,50000,10003
        "Flinstone, Fred",000000302,1234,34500,90210

    Update: D'oh, might help if I answered the whole question. :/ If you want to catch exceptions then you'd want to pick whatever file is going to be canonical and parse it first. Have all the other parse_foo_file routines first do an exists $data->{$cur} to check whether it's a valid, existing name before inserting the data. If it's not, have an open log file (open( REJ, ">$filename.rej" ) or die $!) that you can print the rejected entries into.
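    That exists() check could look something like the following sketch. The sub name and sample rows are hypothetical, it collects rejects in an array rather than printing to a REJ handle so it is self-contained, and it uses the core Text::ParseWords module in place of Text::CSV_XS so it runs as-is:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::ParseWords qw( parse_line );    # core; copes with quoted commas

# %data as the canonical (first-parsed) file would have left it.
my %data = ( 'Doe, John' => { ssn => '123456789', file => '3447' } );
my @rej;                                  # stand-in for the .rej file

# Hypothetical per-row version of parse_salary_file with the
# exists() guard: only names already in %data get merged.
sub add_salary_row {
    my ( $line, $data, $rej ) = @_;
    my ( $name, $salary, $zip ) = parse_line( ',', 0, $line );
    if ( exists $data->{$name} ) {        # valid, existing name?
        $data->{$name}{salary} = $salary;
        $data->{$name}{zip}    = $zip;
    }
    else {
        push @$rej, $line;                # unknown name: reject the row
    }
}

add_salary_row( '"Doe, John",50000,10003',   \%data, \@rej );
add_salary_row( '"Smith, Anne",42000,90210', \%data, \@rej );

print "salary for Doe: $data{'Doe, John'}{salary}\n";
print "rejected: $_\n" for @rej;
```

    In the real script the push would be a print to the already-open REJ handle.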

      Thanks! This really did clear up a lot of what I was doing wrong.
      Two questions:
      1 - Although you commented it I don't understand what
      scalar <CUR>;
      does.

      2 - Instead of using Names as the key you're using two columns? What would you suggest for files with a greater number of columns? Could I just continue on with $cur[3],[4] etc.?
      --
      lmoran@wtsgSPAM.com
      print "\x{263a}"

        As for number one, that's just throwing away the first line. It's just the normal readline diamond operator in void context (technically the scalar is superfluous, but I put it in out of habit). The return value is being discarded rather than being saved anywhere (my $toss = <CUR> would have done just the same thing with a (useless) temporary).

        As for the data structure, it's a hash of hashes. The top level hash (%data) has keys that are the primary key from your data file (the names), and the corresponding value is a hash of key/value pairs from the various data files. Add something like the following to the end of the program and you can see what it looks like:

        ...
        use Data::Dumper qw( Dumper );
        print Dumper( \%data );

        See also perldoc perldsc.

Re: Merging files
by DamnDirtyApe (Curate) on Oct 26, 2001 at 21:27 UTC

    If you want a bit of conceptual guidance, you may find something useful in Data Munging with Perl. Unfortunately, as I post this, the link to the two sample chapters isn't working. Check back here later, though; there's some good advice about the kind of thing you're talking about, along with some code examples.

    _______________
    D a m n D i r t y A p e
    Home Node | Email