issue with output of file matching

rruser has asked for the wisdom of the Perl Monks concerning the following question:

I recieved some great help matching data from 2 files, I now need to append a carryover file to file1. When I append data to file1 (carry over data from previous month it contains -C) it does not match up with the car number. I am not sure if it is the formatting in the carryover file. Thanks for your help...

output file:
ACFX  28523  L           05/18/13  ABCCO 
ACFX  28526  L           05/28/13  ABCCO    
ACFX  44866  L           05/28/13  ABCCO    
ACFX 28523  L 05/01/13           ABCCO-C 
ACFX 28526  L 05/01/13           ABCCO-C 
ACFX 44866  L 05/01/13           ABCCO-C 
ADMX  49266  L  05/03/13 05/16/13  PFGCO       
ADMX  63770  L  05/12/13 05/21/13  PFGCO       
ADMX  63975  L  05/12/13 05/30/13  PFGCO

file1:
ADMX  49266, L , 05/03/13, PFGCO
ADMX  63770, L , 05/12/13, PFGCO 
ADMX  63975, L , 05/12/13, PFGCO 

file2:
ADMX  49266, L , 05/16/13, PFGCO
ACFX  28523, L , 05/18/13, ABCCO 
ADMX  63770, L , 05/21/13, PFGCO
ACFX  28526, L , 05/28/13, ABCCO   
ACFX  44866, L , 05/28/13, ABCCO
ADMX  63975, L , 05/30/13, PFGCO

carry-over file (I append to bottom of file1)
ACFX 28523, L, 05/01/13, ABCCO-C
ACFX 28526, L, 05/01/13, ABCCO-C
ACFX 44866, L, 05/01/13, ABCCO-C

I need the output file to read like this:
ACFX  28523  L  05/01/13 05/18/13  ABCCO-C 
ACFX  28526  L  05/01/13 05/28/13  ABCCO-C   
ACFX  44866  L  05/01/13 05/28/13  ABCCO-C    
ADMX  49266  L  05/03/13 05/16/13  PFGCO       
ADMX  63770  L  05/12/13 05/21/13  PFGCO       
ADMX  63975  L  05/12/13 05/30/13  PFGCO


#!/usr/bin/perl
# 
use strict;
use warnings;
use Date::Calc qw( Delta_Days );

my %hash=();
input_data(1,'out1.txt');
input_data(2,'out2.txt');
#input_data(1,'pcarry.txt');
#input_data(2,'rcarry.txt');
output_data('final.txt');

sub input_data {
  my ($ix,$filename) = @_;
  open FILE1, "<", $filename or die "$filename : $!\n";
  while ( <FILE1> ) {
    chomp $_;
    my ( $key, $le, $date, $company ) = split ',', $_;
    my $pk = join "\t",$key,$le,$company;
    push @{$hash{$pk}[$ix]},fmt_ymd($date);
  }
  close FILE1;
}

sub output_data {
  my $filename = shift;
  open OUTFILE, ">", $filename or die "$filename : $!\n";

  # primary key
  for my $pk (sort keys %hash){
    my ($key,$le,$company) = split "\t",$pk;

    # get multiple dates
#    print "$pk\n";
#   my @dates  = @{$hash{$pk}[1]};
#    my @rdates = @{$hash{$pk}[2]};
    my @dates  = (defined $hash{$pk}[1]) ? @{$hash{$pk}[1]} : ();
    my @rdates = (defined $hash{$pk}[2]) ? @{$hash{$pk}[2]} : ();
    
    # even up number of dates:
    while (@dates < @rdates) {
      push @dates,'1900-01-01';
    }
    while (@rdates < @dates) {
      push @rdates,'1900-01-01';
    }
    
    # print out multiple dates for each key
    for my $date (reverse sort @dates){
    
      # use match sub if more than 1
      if (@rdates > 1){
          @rdates = match($date,@rdates);
      }
      # rdates sorted so best match is first element
      my $rdate = shift @rdates;
      print join ' ',$key,$le,fmt_mdy($date),fmt_mdy($rdate),$company,
+"\n";
    }
  }
  close OUTFILE;
}

# match dates by calc days diff
# and sorting to get least diff
sub match {
  my ($date,@rdates) = @_;
  my @days=();
  # split date into y,m,d
  my @d1 = split /\D/,$date;
  
  # calc diff and store with date
  for my $rdate (@rdates){
    my @d2 = split /\D/,$rdate;
    push @days,[$rdate,abs Delta_Days(@d1,@d2)];
  }
  
  # sort array by days
  @days = sort {$a->[1] <=> $b->[1]} @days;
  
  # extract dates 
  return map {$_->[0]} @days;
}
[download]

Comment on issue with output of file matching Download Code

Replies are listed 'Best First'.
Re: issue with output of file matching by poj (Abbot) on Jun 07, 2013 at 19:44 UTC
If you want for example `ABCCO` to match `ABCCO-C` then you must remove the -C from the matching key and then add it back into the report after the match. You need to ensure there are no trailing spaces in the data after the -C or modify the regex to deal with them. # Define a hash for pk that have -C removed my %hashC=(); . . sub input_data { my ($ix,$filename) = @_; open FILE1, "<", $filename or die "$filename : $!\n"; while ( <FILE1> ) { chomp $_; my ( $key, $le, $date, $company ) = split ',', $_; my $pk = join "\t",$key,$le,$company; # remove -C from key and store if ($pk =~ s/-C$//){ # print "-C removed $pk\n"; $hashC{$pk} = '-C'; } push @{$hash{$pk}[$ix]},fmt_ymd($date); } close FILE1; } sub output_data { my $filename = shift; open OUTFILE, ">", $filename or die "$filename : $!\n"; # primary key for my $pk (sort keys %hash){ my ($key,$le,$company) = split "\t",$pk; # add -C back if required $company .= $hashC{$pk} \|\| ''; . . . [download] Note : The spacing of the columns in the carry over file example is different from file 1 and 2 example. Only one space shown between column1 and 2 and no space after L. If the real data is like this you may have more problems. `ACFX 28523, L, 05/01/13, ABCCO-C` [download] poj	[reply] [d/l] [select]
Re^2: issue with output of file matching by rruser (Acolyte) on Jun 07, 2013 at 20:17 UTC
Thanks so much for the help again, I replace the sub input_data and sub output_data in original script with this and I get the following: `ACFX 28523 L 05/18/13 ABCCO ACFX 28523 L 05/01/13 ABCCO-C` [download] I am pasting entire script, maybe I am doing something wrong, this is a little over my head, hopefully will begin understanding better #!/usr/bin/perl # use strict; use warnings; use Date::Calc qw( Delta_Days ); my %hashC=(); my %hash=(); input_data(1,'out1.txt'); input_data(2,'out2.txt'); #input_data(1,'pcarry.txt'); #input_data(2,'rcarry.txt'); output_data('final.txt'); #sub input_data { # my ($ix,$filename) = @_; # open FILE1, "<", $filename or die "$filename : $!\n"; # while ( <FILE1> ) { # chomp $_; # my ( $key, $le, $date, $company ) = split ',', $_; # my $pk = join "\t",$key,$le,$company; # push @{$hash{$pk}[$ix]},fmt_ymd($date); # } # close FILE1; #} sub input_data { my ($ix,$filename) = @_; open FILE1, "<", $filename or die "$filename : $!\n"; while ( <FILE1> ) { chomp $_; my ( $key, $le, $date, $company ) = split ',', $_; my $pk = join "\t",$key,$le,$company; # remove -C from key and store if ($pk =~ s/-C$//){ # print "-C removed $pk\n"; $hashC{$pk} = '-C'; } push @{$hash{$pk}[$ix]},fmt_ymd($date); } close FILE1; } #sub output_data { # my $filename = shift; # open OUTFILE, ">", $filename or die "$filename : $!\n"; # primary key # for my $pk (sort keys %hash){ # my ($key,$le,$company) = split "\t",$pk; sub output_data { my $filename = shift; open OUTFILE, ">", $filename or die "$filename : $!\n"; # primary key for my $pk (sort keys %hash){ my ($key,$le,$company) = split "\t",$pk; # add -C back if required $company .= $hashC{$pk} \|\| ''; # get multiple dates # print "$pk\n"; # my @dates = @{$hash{$pk}[1]}; # my @rdates = @{$hash{$pk}[2]}; my @dates = (defined $hash{$pk}[1]) ? @{$hash{$pk}[1]} : (); my @rdates = (defined $hash{$pk}[2]) ? @{$hash{$pk}[2]} : (); # even up number of dates: while (@dates < @rdates) { push @dates,'1900-01-01'; } while (@rdates < @dates) { push @rdates,'1900-01-01'; } # print out multiple dates for each key for my $date (reverse sort @dates){ # use match sub if more than 1 if (@rdates > 1){ @rdates = match($date,@rdates); } # rdates sorted so best match is first element my $rdate = shift @rdates; print join ' ',$key,$le,fmt_mdy($date),fmt_mdy($rdate),$company, +"\n"; } } close OUTFILE; } # match dates by calc days diff # and sorting to get least diff sub match { my ($date,@rdates) = @_; my @days=(); # split date into y,m,d my @d1 = split /\D/,$date; # calc diff and store with date for my $rdate (@rdates){ my @d2 = split /\D/,$rdate; push @days,[$rdate,abs Delta_Days(@d1,@d2)]; } # sort array by days @days = sort {$a->[1] <=> $b->[1]} @days; # extract dates return map {$_->[0]} @days; } # change mm/dd/yy to yyyy-mm-dd sub fmt_ymd { my $mdy = shift; $mdy =~ s/ //g; my ($m,$d,$y) = split /\D/,$mdy; if ($y < 99){ $y += 2000 }; return sprintf "%04d-%02d-%02d",$y,$m,$d; } # change yyyy-mm-dd to mm/dd/yy sub fmt_mdy { my $ymd = shift; $ymd =~ s/ //g; return ' 'x8 if $ymd eq '1900-01-01'; my ($y,$m,$d) = split /\D/,$ymd; $y -= 2000; return sprintf "%02d/%02d/%02d",$m,$d, $y; } [download]	[reply] [d/l] [select]
Re^3: issue with output of file matching by poj (Abbot) on Jun 07, 2013 at 20:30 UTC
I suggest you comment out this line temporarily `$company .= $hashC{$pk} \|\| '';`. If you still get -C appearing in the output then check the data carefully for trailing spaces. You can also add the primary key into the output temporarily with separators like this to see spaces or other reason why they don't match up. `print join ' ',$key,$le,fmt_mdy($date),fmt_mdy($rdate),$company,"\|$pk +\|\n";` [download] poj	[reply] [d/l] [select]
Re^4: issue with output of file matching by rruser (Acolyte) on Jun 07, 2013 at 21:49 UTC
Re^5: issue with output of file matching by poj (Abbot) on Jun 08, 2013 at 12:40 UTC
Some notes below your chosen depth have not been shown here
Re: issue with output of file matching by JockoHelios (Scribe) on Jun 07, 2013 at 19:29 UTC
In describing the issue, you mention that the carryover file needs to be appended to file 1. File 1 data, then carryover data added to it. In the example of what you need the output to look like, the carryover data is prepended to file 1. Carryover data, then file 1 data added to it. Is this part of the problem you're looking to solve, or is it a copy-and-paste error when composing the post ? Dyslexics Untie !!!	[reply]
Re^2: issue with output of file matching by rruser (Acolyte) on Jun 07, 2013 at 19:40 UTC
I did figure out the formatting issue appending from carryover file to file1. So that seems to have helped some. I still do not match up because merging of the two files requires the company name be identical - however I must know which records were from the carryover file in the final output...any help is appreciated, thanks	[reply]