
Here is my take. Since we have not been told how the program knows the header labels, I will assume that they are tab delimited on the first line of the file. :)

To use it, just enter the first file name, the fields to be used, the second file, and the fields to be used. The first field in each list is the one that "matches" with the other. For example,

merge.pl AA 1,2,5,6 BB 3,4,1,5

would print any records where the third field in the file "BB" matched the first field in the file "AA", and then would print out fields one, two, five, and six from AA, followed by four, one, and five from BB. Output is tab-delimited.
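
To make the matching rule concrete, here is a minimal sketch of the same join on in-memory rows, assuming the 1-based column lists from the example above. The merge_rows helper and the sample data are made up for illustration; they are not part of merge.pl.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of merge.pl: joins two sets of rows
# (array refs of fields) using 1-based column lists like those given
# on the command line. The first entry in each list is the match field.
sub merge_rows {
    my ($a_rows, $a_cols, $b_rows, $b_cols) = @_;
    my ($a_key, $b_key) = ($a_cols->[0] - 1, $b_cols->[0] - 1);

    # Index B's rows by their matching field.
    my %b_by_key;
    $b_by_key{ $_->[$b_key] } = $_ for @$b_rows;

    my @out;
    for my $row (@$a_rows) {
        my $match = $b_by_key{ $row->[$a_key] } or next;
        push @out, join "\t",
            (map { $row->[$_ - 1] } @$a_cols),
            # Skip B's matching column: A already supplies it.
            (map { $match->[$_ - 1] } @{$b_cols}[1 .. $#$b_cols]);
    }
    return @out;
}

# Made-up sample data, for illustration only.
my @a_rows = ([qw(k1 a2 a3 a4 a5 a6)], [qw(k9 x2 x3 x4 x5 x6)]);
my @b_rows = ([qw(b1 b2 k1 b4 b5)]);

print "$_\n" for merge_rows(\@a_rows, [1, 2, 5, 6], \@b_rows, [3, 4, 1, 5]);
```

Only the first row of the sample AA data has a partner in BB, so a single line comes out: fields one, two, five, and six of that row, then fields four, one, and five of the matching BB row.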

A final assumption is that the data lines of the input files are whitespace-delimited (the header line is already split on tabs), but this could easily be changed to tab-delimited.
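
The choice of delimiter matters as soon as a field can contain a space. A small sketch of the difference, using a made-up record:

```perl
use strict;
use warnings;

# Hypothetical record: three tab-separated fields, the second of which
# contains a space.
my $line = "id42\tJohn Smith\tNYC\n";
chomp $line;

my @by_space = split /\s+/, $line;   # breaks "John Smith" into two fields
my @by_tab   = split /\t/,  $line;   # keeps "John Smith" as one field

printf "whitespace: %d fields, tab: %d fields\n",
    scalar @by_space, scalar @by_tab;
```

With the whitespace split, every column after the name shifts by one, so the field numbers given on the command line would point at the wrong data. Note also that split drops trailing empty fields unless a negative limit is passed as the third argument.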


## merge.pl

use strict;

my $AFile = shift; my @Acols = split(/,/ => shift);
my $BFile = shift; my @Bcols = split(/,/ => shift);

## Read B into memory first
open(B, '<', $BFile) or die "Could not open $BFile: $!\n";

## Grab the header labels from the first line and store them for later:
my @HeaderB = split(/\t/, <B>); chomp @HeaderB;

## Now go through and save each line into a hash, where the key
## is the field to be matched, and the value is a reference to
## an array that holds all the fields
my %B;
while(<B>) {
  my @bar = split(/\s+/ => $_); ## Change to tab if needed
  @bar > 1 or next; ## Skip blank lines (a second field of "0" is still kept): add other validation if needed
  $B{$bar[$Bcols[0]-1]}=\@bar;
}
close(B);
shift @Bcols; ## Remove B's first header: we will use A's

open(A, '<', $AFile) or die "Could not open $AFile: $!\n";

## Print all the headers now:
my @HeaderA = split(/\t/, <A>); chomp @HeaderA;
for (@Acols) { print "$HeaderA[$_-1]\t"; }
for (@Bcols) { print "$HeaderB[$_-1]\t"; } ## Remember that shift?
print "\n";

## Save the offset of the "matching field" into a variable
## Mainly makes things easier to read below
my $A=$Acols[0]-1;

while(<A>) {
  my @bar = split(/\s+/ => $_); ## Change to tab if needed
  @bar > 1 or next; ## Skip blank lines (a second field of "0" is still kept)
  if ($B{$bar[$A]}) { ## We have a match from %B!
    ## Print all the A fields we want:
    for (@Acols) { print "$bar[$_-1]\t"; }
    ## Print all the B fields we want:
    for (@Bcols) { print "$B{$bar[$A]}[$_-1]\t"; } ## Field numbers are 1-based
    print "\n";
  }
}
close(A);

In reply to Re: Merging Files Conundrum: A Better Explanation(?) by turnstep
in thread Merging Files Conundrum: A Better Explanation(?) by Limo
