Comparing multiple entries from two files

hanger4 has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I've been working on a project for the past week or so and have gotten stumped on this step. It's easier to explain with an example so here it is...

There are two input files, an ID file of 3 columns and a Name file of four columns (the actual data are large ~ 100,000 rows):

ID file:
ID Catg_ID Pos
12 A 16
15 B 5
16 A 175

Name file:
Catg_Name Start Stop Name
A 8 19 jamm
A 110 112 bbc
E 170 256 vadd
A 14 18 cip

For each row in the ID file, I need to compare the Catg_ID variable to the Catg_Name variable in the Name file. Then, if the catg variables match, I want to see if Pos is between Start and Stop. For rows that meet both the above criteria, I want to combine them in a single output file.

So, to recap, IF Catg_ID = Catg_Name AND Pos > Start AND Pos < Stop I want to print "ID Catg_ID Pos Name"

And just to be clear, I need to compare not the first line in the ID file to the first line in the name file, but the first line in the ID file to every line in the name file, then the second line in the ID file to every line in the name file, and so on. Multiple matches are fine.

For the above data, I would want the program to print:
12 A 16 jamm
12 A 16 cip

Thanks in advance. I've tried to be as clear and specific as possible.

Re: Comparing multiple entries from two files by ww (Archbishop) on Jul 08, 2009 at 00:53 UTC
So what have you tried?

Welcome to the Monastery (and be not offended; many new monks have had a first post get that for an initial answer). But we'll likely be more helpful if you:

* Post the script you've tried, along with an explanation of how (error messages, warnings, unexpected or no output) it fails to do what you want.
* Be a bit more specific about your data: are the source files text (free text), CSV, out of a DB or what.
* Read How do I post a question effectively? and Markup in the Monastery.

If you haven't written any code yet, you might want to think about these semi-random observations:

* Do you have enough RAM to read "`~ 100,000 rows`" (x 2 ???) into memory at once? That may determine the manner in which you proceed.
* If you're just starting Perl and have no programming experience, you may need to tackle simpler tasks before tackling this one. In that case, you're in the right place: see Tutorials, use Super Search and consider a book such as Learning Perl.
Re: Comparing multiple entries from two files by jethro (Monsignor) on Jul 08, 2009 at 00:19 UTC
Use a hash to store all lines of the ID file with Catg_ID as key and the other values as data (as a string "12 16"). This should hopefully still fit into memory; if not, use something like DBM::Deep to store the hash on disk. Then just read the Name file line by line, check for the category ID in the hash and compare.
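A minimal sketch of the hash approach described above, using the sample data from the question. Two things here are assumptions and not part of the suggestion itself: because several ID rows can share a Catg_ID, each hash value is a list of [ID, Pos] pairs instead of a single string, and the helper name `match_positions` is hypothetical. The input is assumed to be whitespace-delimited with one header row per file.

```perl
use strict;
use warnings;

# Sample data from the question, kept in strings so the sketch is self-contained.
my $id_data = <<'END_ID';
ID Catg_ID Pos
12 A 16
15 B 5
16 A 175
END_ID

my $name_data = <<'END_NAME';
Catg_Name Start Stop Name
A 8 19 jamm
A 110 112 bbc
E 170 256 vadd
A 14 18 cip
END_NAME

# match_positions is a hypothetical helper name, not something from the thread.
# It takes two open filehandles and returns the matching output lines.
sub match_positions {
    my ($id_fh, $name_fh) = @_;

    # Index the ID file: category => list of [ID, Pos] pairs.
    my %by_catg;
    while (my $line = <$id_fh>) {
        my ($id, $catg, $pos) = split ' ', $line;
        next unless defined $pos;    # skip blank or short lines
        next if $id eq 'ID';         # skip the header row
        push @{ $by_catg{$catg} }, [ $id, $pos ];
    }

    # Stream the Name file and look only at ID rows in the same category.
    my @matches;
    while (my $line = <$name_fh>) {
        my ($catg, $start, $stop, $name) = split ' ', $line;
        next unless defined $name;          # skip blank or short lines
        next if $catg eq 'Catg_Name';       # skip the header row
        for my $rec (@{ $by_catg{$catg} || [] }) {
            my ($id, $pos) = @$rec;
            push @matches, "$id $catg $pos $name"
                if $pos > $start && $pos < $stop;
        }
    }
    return @matches;
}

open my $id_fh,   '<', \$id_data   or die "Cannot open ID data: $!";
open my $name_fh, '<', \$name_data or die "Cannot open Name data: $!";
print "$_\n" for match_positions($id_fh, $name_fh);
```

With the sample data this prints the two lines the question asks for, "12 A 16 jamm" and "12 A 16 cip". Keying on the category means each Name row is compared only against ID rows in the same category, not against every ID row.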
Re: Comparing multiple entries from two files by hobbs (Monk) on Jul 08, 2009 at 07:56 UTC
Have you considered loading the data into a relational DB (SQLite would do just fine) and querying it with SQL? Certainly you could do the job with Perl, but processing significant numbers of rows, coordinating two different datasets based on shared identifiers, and doing filtering based on the joined data is exactly what relational databases were created for.
Re: Comparing multiple entries from two files by bichonfrise74 (Vicar) on Jul 08, 2009 at 17:01 UTC
Try something like this...

    #!/usr/bin/perl

    use strict;

    my $data_1 = <<EOF;
    ID Catg_ID Pos
    12 A 16
    15 B 5
    16 A 175
    EOF

    my %record;
    open( my $file_1, "<", \$data_1 )
        or die "Cannot open data_1\n";

    while (<$file_1>) {
        next if ( /^ID/ );
        s/(\d+)\s//;
        $record{$1} = [ split ];
    }

    while (<DATA>) {
        next if ( /^Catg/ );
        my @cols = split;

        for my $i ( keys %record ) {
            print "$i $cols[0] $record{$i}->[1] $cols[3]\n"
                if (( $record{$i}->[0] eq $cols[0] )
                    && ( $record{$i}->[1] > $cols[1] )
                    && ( $record{$i}->[1] < $cols[2] ))
        }
    }

    __DATA__
    Catg_Name Start Stop Name
    A 8 19 jamm
    A 110 112 bbc
    E 170 256 vadd
    A 14 18 cip
Re^2: Comparing multiple entries from two files by hanger4 (Initiate) on Jul 08, 2009 at 21:21 UTC
Thank you bichonfrise74. Your suggestion works very well. It is much simpler and quite a bit faster than what I was using:

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $idfile = 'C:...';
    open(ID, $idfile) or die "Cannot open file '$idfile' \n\n";

    #Create and open output file
    my $out = 'C:\Users\Clayton\Documents\Research\GO\Data\chr_8\chr_8_posgen';
    open (OUT, ">$out");

    # Create Variables
    my $line;
    my $line2;
    my %hash = ();

    # Read in file line by line
    while ( $line = <ID> ) {
        chomp $line;

        # Split the tab delimited file into an array
        my @arr = split /\t/, $line;

        #if(defined($arr[2])){
        # Put the columns of the array into variables
        my $id = $arr[0];
        my $catg_id = $arr[1];
        my $pos = $arr[2];

        $hash{$id} = "$catg_id\t$pos";
    }
    close ID;

    my @k = keys %hash;
    my $k;

    my $namefile = 'C:...';
    open(NAMEFILE, $namefile) or die "Cannot open file '$namefile' \n\n";
    while ( $line2 = <NAMEFILE> ) {
        chomp $line2;
        my @loc = split /\t/, $line2;
        foreach $k (@k) {
            my @arr2 = split /\t/, $hash{$k};
            if($loc[0] == $arr2[0]){
                if(($arr2[1] >= $loc[1]) and ($arr2[1] <= $loc[2])){
                    print OUT "$arr2[0]\t$k\t$arr2[1]\t$loc[1]\t$loc[2]\t$loc[3]\n"
                }
            }
        }
    }

I'm still new to Perl so I guess a lot of my methods are not the most efficient. One more question though... Whenever I use your code or mine I get an error "Use of uninitialized value in numeric gt (>) at C:\...line 32, <NAMEFILE> line 1." This error repeats many times for each line in the file. The output is what I expect it to be though, and the program runs without any problems (other than generating a bunch of warnings).
Re^3: Comparing multiple entries from two files by toolic (Bishop) on Jul 08, 2009 at 23:31 UTC
You should eliminate the warnings since they may indicate bugs in your code. I do not know which line is line # 32, but perhaps you could try to determine the cause of the warnings by printing the contents of your `@loc` and `@arr2` arrays. Do they have as many elements as you think they have?
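One way to run that check, sketched under assumptions: the helper name `split_checked` is hypothetical, and the idea that a line might not be tab-delimited is only one possible cause that the thread does not confirm. The helper splits a line on tabs, as the posted script does, and reports any line that yields fewer fields than expected instead of letting an undefined element reach a numeric comparison.

```perl
use strict;
use warnings;

# split_checked is a hypothetical helper name. It splits a line on tabs and
# returns the fields, or prints a diagnostic to STDERR and returns an empty
# list when the line has fewer fields than expected.
sub split_checked {
    my ($line, $expected) = @_;
    chomp $line;
    my @fields = split /\t/, $line;
    return @fields if @fields >= $expected;
    warn "Expected $expected fields, got " . scalar(@fields)
       . ": [" . join('|', @fields) . "]\n";
    return;
}

# A tab-delimited line splits into four fields.
my @ok = split_checked("A\t8\t19\tjamm\n", 4);

# A space-delimited line stays as one field when split on tabs,
# so the helper reports it and returns nothing.
my @bad = split_checked("A 8 19 jamm\n", 4);

print scalar(@ok), " ", scalar(@bad), "\n";
```

In the posted script, the same test could guard the body of each read loop, for example `my @loc = split_checked($line2, 4) or next;`, so that short lines are reported once and skipped.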
Re^4: Comparing multiple entries from two files by Anonymous Monk on Jul 09, 2009 at 00:06 UTC