compare two files on the basis of Two IDs

genome has asked for the wisdom of the Perl Monks concerning the following question:

Re: compare two files on the basis of Two IDs by Marshall (Canon) on Sep 26, 2016 at 22:20 UTC
A few points:

- You need () around the whole "if statement" clause, as hippo pointed out.
- Your code will run very slowly because it re-reads the complete file2 for every line in file1. If file2 is big, this makes a significant difference. Consider reading one of the files into memory to avoid a lot of slow file system I/O. A hash-based data structure for that in-memory data will also speed things up considerably compared with a linear search.
- Consider splitting on /\s+/ or ' ' instead of \t. That splits on a sequence of one or more whitespace characters, which include \t, \n and the actual space character, so a chomp is not needed after a split like that. Also, if you get a file that has actual spaces instead of tabs, the code will still work.
- Consider using "use strict" and "my" variables. That gives additional compile-time error information that is helpful. But the code "as is" produces the appropriate error message.
- Consider indenting the code to show the "levels" better. What you have is hard to read.

Update: You say '"last;" is also not working here..'. I don't see any "last;" statement in the code; it should be fine if put in the right place. Step 1: get your code to compile. I suggest you read the chomp documentation to understand what chomp() does. If you insist on splitting on \t, chomp the input line first, as the docs suggest.

Update2: Some code. I can tell that you are a beginner at Perl, and because this doesn't look like homework, I wrote some code for you that incorporates my advice above. I hope some actual code is easier to understand than general advice. Please play with this and adapt it to your needs.

It is possible to make an "in memory" data structure of either file1 or file2. In this case, I picked file 1 and generated a hash table from it. The keys of this "file 1 hash table" look like "chr17:69112551", and the value of each key is set to "1", although that "1" value is never used in my code.
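To make the split advice above concrete, here is a minimal sketch (the sample line is made up for illustration, not taken from the original files). It compares splitting on a literal tab with the awk-style `split ' '`, and shows why the second form makes a separate chomp unnecessary:

```perl
use strict;
use warnings;

# Illustrative input line: tab-separated, with the trailing newline still attached.
my $line = "chr1\t66953622\t66953654\n";

# Splitting on a literal tab leaves the newline glued to the last field.
my @by_tab = split /\t/, $line;

# split ' ' skips leading whitespace, splits on runs of any whitespace,
# and so drops the trailing newline as well.
my @by_space = split ' ', $line;

printf "tab split:   last field ends in newline? %s\n",
    ( $by_tab[-1] =~ /\n\z/ ? 'yes' : 'no' );
printf "space split: last field ends in newline? %s\n",
    ( $by_space[-1] =~ /\n\z/ ? 'yes' : 'no' );
```

The same `split ' '` call also handles a file that uses spaces instead of tabs, which is the robustness point made above.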
From looking at your code, it appears that the desired output is one line for each line in file 2. In your code, `if ($ary[0] eq $any[0] and $ary[1] == $any[1])` has been transformed into: `if ( exists ($file1_hash{"$any[0]:$any[1]"}))`. Using a combined hash key like this expresses the "and" function. Then there is another condition for the "or" function.

The net effect of code like this is that each line in file 1 or file 2 is only read once. File I/O is "expensive" in terms of run time. Every line in file 1 is read and a hash table is created. Then, for each line in file 2, the line is read, parsed, and a decision is reached based upon the result of 2 look-ups into the file 1 hash. These hash look-ups are very efficient and scale to very big files.

I couldn't see any way to get an "E" with your test data, so I added some extra data to my test cases. In the future, it is best if you can provide an example "desired output" that demonstrates the basic decisions which need to be made.

Have fun. Ask questions if you don't understand. If I made a mistake and didn't understand something, asking about that is fine too.

Oops, just had a thought: did you want an output line per line of file 1 instead of file 2? In that case the code changes; possibly make a HoA (Hash of Arrays) out of file 2 to start off.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Data::Dumper;

    my $file1 = <<END;
    chr17 69112551
    chr1 67058869
    chr7 151046672
    chr7 151047369
    chr1 66953654
    END

    my $file2 = <<END;
    chr1 66953622 66953654
    chr1 67200451 67200472
    chr1 67200475 67200478
    chr1 67058869 67058880
    chr1 67058881 67058885
    chr1 67058887 67058895
    END

    open my $infile1, '<', \$file1 or die "unable to open first file $!";
    open my $infile2, '<', \$file2 or die "unable to open 2nd file $!";

    ### create memory structure of file 1:
    ### so that we only have to read file1 once!
    #
    my %file1_hash;
    while (my $line = <$infile1>)
    {
        next if $line =~ /^\s*$/;   # skip blank lines (a common infile goof)
        my ($key, $value) = split /\s+/, $line;   # use better "names"; I have
                                                  # no idea of what a chr col means
        $file1_hash{"$key:$value"} = 1;
    }
    close $infile1;   # file handle closure is optional, but I'd do it.

    ### process each line in file2:
    ### If a line "matches" with any line in file1, then "E", else "M"
    ### I don't know what these numbers mean, come up with a better comment.
    while (my $line = <$infile2>)
    {
        chomp $line;                # so that output with E or M can be on same line
        next if $line =~ /^\s*$/;   # skip blank lines (a common infile goof)
        my ($chr, $val1, $val2) = split /\s+/, $line;
        if (   exists $file1_hash{"$chr:$val1"}
            or exists $file1_hash{"$chr:$val2"} )
        {
            print "$line\tE\n";     # match exists with file 1
        }
        else
        {
            print "$line\tM\n";     # match does NOT exist with file 1
        }
    }
    __END__
    Prints the following:
    chr1 66953622 66953654 E
    chr1 67200451 67200472 M
    chr1 67200475 67200478 M
    chr1 67058869 67058880 E
    chr1 67058881 67058885 M
    chr1 67058887 67058895 M
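The "one output line per line of file 1" variant mentioned above could be sketched roughly as follows. This is only one possible reading of the suggestion: the helper names `build_index` and `tag_file1` are hypothetical, the sample rows are a subset of the test data, and the sketch assumes a file 1 row counts as "E" when its position equals either the second or the third column of some file 2 row on the same chromosome.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build a Hash of Arrays from the file 2 rows.
# Each row is filed under both of its positions: "chr:start" and "chr:end".
sub build_index {
    my (@file2_lines) = @_;
    my %index;    # "chr:pos" => [ file 2 rows that mention that position ]
    for my $line (@file2_lines) {
        next if $line =~ /^\s*$/;    # skip blank lines
        my ($chr, $start, $end) = split ' ', $line;
        push @{ $index{"$chr:$start"} }, $line;
        push @{ $index{"$chr:$end"} },   $line;
    }
    return \%index;
}

# Hypothetical helper: tag each file 1 row with E (found in index) or M (not found).
sub tag_file1 {
    my ($index, @file1_lines) = @_;
    my @out;
    for my $line (@file1_lines) {
        next if $line =~ /^\s*$/;
        my ($chr, $pos) = split ' ', $line;
        my $tag = exists $index->{"$chr:$pos"} ? 'E' : 'M';
        push @out, "$chr\t$pos\t$tag";
    }
    return @out;
}

my @file2 = ( "chr1 66953622 66953654", "chr1 67058869 67058880", "chr1 67200451 67200472" );
my @file1 = ( "chr17 69112551", "chr1 67058869", "chr1 66953654" );

my $index = build_index(@file2);
print "$_\n" for tag_file1( $index, @file1 );
```

Keeping the matching file 2 rows in the array values (rather than a bare "1") means they are available if the output later needs to show which file 2 row produced the match.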
Re^2: compare two files on the basis of Two IDs by genome (Novice) on Sep 27, 2016 at 21:31 UTC
Hi,

Thanks for your reply. I am trying to print the result for each line of file 1, not file 2. I tried the code below, but it did not work for me.

    #!/usr/bin/perl
    #use warnings;
    #use strict;
    #use Data::Dumper;
    my $file1 = $ARGV[0];
    open($infile1,$file1);
    my $file2 = $ARGV[1];
    open($infile2,$file2);
    my %file2_hash;
    while (my $line = <$infile2>)
    {
        next if $line =~ /^\s$/; #skip blank lines (a common infile goof)
        my ($key, $value1, $value2) = split /\s+/, $line; # use better "names" I have
                                                          # no idea of what a chr col means
        $file2_hash{"$key:$value1:$value2"} = 1;
    }
    close $infile2;
    while (my $line = <$infile1>)
    {
        chomp $line; #so that output with E or M can be on same line
        next if $line =~ /^\s$/; #skip blank lines (a common infile goof)
        my ($chr, $value1) = split /\s+/,$line;
        if (exists $file2_hash{"$chr:$value1"} or exists $file2_hash{"$chr:$value2"} )
        {
            print "$line\tE\n"; # match exists with file 1
        }
        else
        {
            print "$line\tM\n"; # match does NOT exist with file 1
        }
    }
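One likely reason the code above never reports a match can be seen from the key shapes alone: the hash is filled with three-part keys ("chr:value1:value2") but queried with two-part keys ("chr:value1"), and `$value2` is not set in the second loop (with `use strict` enabled, that would be a compile error). The following minimal sketch reproduces the mismatch on one row and shows one possible fix; the fix is an assumption about the intended matching rule, not something stated in the post.

```perl
use strict;
use warnings;

# Keys stored the way the posted code stores them: chr:start:end.
my %file2_hash;
my ($key, $value1, $value2) = split ' ', "chr1 67058869 67058880";
$file2_hash{"$key:$value1:$value2"} = 1;

# Look-up the way the posted code looks up: chr:pos. The shapes differ,
# so this can never be found.
my ($chr, $pos) = split ' ', "chr1 67058869";
my $three_part_hit = exists $file2_hash{"$chr:$pos"} ? 1 : 0;

# Possible fix (assumed intent): store one two-part key per position.
my %by_position;
$by_position{"$key:$_"} = 1 for ($value1, $value2);
my $two_part_hit = exists $by_position{"$chr:$pos"} ? 1 : 0;

print "three-part keys: $three_part_hit, two-part keys: $two_part_hit\n";
```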
Re^3: compare two files on the basis of Two IDs by marinersk (Priest) on Sep 28, 2016 at 03:05 UTC
This code works for me. Is there something wrong with the output?

a.dat:

    chr17 69112551
    chr1 67058869
    chr7 151046672
    chr7 151047369
    chr1 66953654

b.dat:

    chr1 66953622 66953654
    chr1 67200451 67200472
    chr1 67200475 67200478
    chr1 67058869 67058880
    chr1 67058881 67058885
    chr1 67058887 67058895

Results:

    S:\Steve\PerlMonks>compare.pl a.dat b.dat
    chr17 69112551 M
    chr1 67058869 M
    chr7 151046672 M
    chr7 151047369 M
    chr1 66953654 M

    S:\Steve\PerlMonks>
Re^4: compare two files on the basis of Two IDs by genome (Novice) on Sep 28, 2016 at 13:39 UTC
Re^5: compare two files on the basis of Two IDs by Marshall (Canon) on Sep 28, 2016 at 21:37 UTC
Re: compare two files on the basis of Two IDs by hippo (Archbishop) on Sep 26, 2016 at 21:39 UTC
"The if condition of this prog. is working good"

Actually, it isn't. I know this because your code doesn't even compile:

    $ perl -c 1172674.pl
    syntax error at 1172674.pl line 21, near ") or"
    syntax error at 1172674.pl line 26, near "else"
    syntax error at 1172674.pl line 30, near "}"
    1172674.pl had compilation errors.

Your `if` statement is badly formed, which is why it won't compile.
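The original code is not shown in this thread, so the following is only a guess at the shape of the problem, based on the `near ") or"` error: an `if` whose condition closes its parenthesis before the `or`. The sub name `matches` and its parameters are hypothetical. The sketch shows the malformed form in a comment and a compiling form with one outer pair of parentheses around the whole condition.

```perl
use strict;
use warnings;

# Malformed (will not compile): the condition ends at the first ")",
# so Perl sees a stray "or" where it expects the block.
#   if ($chr1 eq $chr2 and $pos == $start) or ($chr1 eq $chr2 and $pos == $end) { ... }

# Well-formed: one outer pair of parentheses encloses the entire condition.
sub matches {
    my ($chr1, $pos, $chr2, $start, $end) = @_;
    if (   ( $chr1 eq $chr2 and $pos == $start )
        or ( $chr1 eq $chr2 and $pos == $end ) )
    {
        return 1;
    }
    return 0;
}

print matches( 'chr1', 67058869, 'chr1', 67058869, 67058880 ), "\n";    # prints 1
print matches( 'chr1', 67058869, 'chr1', 67200451, 67200472 ), "\n";    # prints 0
```

Running `perl -c` on a script, as done above, is a quick way to confirm that it compiles before worrying about its logic.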