Input file1: col1 col2 col3 col4 ZGLP1 ICAM4 13.27 0.2425 ICAM4 ZGLP1 13.27 0.2425 RRP1B CDH24 20.8 1 ZGLP1 OOEP 18.79 0.3060 ZGLP1 RRP1B 39.62 0.2972 ZGLP1 CDH24 51.21 0.2560 BBCDI DND1 19.44 0.2833 BBCDI SOHLH2 36.61 0.2909 DND1 SOHLH2 18 0.8
Input file2: chr8 18640000 18960000 ZGLP1 RRP1B CDH24 #gene number he +re is not fixed can be #4 #5 or more chr8 19000000 19080000 BBCDI DND1 SOHLH2 #gene number he +re is not fixed can be #4 #5 or more
I have written a code which compares col1 and col2 of file1 with each line of file2 such that, if any of the pair falls anywhere in a line of file2 then programme should print "chromosome pos1 pos2 and the matching content of the file1 with values
output file: chr8 18640000 18960000 ZGLP1 RRP1B 39.62 0.2972 chr8 18640000 18960000 ZGLP1 CDH24 51.21 0.2560 chr8 18640000 18960000 RRP1B CDH24 20.8 1 chr8 19000000 19080000 BBCDI DND1 19.44 0.2833 chr8 19000000 19080000 BBCDI SOHLH2 36.61 0.2909 chr8 19000000 19080000 DND1 SOHLH2 18 0.8
so far I have tried this but it is taking so much time as my input files are huge (2gb).
my perl code open( AB, "file1" ) || die("cannot open"); open( BC, "file2" ) || die("cannot open"); open( OUT, ">output.txt" ); @file = <AB>; chomp(@file); @data = <BC>; chomp(@data); foreach $fl (@file) { if ( $fl =~ /(.*?)\s+(.*?)\s+(.*?)\s+(.*)/ ) { $one = $1; $two = $2; $thr = $3; $for = $4; } foreach $line (@data) { if ( $line =~ /(.*?)\s+(.*?)\s+(.*?)\s+(.*)+/ ) { $chr = $1; $pos1 = $2; $pos2 = $3; } if ( $line =~ /$one/ ) { if ( $line =~ /$two/ ) { print OUT $chr, "\t", $pos1, "\t", $pos2, "\t", $fl, " +\n"; } } } }
In reply to how to speed up pattern match between two files by Anonymous Monk
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |