sesemin has asked for the wisdom of the Perl Monks concerning the following question:
I have two files like the examples below (one with two columns and one with four columns). I want to find the elements common to both files: a line of the second file counts as a match if its first and second columns match an entry from the first file, and also its third col == 1 and its fourth col >= 3.
I wrote the following code, but it is not very efficient. The comparisons take forever because of too many loops and conditions.
Any suggestions are appreciated.
Pedro
FILE1:
CLS_S3_Contig2719-591_592 1
CLS_S3_Contig2720-784_785 1
CLS_S3_Contig2721-139_140 1
CLS_S3_Contig2722-387_388 1
CLS_S3_Contig2724-557_560 2
CLS_S3_Contig2725-465_466 1
CLS_S3_Contig2726-627_650 12
CLPX6160.b1_O03.ab1-229_232 2
CLPX6260.b1_H05.ab1-511_512 1
CLPX627.b1_E14.ab1-373_398 13
CLPX6271.b1_N07.ab1-85_86 1
. . .

FILE2:
CLS_S3_Contig1000 82 1 0
CLS_S3_Contig1000 83 1 0
CLS_S3_Contig1000 84 1 0
CLS_S3_Contig1000 85 1 0
CLS_S3_Contig1000 86 1 5
CLS_S3_Contig1000 87 1 0
CLS_S3_Contig1000 88 1 0
CLS_S3_Contig1000 89 1 0
CLS_S3_Contig1000 90 1 8
CLS_S3_Contig1000 91 1 0
CLS_S3_Contig1000 92 1 0
CLS_S3_Contig1000 93 0 0
CLS_S3_Contig1000 94 0 0
CLS_S3_Contig1000 95 0 9
CLS_S3_Contig1000 96 0 0
CLS_S3_Contig1000 97 0 0
CLS_S3_Contig1000 98 0 0
CLS_S3_Contig1000 99 1 0
CLS_S3_Contig1000 100 1 0
CLS_S3_Contig1000 101 1 0
CLS_S3_Contig1000 102 1 0
CLS_S3_Contig1000 103 1 3
CLS_S3_Contig1000 104 1 0
CLS_S3_Contig1000 105 1 0
. . .
################################################################
# Read the first file, break the first col into its components #
# Expand the last two numbers e.g. (591_592) plus/minus 8      #
# Make a hash of multiple values for each key                  #
# Print the number of lines read and put into a variable       #
################################################################
my %file1=();
while(<INPUT1>){
    chomp;
    (my $id, my $number) = split("\t", $_);
    if ($id=~ m/^(CLS_S3_Contig[0-9]+)([-]?)([0-9]+)([_]?)([0-9]+)$/i) {
        my $matched_id=$id;
        # breaks the CLS_Contig1000_200-202 into its components
        # and expands the second col plus minus 8
        for (my $i=$3-8;$i<$5+8;$i++){
            print join ("\t", $1, $i), "\n";
            push (@{$file1{$1}}, $i); # make a hash of arrays
        }
    }
}
# Count the number of lines minus header line
my $counter_1 = `wc -l < $ARGV[0]`;
die "wc failed: $?" if $?;
chomp($counter_1);
my $counter = $counter_1 -1; # First file has a header row
print "$counter lines read from $ARGV[0] file\n";
close(INPUT1);

###########################################################
# Reading the second file                                 #
###########################################################
print "Reading the 2nd file\n";
print "It may take a while, please wait...\n";
print "-----------------------------------\n";
while(<INPUT2>){
    chomp;
    my @current_line = split /\t/;
    foreach my $key (sort keys %file1){
        foreach my $position1 (@{$file1{$key}}){
            if ($current_line[0] eq $key) {
                if ($current_line[1] == $position1) {
                    if ($current_line[2] ==1) {
                        if ($current_line[3] >= 3) {
                            print join ("\t", $current_line[0],$current_line[1],$current_line[2],$current_line[3], "***",$key, $position1), "\n";
                        }
                    }
                }
            }
        }
    }
}
close (INPUT2);
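For comparison, the nested loops over every key and every position can usually be replaced by a direct hash lookup. The following is only a minimal sketch, not a tested replacement for the script above. The sub names `build_lookup` and `filter_hits` are hypothetical, the regex assumes the `-` and `_` separators are always present, and the sample lines in the usage part are made up to resemble FILE1 and FILE2. It keeps the same window as the original `for` loop (start-8 up to, but not including, end+8).

```perl
use strict;
use warnings;

# Build a lookup table: $wanted->{contig}{position} = 1.
# $lines is a reference to a list of FILE1 lines ("id<TAB>count").
# build_lookup is a hypothetical helper name, not part of the original script.
sub build_lookup {
    my ($lines, $pad) = @_;
    my %wanted;
    for my $line (@$lines) {
        chomp(my $copy = $line);
        my ($id) = split /\t/, $copy;
        # Assumption: both separators are required here.
        next unless defined $id
            && $id =~ /^(CLS_S3_Contig[0-9]+)-([0-9]+)_([0-9]+)$/i;
        my ($contig, $start, $end) = ($1, $2, $3);
        # Same window as the original loop: start-pad .. end+pad-1.
        $wanted{$contig}{$_} = 1 for ($start - $pad) .. ($end + $pad - 1);
    }
    return \%wanted;
}

# Return the FILE2 lines whose (contig, position) is in the table
# and whose third column is 1 and fourth column is >= 3.
sub filter_hits {
    my ($wanted, $lines) = @_;
    my @hits;
    for my $line (@$lines) {
        chomp(my $copy = $line);
        my ($contig, $pos, $flag, $depth) = split /\t/, $copy;
        # Skip short lines and non-numeric rows such as a header.
        next unless defined $depth
            && $flag =~ /^[0-9]+$/ && $depth =~ /^[0-9]+$/;
        next unless $flag == 1 && $depth >= 3;
        # One hash lookup per line instead of a scan over all keys.
        next unless exists $wanted->{$contig} && $wanted->{$contig}{$pos};
        push @hits, $copy;
    }
    return @hits;
}

# Usage with made-up sample lines (window for 90_91 is 82..98):
my $wanted = build_lookup(["CLS_S3_Contig1000-90_91\t1"], 8);
print "$_\n" for filter_hits($wanted, [
    "CLS_S3_Contig1000\t86\t1\t5",   # in window, flag 1, depth 5: kept
    "CLS_S3_Contig1000\t95\t0\t9",   # flag is 0: skipped
    "CLS_S3_Contig1000\t103\t1\t3",  # outside window: skipped
]);
```

With this layout each line of the second file costs a single lookup, so the run time should grow with the size of the two files rather than with their product. For real files the lines would be read with `while (<$fh>)` instead of being passed as lists.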
Replies are listed 'Best First'.
Re: Reading two files, cmp certain cols
by jethro (Monsignor) on Sep 19, 2008 at 02:11 UTC
by sesemin (Beadle) on Sep 19, 2008 at 02:27 UTC
by sesemin (Beadle) on Sep 19, 2008 at 16:40 UTC
by jethro (Monsignor) on Sep 19, 2008 at 23:39 UTC
by sesemin (Beadle) on Sep 20, 2008 at 20:36 UTC
by Cristoforo (Curate) on Sep 20, 2008 at 23:17 UTC

Re: Reading two files, cmp certain cols
by mscharrer (Hermit) on Sep 19, 2008 at 09:02 UTC

Re: Reading two files, cmp certain cols
by Cristoforo (Curate) on Sep 19, 2008 at 16:15 UTC
by sesemin (Beadle) on Sep 21, 2008 at 09:09 UTC
by sesemin (Beadle) on Sep 19, 2008 at 18:06 UTC
by FunkyMonk (Bishop) on Sep 19, 2008 at 20:46 UTC
by sesemin (Beadle) on Sep 20, 2008 at 05:43 UTC
by sesemin (Beadle) on Sep 20, 2008 at 22:54 UTC