
Dear Monks,

I have two files like the examples below, one with two columns and one with four. I want to find the rows of the second file whose first and second columns match an entry from the first file, and whose third column == 1 and fourth column >= 3.

I wrote the following code, but it is not very efficient. The comparisons take forever because there are too many nested loops and conditions.

Any suggestions are appreciated.

Pedro


FILE1:

CLS_S3_Contig2719-591_592       1
CLS_S3_Contig2720-784_785       1
CLS_S3_Contig2721-139_140       1
CLS_S3_Contig2722-387_388       1
CLS_S3_Contig2724-557_560       2
CLS_S3_Contig2725-465_466       1
CLS_S3_Contig2726-627_650       12
CLPX6160.b1_O03.ab1-229_232     2
CLPX6260.b1_H05.ab1-511_512     1
CLPX627.b1_E14.ab1-373_398      13
CLPX6271.b1_N07.ab1-85_86       1

.
.
.

FILE2:

CLS_S3_Contig1000    82    1    0
CLS_S3_Contig1000    83    1    0
CLS_S3_Contig1000    84    1    0
CLS_S3_Contig1000    85    1    0
CLS_S3_Contig1000    86    1    5
CLS_S3_Contig1000    87    1    0
CLS_S3_Contig1000    88    1    0
CLS_S3_Contig1000    89    1    0
CLS_S3_Contig1000    90    1    8
CLS_S3_Contig1000    91    1    0
CLS_S3_Contig1000    92    1    0
CLS_S3_Contig1000    93    0    0
CLS_S3_Contig1000    94    0    0
CLS_S3_Contig1000    95    0    9
CLS_S3_Contig1000    96    0    0
CLS_S3_Contig1000    97    0    0
CLS_S3_Contig1000    98    0    0
CLS_S3_Contig1000    99    1    0
CLS_S3_Contig1000    100    1    0
CLS_S3_Contig1000    101    1    0
CLS_S3_Contig1000    102    1    0
CLS_S3_Contig1000    103    1    3
CLS_S3_Contig1000    104    1    0
CLS_S3_Contig1000    105    1    0
.
.
.

################################################################
# Read the first file, break the first col into its components #
# Expand the last two numbers, e.g. (591_592), plus/minus 8    #
# Make a hash of multiple values for each key                  #
# Print the number of lines read and put into a variable       #
################################################################

my %file1=();
while(<INPUT1>){
    chomp;
    (my $id, my $number) = split("\t", $_);

    if ($id=~ m/^(CLS_S3_Contig[0-9]+)([-]?)([0-9]+)([_]?)([0-9]+)$/i) {

        my $matched_id=$id; # breaks the CLS_Contig1000_200-202 into its components
                            # and expands the second col plus/minus 8
        for (my $i=$3-8;$i<$5+8;$i++){
            print join ("\t", $1, $i), "\n";
            push (@{$file1{$1}}, $i); # make a hash of arrays
        }
    }
}

# Count the number of lines minus the header line

my $counter_1 = `wc -l < $ARGV[0]`;
die "wc failed: $?"
    if $?;
chomp($counter_1);
my $counter = $counter_1 -1; # First file has a header row
print "$counter lines read from  $ARGV[0] file\n";

close(INPUT1);


###########################################################
#                    Reading the Second file              #
###########################################################

print "Reading the 2nd file\n";
print "It may take a while, please wait...\n";
print "-----------------------------------\n";



while(<INPUT2>){
    chomp;
    my @current_line  = split /\t/;

    foreach my $key (sort keys %file1){
        foreach my $position1 (@{$file1{$key}}){
            if ($current_line[0] eq $key) {
                if ($current_line[1] == $position1) {
                    if ($current_line[2] ==1) {
                        if ($current_line[3] >= 3) {

                            print join ("\t", $current_line[0],$current_line[1],$current_line[2],$current_line[3], "***",$key, $position1), "\n";
                        }
                    }
                }
            }
        }
    }
}

close (INPUT2);
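
The slow part above is that every line of the second file is compared against every key and every position stored from the first file. A possible way around that, as a minimal sketch rather than a tested drop-in replacement, is to store the expanded positions in a hash of hashes and test each line of the second file with a direct lookup. The helper names build_index and filter_rows are hypothetical, the regex is simplified to require the "-" and "_" separators, and the first-file line used for Contig1000 is made up for illustration, since the sample data shown has no overlapping contig.

```perl
use strict;
use warnings;

# Build a lookup table: $wanted->{contig}{position} = 1.
# The range mirrors the loop in the original code: from start-8 up to,
# but not including, end+8.
sub build_index {
    my @lines = @_;
    my %wanted;
    for my $line (@lines) {
        my ($id) = split /\t/, $line;
        next unless defined $id
            && $id =~ /^(CLS_S3_Contig[0-9]+)-([0-9]+)_([0-9]+)$/i;
        my ($contig, $start, $end) = ($1, $2, $3);
        $wanted{$contig}{$_} = 1 for ($start - 8) .. ($end + 7);
    }
    return \%wanted;
}

# Keep rows whose (contig, position) is in the table and whose
# third column == 1 and fourth column >= 3.
sub filter_rows {
    my ($wanted, @rows) = @_;
    my @hits;
    for my $row (@rows) {
        my ($contig, $pos, $flag, $count) = split /\t/, $row;
        next unless defined $count;                    # skip short or blank lines
        next unless $flag == 1 && $count >= 3;         # cheap checks first
        next unless exists $wanted->{$contig}          # avoid autovivification
            && $wanted->{$contig}{$pos};
        push @hits, $row;
    }
    return @hits;
}

# Hypothetical first-file line (not in the sample data) covering 82..99.
my $wanted = build_index("CLS_S3_Contig1000-90_92\t1");

my @hits = filter_rows($wanted,
    "CLS_S3_Contig1000\t86\t1\t5",
    "CLS_S3_Contig1000\t90\t1\t8",
    "CLS_S3_Contig1000\t95\t0\t9",
    "CLS_S3_Contig1000\t103\t1\t3",
);

print "$_\n" for @hits;
```

With this layout each line of the second file costs two hash lookups instead of a pass over every stored position. Unlike the original loops, a position that falls inside several overlapping ranges is reported only once, and the trailing "***", key and position columns are not printed.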

In reply to Reading two files, cmp certain cols by sesemin
