Dear Monks,

I have two files like the examples below: one with two columns and one with four. I want to find the rows of the second file whose first and second columns match an entry from the first file, and whose third column == 1 and fourth column >= 3.

I wrote the following code, but it is not very efficient. It takes forever to make the comparisons because of too many loops and conditions.

Any suggestions are appreciated.

Pedro

FILE1:
CLS_S3_Contig2719-591_592 1
CLS_S3_Contig2720-784_785 1
CLS_S3_Contig2721-139_140 1
CLS_S3_Contig2722-387_388 1
CLS_S3_Contig2724-557_560 2
CLS_S3_Contig2725-465_466 1
CLS_S3_Contig2726-627_650 12
CLPX6160.b1_O03.ab1-229_232 2
CLPX6260.b1_H05.ab1-511_512 1
CLPX627.b1_E14.ab1-373_398 13
CLPX6271.b1_N07.ab1-85_86 1
. . .

FILE2:
CLS_S3_Contig1000 82 1 0
CLS_S3_Contig1000 83 1 0
CLS_S3_Contig1000 84 1 0
CLS_S3_Contig1000 85 1 0
CLS_S3_Contig1000 86 1 5
CLS_S3_Contig1000 87 1 0
CLS_S3_Contig1000 88 1 0
CLS_S3_Contig1000 89 1 0
CLS_S3_Contig1000 90 1 8
CLS_S3_Contig1000 91 1 0
CLS_S3_Contig1000 92 1 0
CLS_S3_Contig1000 93 0 0
CLS_S3_Contig1000 94 0 0
CLS_S3_Contig1000 95 0 9
CLS_S3_Contig1000 96 0 0
CLS_S3_Contig1000 97 0 0
CLS_S3_Contig1000 98 0 0
CLS_S3_Contig1000 99 1 0
CLS_S3_Contig1000 100 1 0
CLS_S3_Contig1000 101 1 0
CLS_S3_Contig1000 102 1 0
CLS_S3_Contig1000 103 1 3
CLS_S3_Contig1000 104 1 0
CLS_S3_Contig1000 105 1 0
. . .
################################################################
# Read the first file, break the first col into its components #
# Expand the last two numbers, e.g. (591_592), plus/minus 8    #
# Make a hash with multiple values for each key                #
# Print the number of lines read and put into a variable       #
################################################################
my %file1 = ();
while (<INPUT1>) {
    chomp;
    (my $id, my $number) = split("\t", $_);
    if ($id =~ m/^(CLS_S3_Contig[0-9]+)([-]?)([0-9]+)([_]?)([0-9]+)$/i) {
        my $matched_id = $id;
        # breaks the CLS_Contig1000_200-202 into its components
        # and expands the second col plus minus 8
        for (my $i = $3 - 8; $i < $5 + 8; $i++) {
            print join("\t", $1, $i), "\n";
            push(@{ $file1{$1} }, $i);    # make a hash of arrays
        }
    }
}

# Count the number of lines minus header line
my $counter_1 = `wc -l < $ARGV[0]`;
die "wc failed: $?" if $?;
chomp($counter_1);
my $counter = $counter_1 - 1;    # First file has a header row
print "$counter lines read from $ARGV[0] file\n";
close(INPUT1);

###########################################################
# Reading the second file                                 #
###########################################################
print "Reading the 2nd file\n";
print "It may take a while, please wait...\n";
print "-----------------------------------\n";
while (<INPUT2>) {
    chomp;
    my @current_line = split /\t/;
    foreach my $key (sort keys %file1) {
        foreach my $position1 (@{ $file1{$key} }) {
            if ($current_line[0] eq $key) {
                if ($current_line[1] == $position1) {
                    if ($current_line[2] == 1) {
                        if ($current_line[3] >= 3) {
                            print join("\t", $current_line[0], $current_line[1],
                                $current_line[2], $current_line[3],
                                "***", $key, $position1), "\n";
                        }
                    }
                }
            }
        }
    }
}
close(INPUT2);
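
The second loop above walks every key and every expanded position for each line of the second file. One possible way around that, sketched here under assumptions rather than offered as a definitive fix, is to store each expanded "contig, position" pair as a single hash key, so every line of the second file needs only one lookup. The helper names expand_ids and find_matches are hypothetical, the input is assumed to be tab-separated as in the code above, and the FILE1 ID in the demo is made up so that it overlaps the FILE2 sample. The cheap numeric tests (third col == 1, fourth col >= 3) run first so most lines are rejected before the lookup.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: expand each FILE1 ID into "contig\tposition" keys.
# Uses the same range as the original loop: start-$pad up to, but not
# including, end+$pad.
sub expand_ids {
    my ($lines, $pad) = @_;
    my %wanted;
    for my $line (@$lines) {
        my ($id) = split /\t/, $line;
        next unless defined $id
            && $id =~ /^(CLS_S3_Contig[0-9]+)-?([0-9]+)_?([0-9]+)$/i;
        my ($contig, $start, $end) = ($1, $2, $3);
        $wanted{"$contig\t$_"} = 1 for ($start - $pad) .. ($end + $pad - 1);
    }
    return \%wanted;
}

# Hypothetical helper: keep FILE2 lines with col3 == 1, col4 >= 3 and a
# (contig, position) pair present in the lookup hash.
sub find_matches {
    my ($wanted, $lines) = @_;
    my @hits;
    for my $line (@$lines) {
        my ($contig, $pos, $flag, $count) = split /\t/, $line;
        next unless defined $count;                # skip short lines
        next unless $flag == 1 && $count >= 3;     # cheap tests first
        push @hits, $line if $wanted->{"$contig\t$pos"};
    }
    return \@hits;
}

# Demo: the FILE1 ID is invented; the FILE2 lines come from the sample.
my $wanted = expand_ids(["CLS_S3_Contig1000-90_91\t1"], 8);
my $hits   = find_matches($wanted, [
    "CLS_S3_Contig1000\t86\t1\t5",
    "CLS_S3_Contig1000\t90\t1\t8",
    "CLS_S3_Contig1000\t95\t0\t9",
    "CLS_S3_Contig1000\t103\t1\t3",
]);
print "$_\n" for @$hits;
```

With real files, the arrays would be replaced by reading each filehandle line by line and chomping. Two differences from the original are worth noting: overlapping ranges collapse into one hash key, so a matching line is reported once instead of once per overlap, and positions are compared as strings, so a position written with leading zeros would not match.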

In reply to Reading two files, cmp certain cols by sesemin
