sesemin has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I think this is a unique situation. I tried it with hash but I think because the keys of hash are unique I cannot implement if with hash. Then I switched to array of array which I think it will work. Obviously my code does not work but want to show one way that it can be implemented.

here is the problem and examples.

1- Read column one of File1

2- Split the string in that column to its components (gene name-startbase_endbase)

3- We will have genename, startbase, end base.

4- Put it into an array of array

5- Read column one of File2 6- Do the same thing as 2-4

7- Query through array of array of file1 and find common elements in file2 that:

a. First match genename if they matched then check if

b. Start position of the matched genename in file1 falls between start and end position of the same genename in file2

File1: CLS_S3_Contig2721-139_168 CLS_S3_Contig2722-375_390 CLS_S3_Contig2725-323_362 CLS_S3_Contig2725-455_480 CLS_S3_Contig2728-117_144 CLS_S3_Contig2728-437_472 CLS_S3_Contig2729-119_130 CLS_S3_Contig2729-163_220 CLS_S3_Contig2730-181_202 CLS_S3_Contig2730-361_384 CLS_S3_Contig2731-824_843 CLS_S3_Contig2731-1150_1201 CLS_S3_Contig2735-571_636 CLS_S3_Contig2735-677_710 CLS_S3_Contig2735-775_810 . . .
File2 CLS_S3_Contig2721-142_169 CLS_S3_Contig6525-509_514 CLS_S3_Contig6525-493_502 CLS_S3_Contig6525-503_508 CLS_S3_Contig2977-365_376 CLS_S3_Contig2977-77_82 CLS_S3_Contig2977-83_90 CLS_S3_Contig4978-271_274 CLS_S3_Contig4978-385_388 CLS_S3_Contig2730-365_389 . . .
Output: Genename(file1) start end ** Genename(file2) start end CLS_S3_Contig2721 139 168 ** CLS_S3_Contig2721 142 169 CLS_S3_Contig2730 361 384 ** CLS_S3_Contig2730 365 389 . .
while(<INPUT1>){ chomp; my @id = split /\t/; if ($id[0] =~ /(.+?)\-(\d+?)_(\d+)/) { my @line_map = ("$1", $2, $3); push @file_map, [@line_map]; } } close(INPUT1); while(<INPUT2>){ chomp; my @map_id = split /\t/; if ($tg_id[0] =~ /(.+?)\-(\d+?)_(\d+)/) { my @tg_id = ("$1", $2, $3); push @file_tg, [@tg_id]; } } if (($from_tg == $from_map) && ($to_tg == $to_map)){ print join("\t",$two_geno_id, $from_map,$to_map,"<-Ma +pside**TGside->",$two_geno_id, $from_tg, $to_tg, $from_map_tg_range, +$to_map_tg_range),"\n"; $lines_1++; } elsif (($from_tg < $to_map) && ($from_tg > $from_map)){ print join("\t",$two_geno_id, $from_map,$to_map,"<-Ma +pside**TGside->",$two_geno_id, $from_tg, $to_tg, $from_map_tg_range, +$to_map_tg_range),"\n"; $lines_9++; }

Replies are listed 'Best First'.
Re: Query through array of array
by puudeli (Pilgrim) on Jan 22, 2009 at 10:12 UTC

    But you can do it with hashes! Just use two hashes, one for each file. Then you can use the keys from data file 1 to find out whether corresponding key exists in the hash from data file 2. Below is a quick example:

    #! /usr/bin/perl use strict; use warnings; my @d1 = qw{ CLS_S3_Contig2721-139_168 CLS_S3_Contig2722-375_390 CLS_S3_Contig2725-323_362 CLS_S3_Contig2725-455_480 CLS_S3_Contig2728-117_144 CLS_S3_Contig2728-437_472 CLS_S3_Contig2729-119_130 CLS_S3_Contig2729-163_220 CLS_S3_Contig2730-181_202 CLS_S3_Contig2730-361_384 CLS_S3_Contig2731-824_843 CLS_S3_Contig2731-1150_1201 CLS_S3_Contig2735-571_636 CLS_S3_Contig2735-677_710 CLS_S3_Contig2735-775_810 }; my @d2 = qw{ CLS_S3_Contig2721-142_169 CLS_S3_Contig6525-509_514 CLS_S3_Contig6525-493_502 CLS_S3_Contig6525-503_508 CLS_S3_Contig2977-365_376 CLS_S3_Contig2977-77_82 CLS_S3_Contig2977-83_90 CLS_S3_Contig4978-271_274 CLS_S3_Contig4978-385_388 CLS_S3_Contig2730-365_389 }; my %d_one = (); my %d_two = (); for( @d1 ) { my ($key, $start, $end) = $_ =~ /(.*)-(\d+)_(\d+)/; $d_one{$key}{'start'} = $start; $d_one{$key}{'end'} = $end; } for( @d2 ) { my ($key, $start, $end) = $_ =~ /(.*)-(\d+)_(\d+)/; $d_two{$key}{'start'} = $start; $d_two{$key}{'end'} = $end; } print "Genename(file1) start end ** Genename(file2) start end\ +n\n"; foreach my $key_one ( keys %d_one ) { if( $d_two{$key_one} ) { if( $d_two{$key_one}{'start'} >= $d_one{$key_one}{'start'} && $d_two{$key_one}{'start'} <= $d_one{$key_one}{'end'} ) { print "$key_one $d_one{$key_one}{'start'} $d_one{$ +key_one}{'end'}" . "** " . "$key_one $d_two{$key_one}{'start'} $d_two{$ +key_one}{'end'}\n"; } } }
    Note, the limits are inclusive in my code. It produces the following output:
    cpt2jmo@phantom:~$ ./2_hashes.pl Genename(file1) start end ** Genename(file2) start end CLS_S3_Contig2721 139 168** CLS_S3_Contig2721 142 169 CLS_S3_Contig2730 361 384** CLS_S3_Contig2730 365 389

    Update: I removed the readmore -tags, since the code is not that long.

    --
    seek $her, $from, $everywhere if exists $true{love};
      Hi puudeli,

      Thank you very much for your nice code. I am just wondering if it works if there are redundant keys. As you can see in my input files the gene names may repeat over and over. This will be in jeopardy of key concept of hash in perl?

      thanks again

        Yes, that would ruin my approach. In this case I would introduce and array to the hash. It complicates things, but only a little bit :-)

        Update: Corrected a wrong hash key in the printout.

        Update2: Removed incorrect next statement.

        Here is the modified version. The algo can get heavy if the data is large, so maybe wiser monks could help a bit here?

        #! /usr/bin/perl use strict; use warnings; my @d1 = qw{ CLS_S3_Contig2721-139_168 CLS_S3_Contig2722-375_390 CLS_S3_Contig2725-323_362 CLS_S3_Contig2725-455_480 CLS_S3_Contig2728-117_144 CLS_S3_Contig2728-437_472 CLS_S3_Contig2729-119_130 CLS_S3_Contig2729-163_220 CLS_S3_Contig2730-181_202 CLS_S3_Contig2730-361_384 CLS_S3_Contig2731-824_843 CLS_S3_Contig2731-1150_1201 CLS_S3_Contig2735-571_636 CLS_S3_Contig2735-677_710 CLS_S3_Contig2735-775_810 }; my @d2 = qw{ CLS_S3_Contig2721-142_169 CLS_S3_Contig6525-509_514 CLS_S3_Contig6525-493_502 CLS_S3_Contig6525-503_508 CLS_S3_Contig2977-365_376 CLS_S3_Contig2977-77_82 CLS_S3_Contig2977-83_90 CLS_S3_Contig4978-271_274 CLS_S3_Contig4978-385_388 CLS_S3_Contig2730-365_389 }; { my %d_one = (); my %d_two = (); for( @d1 ) { my ($key, $start, $end) = $_ =~ /(.*)-(\d+)_(\d+)/; push @{ $d_one{$key} }, { 'start' => $start, 'end' => $end }; } for( @d2 ) { my ($key, $start, $end) = $_ =~ /(.*)-(\d+)_(\d+)/; push @{ $d_two{$key} }, { 'start' => $start, 'end' => $end }; + } use Data::Dumper; print Dumper(\%d_one, \%d_two); print "Genename(file1)\t\tstart\tend\t**\tGenename(file2)\t\tstart +\tend\n\n"; foreach my $key_one ( keys %d_one ) { if( $d_two{$key_one} ) { # We found a matching key! foreach my $limits_one ( @{ $d_one{$key_one} } ) { foreach my $limits_two ( @{ $d_two{$key_one} } ) { if( $limits_two->{'start'} >= $limits_one->{'start +'} && $limits_two->{'start'} <= $limits_one->{'end'} + ) { print "$key_one\t$limits_one->{'start'}\t$ +limits_one->{'end'}" . "\t**\t" . "$key_one\t$limits_two->{'start'}\t$ +limits_two->{'end'}\n"; } } } } } }
        Note, write test cases for this or test otherwise extensively. Below is the current output of my script:
        cpt2jmo@phantom:~/misc$ ./2_hashes_with_arrays.pl $VAR1 = { 'CLS_S3_Contig2722' => [ { 'end' => '390', 'start' => '375' } ], 'CLS_S3_Contig2725' => [ { 'end' => '362', 'start' => '323' }, { 'end' => '480', 'start' => '455' } ], 'CLS_S3_Contig2729' => [ { 'end' => '130', 'start' => '119' }, { 'end' => '220', 'start' => '163' } ], 'CLS_S3_Contig2731' => [ { 'end' => '843', 'start' => '824' }, { 'end' => '1201', 'start' => '1150' } ], 'CLS_S3_Contig2728' => [ { 'end' => '144', 'start' => '117' }, { 'end' => '472', 'start' => '437' } ], 'CLS_S3_Contig2721' => [ { 'end' => '168', 'start' => '139' } ], 'CLS_S3_Contig2735' => [ { 'end' => '636', 'start' => '571' }, { 'end' => '710', 'start' => '677' }, { 'end' => '810', 'start' => '775' } ], 'CLS_S3_Contig2730' => [ { 'end' => '202', 'start' => '181' }, { 'end' => '384', 'start' => '361' } ] }; $VAR2 = { 'CLS_S3_Contig2721' => [ { 'end' => '169', 'start' => '142' } ], 'CLS_S3_Contig6525' => [ { 'end' => '514', 'start' => '509' }, { 'end' => '502', 'start' => '493' }, { 'end' => '508', 'start' => '503' } ], 'CLS_S3_Contig2730' => [ { 'end' => '389', 'start' => '365' } ], 'CLS_S3_Contig4978' => [ { 'end' => '274', 'start' => '271' }, { 'end' => '388', 'start' => '385' } ], 'CLS_S3_Contig2977' => [ { 'end' => '376', 'start' => '365' }, { 'end' => '82', 'start' => '77' }, { 'end' => '90', 'start' => '83' } ] }; Genename(file1) start end ** Genename(file2) + start end CLS_S3_Contig2721 139 168 ** CLS_S3_Contig2721 + 142 169 CLS_S3_Contig2730 361 384 ** CLS_S3_Contig2730 + 365 389
        --
        seek $her, $from, $everywhere if exists $true{love};
Re: Qurey through array of array
by rovf (Priest) on Jan 22, 2009 at 13:08 UTC
    Obviously my code does not work

    I find that your code doesn't provide much safety net, so you have a hard time seeing where it fails. for instance, you split on tabs. Can you be sure, that the fields will really be separated by tab characters always (and not sometimes by spaces)? Are you sure that a line can't have embedded spaces? And if your line does not satisfy the regexp pattern, is it OK to silently ignore that line, or should there be printed an error message?

    Further, your program contains a few variables (for instance, $from_tg) which are never initialized, so I guess you didn't provide the whole code.

    And, are you really using strict and warnings in your original code?

    -- 
    Ronald Fischer <ynnor@mm.st>
      Hi Ronald,

      Thanks for the comments. I am sure the the fields are tab separated and there will not be any white line in between the lines. The main problem is to query through the first array of array (INPUT1) and find the items that meet the conditions set. Thank you again.

      Pedro

Re: Qurey through array of array
by brsaravan (Scribe) on Jan 22, 2009 at 13:07 UTC
    If your file size is small then you can use this simple approach.
    open(INPUT1, "file1"); my ($havar1, $havar2i, $Genename); while (<INPUT1>) { my $startvar; if ($_ =~ /([^-]+)-(.*)/) { $Genename = $1; my @sparr = split('_', $2); open(INPUT2, "file2"); while(<INPUT2>) { if ($_ =~ /([^-]+)-(.*)/) { my @sparr1 = split('_', $2); if (($Genename eq $1) && ($sparr[0] < +$sparr1[0]) && ($sparr[1] > $sparr1[0])) { print "$Genename\t$sparr[0]\t$ +sparr[1]\t$1\t$sparr1[0]\t$sparr1[1]\n"; } } } close(INPUT2); } } close(INPUT1);
    ----Otherwise hash will be the better option -------
    my ($havar1, $havar2); open(INPUT1, "file1"); open(INPUT2, "file2"); while (<INPUT1>) { map{push(@{$havar1->{$1}}, $_)} split('_',$2) if ($_ =~ /([^-] ++)-(.*)/); } while (<INPUT2>) { map{push(@{$havar2->{$1}}, $_)} split('_',$2) if ($_ =~ /([^-] ++)-(.*)/); } close(INPUT1); close(INPUT2); foreach my $key (keys %{$havar1}) { if (exists $havar2->{$key}) { my $cnt = @{$havar1->{$key}}; my $cnt1 = @{$havar2->{$key}}; for(my $i=0; $i < $cnt; $i=$i+2) { for (my $j=0; $j < $cnt1; $j=$j+2) { if (($havar1->{$key}->[$i] < $havar2-> +{$key}->[$j]) && ($havar1->{$key}->[$i+1] > $havar2->{$key}->[$j])) { print "$key\t$havar1->{$key}-> +[$i]\t$havar1->{$key}->[$i+1]\t$key\t$havar2->{$key}->[$j]\t$havar2-> +{$key}->[$j+1]\n"; } } } } }
      Hi brsaravan,

      Did you try the code? It seems to me that it does not print anything. I tired both versions.

      Thanks Pedro

        Sesemin,

        I posted after it worked well in my system. Did you save your input in two different files (file1 and file2)?.

        Thanks
        brsaravan
      Hi brsaravan,

      Thank you very much. It works perfect. Sorry I had to fix something to be able to run it.

      A+++

      Pedro