Qurey through array of array

Replies are listed 'Best First'.
Re: Query through array of array by puudeli (Pilgrim) on Jan 22, 2009 at 10:12 UTC
But you can do it with hashes! Just use two hashes, one for each file. Then you can use the keys from data file 1 to find out whether corresponding key exists in the hash from data file 2. Below is a quick example: #! /usr/bin/perl use strict; use warnings; my @d1 = qw{ CLS_S3_Contig2721-139_168 CLS_S3_Contig2722-375_390 CLS_S3_Contig2725-323_362 CLS_S3_Contig2725-455_480 CLS_S3_Contig2728-117_144 CLS_S3_Contig2728-437_472 CLS_S3_Contig2729-119_130 CLS_S3_Contig2729-163_220 CLS_S3_Contig2730-181_202 CLS_S3_Contig2730-361_384 CLS_S3_Contig2731-824_843 CLS_S3_Contig2731-1150_1201 CLS_S3_Contig2735-571_636 CLS_S3_Contig2735-677_710 CLS_S3_Contig2735-775_810 }; my @d2 = qw{ CLS_S3_Contig2721-142_169 CLS_S3_Contig6525-509_514 CLS_S3_Contig6525-493_502 CLS_S3_Contig6525-503_508 CLS_S3_Contig2977-365_376 CLS_S3_Contig2977-77_82 CLS_S3_Contig2977-83_90 CLS_S3_Contig4978-271_274 CLS_S3_Contig4978-385_388 CLS_S3_Contig2730-365_389 }; my %d_one = (); my %d_two = (); for( @d1 ) { my ($key, $start, $end) = $_ =~ /(.)-(\d+)_(\d+)/; $d_one{$key}{'start'} = $start; $d_one{$key}{'end'} = $end; } for( @d2 ) { my ($key, $start, $end) = $_ =~ /(.)-(\d+)_(\d+)/; $d_two{$key}{'start'} = $start; $d_two{$key}{'end'} = $end; } print "Genename(file1) start end Genename(file2) start end\ +n\n"; foreach my $key_one ( keys %d_one ) { if( $d_two{$key_one} ) { if( $d_two{$key_one}{'start'} >= $d_one{$key_one}{'start'} && $d_two{$key_one}{'start'} <= $d_one{$key_one}{'end'} ) { print "$key_one $d_one{$key_one}{'start'} $d_one{$ +key_one}{'end'}" . " " . "$key_one $d_two{$key_one}{'start'} $d_two{$ +key_one}{'end'}\n"; } } } [download] Note, the limits are inclusive in my code. It produces the following output: `cpt2jmo@phantom:~$ ./2_hashes.pl Genename(file1) start end Genename(file2) start end CLS_S3_Contig2721 139 168 CLS_S3_Contig2721 142 169 CLS_S3_Contig2730 361 384 CLS_S3_Contig2730 365 389` [download] Update*: I removed the readmore -tags, since the code is not that* long. -- `seek $her, $from, $everywhere if exists $true{love};`	[reply] [d/l] [select]
Re^2: Qurey through array of array by sesemin (Beadle) on Jan 22, 2009 at 22:03 UTC
Hi puudeli, Thank you very much for your nice code. I am just wondering if it works if there are redundant keys. As you can see in my input files the gene names may repeat over and over. This will be in jeopardy of key concept of hash in perl? thanks again	[reply]
Re^3: Query through array of array by puudeli (Pilgrim) on Jan 23, 2009 at 06:43 UTC
Yes, that would ruin my approach. In this case I would introduce and array to the hash. It complicates things, but only a little bit :-) Update: Corrected a wrong hash key in the printout. Update2: Removed incorrect `next` statement. Here is the modified version. The algo can get heavy if the data is large, so maybe wiser monks could help a bit here? #! /usr/bin/perl use strict; use warnings; my @d1 = qw{ CLS_S3_Contig2721-139_168 CLS_S3_Contig2722-375_390 CLS_S3_Contig2725-323_362 CLS_S3_Contig2725-455_480 CLS_S3_Contig2728-117_144 CLS_S3_Contig2728-437_472 CLS_S3_Contig2729-119_130 CLS_S3_Contig2729-163_220 CLS_S3_Contig2730-181_202 CLS_S3_Contig2730-361_384 CLS_S3_Contig2731-824_843 CLS_S3_Contig2731-1150_1201 CLS_S3_Contig2735-571_636 CLS_S3_Contig2735-677_710 CLS_S3_Contig2735-775_810 }; my @d2 = qw{ CLS_S3_Contig2721-142_169 CLS_S3_Contig6525-509_514 CLS_S3_Contig6525-493_502 CLS_S3_Contig6525-503_508 CLS_S3_Contig2977-365_376 CLS_S3_Contig2977-77_82 CLS_S3_Contig2977-83_90 CLS_S3_Contig4978-271_274 CLS_S3_Contig4978-385_388 CLS_S3_Contig2730-365_389 }; { my %d_one = (); my %d_two = (); for( @d1 ) { my ($key, $start, $end) = $_ =~ /(.)-(\d+)_(\d+)/; push @{ $d_one{$key} }, { 'start' => $start, 'end' => $end }; } for( @d2 ) { my ($key, $start, $end) = $_ =~ /(.)-(\d+)_(\d+)/; push @{ $d_two{$key} }, { 'start' => $start, 'end' => $end }; + } use Data::Dumper; print Dumper(\%d_one, \%d_two); print "Genename(file1)\t\tstart\tend\t\tGenename(file2)\t\tstart +\tend\n\n"; foreach my $key_one ( keys %d_one ) { if( $d_two{$key_one} ) { # We found a matching key! foreach my $limits_one ( @{ $d_one{$key_one} } ) { foreach my $limits_two ( @{ $d_two{$key_one} } ) { if( $limits_two->{'start'} >= $limits_one->{'start +'} && $limits_two->{'start'} <= $limits_one->{'end'} + ) { print "$key_one\t$limits_one->{'start'}\t$ +limits_one->{'end'}" . "\t\t" . "$key_one\t$limits_two->{'start'}\t$ +limits_two->{'end'}\n"; } } } } } } [download] Note, write test cases for this or test otherwise extensively. Below is the current output of my script: cpt2jmo@phantom:~/misc$ ./2_hashes_with_arrays.pl $VAR1 = { 'CLS_S3_Contig2722' => [ { 'end' => '390', 'start' => '375' } ], 'CLS_S3_Contig2725' => [ { 'end' => '362', 'start' => '323' }, { 'end' => '480', 'start' => '455' } ], 'CLS_S3_Contig2729' => [ { 'end' => '130', 'start' => '119' }, { 'end' => '220', 'start' => '163' } ], 'CLS_S3_Contig2731' => [ { 'end' => '843', 'start' => '824' }, { 'end' => '1201', 'start' => '1150' } ], 'CLS_S3_Contig2728' => [ { 'end' => '144', 'start' => '117' }, { 'end' => '472', 'start' => '437' } ], 'CLS_S3_Contig2721' => [ { 'end' => '168', 'start' => '139' } ], 'CLS_S3_Contig2735' => [ { 'end' => '636', 'start' => '571' }, { 'end' => '710', 'start' => '677' }, { 'end' => '810', 'start' => '775' } ], 'CLS_S3_Contig2730' => [ { 'end' => '202', 'start' => '181' }, { 'end' => '384', 'start' => '361' } ] }; $VAR2 = { 'CLS_S3_Contig2721' => [ { 'end' => '169', 'start' => '142' } ], 'CLS_S3_Contig6525' => [ { 'end' => '514', 'start' => '509' }, { 'end' => '502', 'start' => '493' }, { 'end' => '508', 'start' => '503' } ], 'CLS_S3_Contig2730' => [ { 'end' => '389', 'start' => '365' } ], 'CLS_S3_Contig4978' => [ { 'end' => '274', 'start' => '271' }, { 'end' => '388', 'start' => '385' } ], 'CLS_S3_Contig2977' => [ { 'end' => '376', 'start' => '365' }, { 'end' => '82', 'start' => '77' }, { 'end' => '90', 'start' => '83' } ] }; Genename(file1) start end Genename(file2) + start end CLS_S3_Contig2721 139 168 CLS_S3_Contig2721 + 142 169 CLS_S3_Contig2730 361 384 ** CLS_S3_Contig2730 + 365 389 [download] -- `seek $her, $from, $everywhere if exists $true{love};`	[reply] [d/l] [select]
Re^4: Qurey through array of array by sesemin (Beadle) on Jan 23, 2009 at 07:42 UTC
Re^5: Query through array of array by puudeli (Pilgrim) on Jan 23, 2009 at 08:15 UTC
Re: Qurey through array of array by rovf (Priest) on Jan 22, 2009 at 13:08 UTC
Obviously my code does not work I find that your code doesn't provide much safety net, so you have a hard time seeing where it fails. for instance, you split on tabs. Can you be sure, that the fields will really be separated by tab characters always (and not sometimes by spaces)? Are you sure that a line can't have embedded spaces? And if your line does not satisfy the regexp pattern, is it OK to silently ignore that line, or should there be printed an error message? Further, your program contains a few variables (for instance, $from_tg) which are never initialized, so I guess you didn't provide the whole code. And, are you really using strict and warnings in your original code? -- Ronald Fischer <ynnor@mm.st>	[reply]
Re^2: Qurey through array of array by sesemin (Beadle) on Jan 22, 2009 at 21:49 UTC
Hi Ronald, Thanks for the comments. I am sure the the fields are tab separated and there will not be any white line in between the lines. The main problem is to query through the first array of array (INPUT1) and find the items that meet the conditions set. Thank you again. Pedro	[reply]
Re: Qurey through array of array by brsaravan (Scribe) on Jan 22, 2009 at 13:07 UTC
If your file size is small then you can use this simple approach. `open(INPUT1, "file1"); my ($havar1, $havar2i, $Genename); while (<INPUT1>) { my $startvar; if ($_ =~ /([^-]+)-(.)/) { $Genename = $1; my @sparr = split('_', $2); open(INPUT2, "file2"); while(<INPUT2>) { if ($_ =~ /([^-]+)-(.)/) { my @sparr1 = split('_', $2); if (($Genename eq $1) && ($sparr[0] < +$sparr1[0]) && ($sparr[1] > $sparr1[0])) { print "$Genename\t$sparr[0]\t$ +sparr[1]\t$1\t$sparr1[0]\t$sparr1[1]\n"; } } } close(INPUT2); } } close(INPUT1);` [download] ----Otherwise hash will be the better option ------- my ($havar1, $havar2); open(INPUT1, "file1"); open(INPUT2, "file2"); while (<INPUT1>) { map{push(@{$havar1->{$1}}, $_)} split('_',$2) if ($_ =~ /([^-] ++)-(.)/); } while (<INPUT2>) { map{push(@{$havar2->{$1}}, $_)} split('_',$2) if ($_ =~ /([^-] ++)-(.)/); } close(INPUT1); close(INPUT2); foreach my $key (keys %{$havar1}) { if (exists $havar2->{$key}) { my $cnt = @{$havar1->{$key}}; my $cnt1 = @{$havar2->{$key}}; for(my $i=0; $i < $cnt; $i=$i+2) { for (my $j=0; $j < $cnt1; $j=$j+2) { if (($havar1->{$key}->[$i] < $havar2-> +{$key}->[$j]) && ($havar1->{$key}->[$i+1] > $havar2->{$key}->[$j])) { print "$key\t$havar1->{$key}-> +[$i]\t$havar1->{$key}->[$i+1]\t$key\t$havar2->{$key}->[$j]\t$havar2-> +{$key}->[$j+1]\n"; } } } } } [download]	[reply] [d/l] [select]
Re^2: Qurey through array of array by sesemin (Beadle) on Jan 22, 2009 at 22:36 UTC
Hi brsaravan, Did you try the code? It seems to me that it does not print anything. I tired both versions. Thanks Pedro	[reply]
Re^3: Qurey through array of array by brsaravan (Scribe) on Jan 23, 2009 at 05:25 UTC
Sesemin, I posted after it worked well in my system. Did you save your input in two different files (file1 and file2)?. Thanks brsaravan	[reply]
Re^2: Qurey through array of array by sesemin (Beadle) on Jan 23, 2009 at 06:49 UTC
Hi brsaravan, Thank you very much. It works perfect. Sorry I had to fix something to be able to run it. A+++ Pedro	[reply]

But you can do it with hashes! Just use two hashes, one for each file. Then you can use the keys from data file 1 to find out whether corresponding key exists in the hash from data file 2. Below is a quick example:

#! /usr/bin/perl

use strict;
use warnings;

my @d1 = qw{
        CLS_S3_Contig2721-139_168
        CLS_S3_Contig2722-375_390
        CLS_S3_Contig2725-323_362
        CLS_S3_Contig2725-455_480
        CLS_S3_Contig2728-117_144
        CLS_S3_Contig2728-437_472
        CLS_S3_Contig2729-119_130
        CLS_S3_Contig2729-163_220
        CLS_S3_Contig2730-181_202
        CLS_S3_Contig2730-361_384
        CLS_S3_Contig2731-824_843
        CLS_S3_Contig2731-1150_1201
        CLS_S3_Contig2735-571_636
        CLS_S3_Contig2735-677_710
        CLS_S3_Contig2735-775_810
    };

my @d2 = qw{
        CLS_S3_Contig2721-142_169
        CLS_S3_Contig6525-509_514
        CLS_S3_Contig6525-493_502
        CLS_S3_Contig6525-503_508
        CLS_S3_Contig2977-365_376
        CLS_S3_Contig2977-77_82
        CLS_S3_Contig2977-83_90
        CLS_S3_Contig4978-271_274
        CLS_S3_Contig4978-385_388
        CLS_S3_Contig2730-365_389
    };

my %d_one = ();
my %d_two = ();

for( @d1 ) {
    my ($key, $start, $end) = $_ =~ /(.*)-(\d+)_(\d+)/;
    $d_one{$key}{'start'}   = $start;
    $d_one{$key}{'end'}     = $end;
}

for( @d2 ) {
    my ($key, $start, $end) = $_ =~ /(.*)-(\d+)_(\d+)/;
    $d_two{$key}{'start'}   = $start;
    $d_two{$key}{'end'}     = $end;
}

print "Genename(file1) start  end ** Genename(file2)     start    end\
+n\n";

foreach my $key_one ( keys %d_one ) {
    if( $d_two{$key_one} ) {
        if( $d_two{$key_one}{'start'} >= $d_one{$key_one}{'start'} &&
            $d_two{$key_one}{'start'} <= $d_one{$key_one}{'end'} ) {
                    print "$key_one $d_one{$key_one}{'start'} $d_one{$
+key_one}{'end'}"
                        . "** "
                        . "$key_one $d_two{$key_one}{'start'} $d_two{$
+key_one}{'end'}\n";
                        
            }
    }
}
[download]

cpt2jmo@phantom:~$ ./2_hashes.pl
Genename(file1) start  end ** Genename(file2)     start    end

CLS_S3_Contig2721 139 168** CLS_S3_Contig2721 142 169
CLS_S3_Contig2730 361 384** CLS_S3_Contig2730 365 389
[download]

Update: I removed the readmore -tags, since the code is not that long.

--
seek $her, $from, $everywhere if exists $true{love};

[reply]
[d/l]
[select]

Thank you very much for your nice code. I am just wondering if it works if there are redundant keys. As you can see in my input files the gene names may repeat over and over. This will be in jeopardy of key concept of hash in perl?

thanks again

[reply]

Yes, that would ruin my approach. In this case I would introduce and array to the hash. It complicates things, but only a little bit :-)

Update: Corrected a wrong hash key in the printout.

Update2: Removed incorrect next statement.

Here is the modified version. The algo can get heavy if the data is large, so maybe wiser monks could help a bit here?

#! /usr/bin/perl

use strict;
use warnings;

my @d1 = qw{
        CLS_S3_Contig2721-139_168
        CLS_S3_Contig2722-375_390
        CLS_S3_Contig2725-323_362
        CLS_S3_Contig2725-455_480
        CLS_S3_Contig2728-117_144
        CLS_S3_Contig2728-437_472
        CLS_S3_Contig2729-119_130
        CLS_S3_Contig2729-163_220
        CLS_S3_Contig2730-181_202
        CLS_S3_Contig2730-361_384
        CLS_S3_Contig2731-824_843
        CLS_S3_Contig2731-1150_1201
        CLS_S3_Contig2735-571_636
        CLS_S3_Contig2735-677_710
        CLS_S3_Contig2735-775_810
    };

my @d2 = qw{
        CLS_S3_Contig2721-142_169
        CLS_S3_Contig6525-509_514
        CLS_S3_Contig6525-493_502
        CLS_S3_Contig6525-503_508
        CLS_S3_Contig2977-365_376
        CLS_S3_Contig2977-77_82
        CLS_S3_Contig2977-83_90
        CLS_S3_Contig4978-271_274
        CLS_S3_Contig4978-385_388
        CLS_S3_Contig2730-365_389
    };

{
    my %d_one = ();
    my %d_two = ();

    for( @d1 ) {
        my ($key, $start, $end) = $_ =~ /(.*)-(\d+)_(\d+)/;
        push @{ $d_one{$key} }, { 'start' => $start, 'end' => $end };
    }

    for( @d2 ) {
        my ($key, $start, $end) = $_ =~ /(.*)-(\d+)_(\d+)/;
        push @{ $d_two{$key} }, { 'start' => $start, 'end' => $end }; 
+   
    }

    use Data::Dumper;
    
    print Dumper(\%d_one, \%d_two);

    print "Genename(file1)\t\tstart\tend\t**\tGenename(file2)\t\tstart
+\tend\n\n";

    foreach my $key_one ( keys %d_one ) {
        
        if( $d_two{$key_one} ) { # We found a matching key!
            
            foreach my $limits_one ( @{ $d_one{$key_one} } ) {
                
                foreach my $limits_two ( @{ $d_two{$key_one} } ) {
                    
                    if( $limits_two->{'start'} >= $limits_one->{'start
+'} &&
                        $limits_two->{'start'} <= $limits_one->{'end'}
+ ) {
                            
                            print "$key_one\t$limits_one->{'start'}\t$
+limits_one->{'end'}"
                                . "\t**\t"
                                . "$key_one\t$limits_two->{'start'}\t$
+limits_two->{'end'}\n";                                
                        }
                }
            }            
        }
    }
}
[download]

cpt2jmo@phantom:~/misc$ ./2_hashes_with_arrays.pl
$VAR1 = {
          'CLS_S3_Contig2722' => [
                                   {
                                     'end' => '390',
                                     'start' => '375'
                                   }
                                 ],
          'CLS_S3_Contig2725' => [
                                   {
                                     'end' => '362',
                                     'start' => '323'
                                   },
                                   {
                                     'end' => '480',
                                     'start' => '455'
                                   }
                                 ],
          'CLS_S3_Contig2729' => [
                                   {
                                     'end' => '130',
                                     'start' => '119'
                                   },
                                   {
                                     'end' => '220',
                                     'start' => '163'
                                   }
                                 ],
          'CLS_S3_Contig2731' => [
                                   {
                                     'end' => '843',
                                     'start' => '824'
                                   },
                                   {
                                     'end' => '1201',
                                     'start' => '1150'
                                   }
                                 ],
          'CLS_S3_Contig2728' => [
                                   {
                                     'end' => '144',
                                     'start' => '117'
                                   },
                                   {
                                     'end' => '472',
                                     'start' => '437'
                                   }
                                 ],
          'CLS_S3_Contig2721' => [
                                   {
                                     'end' => '168',
                                     'start' => '139'
                                   }
                                 ],
          'CLS_S3_Contig2735' => [
                                   {
                                     'end' => '636',
                                     'start' => '571'
                                   },
                                   {
                                     'end' => '710',
                                     'start' => '677'
                                   },
                                   {
                                     'end' => '810',
                                     'start' => '775'
                                   }
                                 ],
          'CLS_S3_Contig2730' => [
                                   {
                                     'end' => '202',
                                     'start' => '181'
                                   },
                                   {
                                     'end' => '384',
                                     'start' => '361'
                                   }
                                 ]
        };
$VAR2 = {
          'CLS_S3_Contig2721' => [
                                   {
                                     'end' => '169',
                                     'start' => '142'
                                   }
                                 ],
          'CLS_S3_Contig6525' => [
                                   {
                                     'end' => '514',
                                     'start' => '509'
                                   },
                                   {
                                     'end' => '502',
                                     'start' => '493'
                                   },
                                   {
                                     'end' => '508',
                                     'start' => '503'
                                   }
                                 ],
          'CLS_S3_Contig2730' => [
                                   {
                                     'end' => '389',
                                     'start' => '365'
                                   }
                                 ],
          'CLS_S3_Contig4978' => [
                                   {
                                     'end' => '274',
                                     'start' => '271'
                                   },
                                   {
                                     'end' => '388',
                                     'start' => '385'
                                   }
                                 ],
          'CLS_S3_Contig2977' => [
                                   {
                                     'end' => '376',
                                     'start' => '365'
                                   },
                                   {
                                     'end' => '82',
                                     'start' => '77'
                                   },
                                   {
                                     'end' => '90',
                                     'start' => '83'
                                   }
                                 ]
        };
Genename(file1)         start   end     **      Genename(file2)       
+  start   end

CLS_S3_Contig2721       139     168     **      CLS_S3_Contig2721     
+  142     169
CLS_S3_Contig2730       361     384     **      CLS_S3_Contig2730     
+  365     389
[download]

--
seek $her, $from, $everywhere if exists $true{love};

[reply]
[d/l]
[select]

Obviously my code does not work

I find that your code doesn't provide much safety net, so you have a hard time seeing where it fails. for instance, you split on tabs. Can you be sure, that the fields will really be separated by tab characters always (and not sometimes by spaces)? Are you sure that a line can't have embedded spaces? And if your line does not satisfy the regexp pattern, is it OK to silently ignore that line, or should there be printed an error message?

Further, your program contains a few variables (for instance, $from_tg) which are never initialized, so I guess you didn't provide the whole code.

And, are you really using strict and warnings in your original code?

--
Ronald Fischer <ynnor@mm.st>

[reply]

Thanks for the comments. I am sure the the fields are tab separated and there will not be any white line in between the lines. The main problem is to query through the first array of array (INPUT1) and find the items that meet the conditions set. Thank you again.

Pedro

[reply]

open(INPUT1, "file1");

my ($havar1, $havar2i, $Genename);

while (<INPUT1>) {
        my $startvar;
        if ($_ =~ /([^-]+)-(.*)/) {
                $Genename = $1;
                my @sparr = split('_', $2);
                open(INPUT2, "file2");
                while(<INPUT2>) {
                        if ($_ =~ /([^-]+)-(.*)/) {
                                my @sparr1 = split('_', $2);
                                if (($Genename eq $1) && ($sparr[0] < 
+$sparr1[0]) && ($sparr[1] > $sparr1[0])) {
                                        print "$Genename\t$sparr[0]\t$
+sparr[1]\t$1\t$sparr1[0]\t$sparr1[1]\n";
                                }
                        }
                }
                close(INPUT2);
        }
}
close(INPUT1);
[download]

my ($havar1, $havar2);
open(INPUT1, "file1");
open(INPUT2, "file2");

while (<INPUT1>) {
        map{push(@{$havar1->{$1}}, $_)} split('_',$2) if ($_ =~ /([^-]
++)-(.*)/);
}
while (<INPUT2>) {
        map{push(@{$havar2->{$1}}, $_)} split('_',$2) if ($_ =~ /([^-]
++)-(.*)/);
}
close(INPUT1);
close(INPUT2);

foreach my $key (keys %{$havar1}) {
        if (exists $havar2->{$key}) {
                my $cnt = @{$havar1->{$key}};
                my $cnt1 =  @{$havar2->{$key}};
                for(my $i=0; $i < $cnt; $i=$i+2) {
                        for (my $j=0; $j < $cnt1; $j=$j+2) {
                                if (($havar1->{$key}->[$i] < $havar2->
+{$key}->[$j]) && ($havar1->{$key}->[$i+1] > $havar2->{$key}->[$j])) {
                                        print "$key\t$havar1->{$key}->
+[$i]\t$havar1->{$key}->[$i+1]\t$key\t$havar2->{$key}->[$j]\t$havar2->
+{$key}->[$j+1]\n";
                                }
                        }
                }
        }
}
[download]

[reply]
[d/l]
[select]

Did you try the code? It seems to me that it does not print anything. I tired both versions.

Thanks Pedro

[reply]

Thank you very much. It works perfect. Sorry I had to fix something to be able to run it.

A+++

Pedro

[reply]