poegi has asked for the wisdom of the Perl Monks concerning the following question:

Good evening everyone,

I am very very new to Perl.

My challenge is: Two similar datasets (thousands of rows & 3 columns in each of them):

T2 23 43 T2 45 78 Tn 66 124 T2 34 lit T3 20 mir Tn 100 net2 Tk 1234 low3

I want to: Find #2dataset elements that are in #1dataset range.

How the workflow should go (in my opinion):Element (LetterNumber or LetterLetter) from the first column of #2dataset matches Element from the first column of #1dataset => Element's Number from the second column of #2dataset is compared to equivalent Element's Ranges (second & third column) of #1dataset => If Number (#2dataset second column) is in any Range => Code from the third column of #2dataset is printed

What I was able to create by myself:

#!/usr/bin/perl use strict; use warnings; use File::Spec; if (scalar(@ARGV) != 2) { print "Usage: perl $0 <range.txt> <code.txt>\n"; exit -1; } my @field; my @fieldd; my @start; my @end; my @numb; my $start; my $end; my $numb; my $fieldd; my $field; my @code; my $code; open(RAN, $ARGV[0]); open(COD, $ARGV[1]); while (<RAN>) { my @field=split(/\t/, $_); push(@eleR, $field[0]); push(@start, $field[1]); push(@end, $field[2]); while (<COD>) { my @fieldd=split(/\t/, $_); push(@eleC, $fieldd[0]); push(@numb, $fieldd[1]); push(@code, $fieldd[2]); foreach $eleC (@eleC) { if (grep (/$eleC/, @eleR)) { foreach $numb (@numb) { if ((@start <= $numb) && ($numb <= @en +d)) {print "$code\n"}}}}} close(RAN); close(COD);
NB: I know this was covered times -> I tried to assemble something by myself and this is the best I could. There is no result & it's incorrect, but I might be going into the right direction. I can't install any new CPAN or something (do not have root permission). There is definitely too much junk in my code, but I am afraid to delete something (it might be necessary).

I hope that someone could help me with this. I've started Perl from scratch on Monday & it's almost Friday. Can't stand such "no result wasted week".

Looking forward

Replies are listed 'Best First'.
Re: Value in the range struggling (once more)
by kcott (Archbishop) on Mar 16, 2012 at 00:04 UTC

    One of the main issues with your code is the nested while loops. With 1,000 rows in each file, that's 1,000,000 passes through the inner loop. See discussion below, With my solution (below), I've made one pass through the range file to populate a reference hash (%range) and one pass through the code file to check the reference hash (via a subroutine): that's 2,000 passes in total (assuming 1,000 rows in each file).

    I see you've checked for the right number of arguments: that's good. You haven't checked whether these are files you can read: less good. Take a look at open: note how I've used lexical (my) variables for the filehandles, the preferred 3-argument form and included some error checking (you can improve the messages).

    Here's my code:

    #!/usr/bin/env perl use strict; use warnings; if (scalar(@ARGV) != 2) { print "Usage: perl $0 <range.txt> <code.txt>\n"; exit -1; } my %range; open my $fh_range, q{<}, $ARGV[0] or die $!; while (<$fh_range>) { my @range_data = split; push @{$range{$range_data[0]}}, [@range_data[1,2]]; } close $fh_range; open my $fh_code, q{<}, $ARGV[1] or die $!; while (<$fh_code>) { check_range([split]); } close $fh_code; sub check_range { my ($key, $num, $dat) = @{+shift}; if (! exists $range{$key}) { print qq{Key [$key] not found.\n}; return; } for my $range_ref (@{$range{$key}}) { my ($min, $max) = @$range_ref; if ($num >= $min && $num <= $max) { print qq{Key[$key]-Num[$num]-Dat[$dat]}, qq{ in range Min[$min]-Max[$max]\n}; return; } } print qq{Key[$key]-Num[$num]-Dat[$dat] out of bounds.\n}; return; }

    Here's the output. Note I've added an extra dummy record to test the out of bounds condition.

    $ pm_range_code.pl pm_range.txt pm_code.txt Key[T2]-Num[34]-Dat[lit] in range Min[23]-Max[43] Key [T3] not found. Key[Tn]-Num[100]-Dat[net2] in range Min[66]-Max[124] Key [Tk] not found. Key[T2]-Num[100]-Dat[dummy] out of bounds.

    -- Ken

      With 1,000 rows in each file, that's 1,000,000 passes through the inner loop.

      Actually, the inner loop will only execute once because after the filehandle reaches end-of-file it will always return undef.

        ++jwkrahn - most astute :-)

        poegi,

        I didn't look deeply at the mechanics of your code. I made an incorrect assumption with respect to my calculations. Nonetheless, I'm sure you grasp the general principle that nested loops can increase the number of operations exponentially. I'll update my post.

        -- Ken

Re: Value in the range struggling (once more)
by jwkrahn (Abbot) on Mar 15, 2012 at 23:05 UTC

    Perhaps this will help:

    #!/usr/bin/perl use warnings; use strict; use Inline::Files; my %ranges; while ( <RANGE> ) { next unless /\S/; my ( $key, $start, $end ) = split; push @{ $ranges{ $key } }, [ $start, $end ]; } while ( <CODE> ) { next unless /\S/; my ( $key, $code ) = split; if ( exists $ranges{ $key } && grep $code >= $_->[ 0 ] && $code <= + $_->[ 1 ], @{ $ranges{ $key } } ) { print; } } __RANGE__ T2 23 43 T2 45 78 Tn 66 124 __CODE__ T2 34 lit T3 20 mir Tn 100 net2 Tk 1234 low3

    Produces the result:

    T2 34 lit Tn 100 net2
Re: Value in the range struggling (once more)
by poegi (Initiate) on Mar 16, 2012 at 00:54 UTC

    Thanks for your responses - it works!

    Thanks for the commentary on my mistakes - I will improve.