in reply to Comparing 2-D co-ordinates

You can avoid the need to sort or even do any comparisons. Just loop through the data and build a string using one character to represent your ranges and another the 'freespace'. Then just count the non-freespace chars and your done.

#! perl -slw use strict; my @data = map{ [ split "','", substr( $_, 1, -2 ) ] } <DATA>; my $mask = ' ' x 10_000; # Make a blank string that's 'big enough' my $max = 0; # we'll trim it back later for( @data ) { #calc the length of the range my $len = $_->[2] - $_->[1] +1; # overwrite the range with 'xx's substr( $mask, $_->[1], $len ) = 'x' x $len; # Remember the highest offset $max = $_->[2] if $_->[2] > $max; } $mask = substr( $mask, 0, $max ); # Trim to length my $coverage = $mask =~ tr[x][x]; # Count the 'x's printf "cover = %.1f%% \n", $coverage / length($mask) *100; __DATA__ 'NM_176827','618','710' 'NM_176827','621','710' 'NM_176827','622','692' 'NM_176827','629','710'

Output

P:\test>279587 cover = 13.0%

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

Replies are listed 'Best First'.
Re: Re: Comparing 2-D co-ordinates
by Anonymous Monk on Jul 31, 2003 at 15:20 UTC
    Im not sure I've understood your full specs. Whenever two matches overlap, do you want to discard the shorter? Even in this case?:
    <----> <-----> <---->
    Here the two shorter matches together give more coverage than the longest. Another structure to represent this data is a list in which each range occurs twice, a lexically sorted list of pairs, (start,end) and (end,start). But before choosing your data structure, you have to know what question you wan to answer. Chris Morris
      Chris,

      Thanks for helping clarify my problem. I want to find the set of discrete hits that gives me the best coverage (if that make sense?).

      In your example above (ah the beauty of acsii graphics) I would like to pick the two shorter examples as combined they give me a higher overall coverage.

      I apologise for any confusion that I am causing - I am having a bad day - too hot and sticky in here!

      A.A.

        Not as neat as my first attempt as this requires a sort, but then, maybe this one actually does what you want:)

        #! perl -slw use strict; my @data = map{ [ split "','", substr( $_, 1, -2 ) ] } <DATA>; my $mask = ' ' x 10_000; my $max = 0; # Sort the data to put longest ranges first (could be more efficient:) @data = sort{ $b->[2] - $b->[1] <=> $a->[2] - $a->[1] } @data; for( @data ) { # Calc range my $len = $_->[2] - $_->[1] +1; # Test to see if this range is already covered? if( substr( $mask, $_->[1], $len ) !~ m[ ] ) { $_ = undef; # if so, delete next; } # Other wise add it to the mask substr( $mask, $_->[1], $len ) = 'x' x $len; # Remeber the highest pos. $max = $_->[2] if $_->[2] > $max; } # Trim the string $mask = substr( $mask, 0, $max ); # Count coverage my $coverage = $mask =~ tr[x][x]; # Calc. percentage printf "cover = %.1f%% \n", $coverage / length($mask) *100; print 'Using:'; # Display contributing ranges having discarded deleted elements. print "@$_" for grep{ $_ } @data; __DATA__ 'NM_176827','618','692' 'NM_176827','621','710' 'NM_176827','622','692' 'NM_176827','629','710'

        Output

        P:\test>279587-3 cover = 13.0% Using: NM_176827 621 710 NM_176827 618 692

        I've also got a version that does this using vec instead of substr, which saves 7/8 of the space, but runs much more slowly.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

        You want the to take the smallest number of your original segments which, combined, cover as much of the range as possible, with no overlaps. This sounds like a solved problem. ie; check some algorythm books/websites

        --
        I'm not belgian but I play one on TV.

Re: Re: Comparing 2-D co-ordinates
by aging acolyte (Pilgrim) on Jul 31, 2003 at 15:19 UTC
    Thanks for the suggestions,

    But both Browser's and fglock suggestions are greedy - ie they grab the maximum range. I want the largest discrete answer. To munge my data a little from above. If the data were:

    'NM_176827','620','710' 'NM_176827','621','710' 'NM_176827','618','692' 'NM_176827','629','710'
    The larget individual result is still number 1 but the range is still 618-710 (both the above return the range of coverage rather than "the" best result.

    Thanks again

    A.A.

      Sorry. I'm still misunderstanding you. Given this sample dataset, what 'best result' do you want?

      The index of the largest range?

      The %coverage by that largest range?

      Something else?

      Ie. What output do you want given that input, and how is it derived?


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller