in reply to Re: list of unique strings, also eliminating matching substrings
in thread list of unique strings, also eliminating matching substrings

1. I want to eliminate the duplicates within each file.

2. The strings range from 200 to 400 characters.

3. The complete alphabet is A, G, C, T, N.

Note for point 1. I want to eliminate not just the exact duplicates but also those that are contained within a longer string.

  • Comment on Re^2: list of unique strings, also eliminating matching substrings

Replies are listed 'Best First'.
Re^3: list of unique strings, also eliminating matching substrings
by BrowserUk (Patriarch) on May 21, 2011 at 05:53 UTC

    Presumably you've code the obvious two loops method and it is taking too long. Could you supply a timing for one of your datasets?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      7 hours! starting with about 200,00 strings. Thanks!

      7 hours! for about 200,000 starting strings.

        Try this on that same dataset and let me know how you get on. On my generated test data it takes ~30 minutes for 200,000 strings, but how realistic that dataset is ...

        Use thisScript.pl inFile > outFile:

        #! perl -slw use strict; use Inline C => Config => BUILD_NOISY => 1; use Inline C => <<'END_C', NAME => '_906020', CLEAN_AFTER_BUILD => 0; #include <string.h> int longCmp( SV *needle, SV *haystack, SV *offset ) { STRLEN ln, lh, o = SvIV( offset ); char *n = SvPV( needle, ln ), *h = SvPV( haystack, lh ); char *nl = n + ln - 1; int diff = lh - ln; int flag = 0, i; h += o; lh -= o; diff -= o; if( diff <= 0 ) return 0; for( i = 0; i < diff; ++i ) { if( ! h[ i + ln - 1 ] ) { i += ln; continue; } if( h[ i ] != *n || h[ i+ ln-1 ] != *nl ) continue; if( strncmp( h+i, n, ln ) ) continue; return i; } return 0; } END_C use Time::HiRes qw[ time ]; sub uniq{ my %x; @x{@_} = (); keys %x } my $start = time; my @uniq = uniq <>; chomp @uniq; @uniq = sort{ length $a <=> length $b } @uniq; my $all = join chr(0), @uniq; my $p = 0; for my $x ( @uniq ) { $p += 1+ length $x; next if longCmp( $x, $all, $p ); print $x; } printf STDERR "Took %.3f\n", time() - $start; __END__

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.