in reply to Re^9: list of unique strings, also eliminating matching substrings
in thread list of unique strings, also eliminating matching substrings

hmm ... actually I didn't want to invest time coding for this xy-nonsense.

But maybe it's easier to show code instead of explaining the theory

(UPDATE-06-03 13:37 GMT: fixed bug in initialization and missing uniqifying)

--- Init ... ... completed: 2000 snippets from 200 to 400 of 6000 long DNA
Check_all       took 1.206  Filtered: 287   Ratio: 6.96864111498258
Check_longest   took 0.436  Filtered: 287   Ratio: 6.96864111498258
Faster: 2.76641019482413

--- Init ... ... completed: 2000 snippets from 200 to 400 of 60000 long DNA
Check_all       took 2.990  Filtered: 924   Ratio: 2.16450216450216
Check_longest   took 1.908  Filtered: 924   Ratio: 2.16450216450216
Faster: 1.56725918185693

--- Init ... ... completed: 2000 snippets from 200 to 400 of 600000 long DNA
Check_all       took 4.462  Filtered: 1785  Ratio: 1.12044817927171
Check_longest   took 4.058  Filtered: 1785  Ratio: 1.12044817927171
Faster: 1.09963952407683

--- Init ... ... completed: 2000 snippets from 200 to 400 of 6000000 long DNA
Check_all       took 4.601  Filtered: 1977  Ratio: 1.01163378856854
Check_longest   took 4.651  Filtered: 1977  Ratio: 1.01163378856854
Faster: 0.989378646160973
#! perl -slw
use strict;
use Time::HiRes qw[ time ];
use Data::Dump qw[ pp ];

use constant DEBUG => 0;
$|=1;

my (@sequences,@uniq);

for my $magnitude (0..3) {
    my $factor = 10**$magnitude;
    init_sequences(6000*$factor,2000,200,400);
    #init_sequences(6000,20*$factor,200,400);
    pp \@sequences if DEBUG;

    my ($res_all,$time_all) = check_all();
    pp $res_all if DEBUG;

    my ($res_longest,$time_longest) = check_longest();
    pp $res_longest if DEBUG;

    print STDERR "Faster: ", $time_all / $time_longest;
}

# ---------
# subs
# ---------

sub uniq { my %x; @x{@_} = (); keys %x }

sub check_all {
    my $start = time;
    my @uniq  = uniq(@sequences);
    my @result;
    chomp @uniq;
    @uniq = sort { length $a <=> length $b } @uniq;
    my $all = join chr(0), @uniq;
    my $p   = 0;
    for my $x ( @uniq ) {
        $p += 1 + length $x;
        next if 1 + index $all, $x, $p;    ## Corrected per LanX below.
        push @result, $x;
    }
    my $runtime = time() - $start;
    printf STDERR "Check_all \t took %.3f\t", $runtime;
    print STDERR "Filtered: ", scalar @result, " Ratio: ", @sequences / @result;
    return \@result, $runtime;
}

sub check_longest {
    my $start = time;
    my @uniq  = uniq(@sequences);
    chomp @uniq;
    @uniq = sort { length $b <=> length $a } @uniq;
    my $longest = shift @uniq;
    for my $x ( @uniq ) {
        next if 1 + index $longest, $x;
        $longest .= "\n" . $x;
    }
    my $runtime = time() - $start;
    printf STDERR "Check_longest \t took %.3f\t", $runtime;
    my @result = split "\n", $longest;
    print STDERR "Filtered: ", scalar @result, " Ratio: ", @sequences / @result;
    return \@result, $runtime;
}

sub init_sequences {
    my $length_dna = shift;
    my $num        = shift;
    my $min        = shift;
    my $max        = shift;
    print STDERR "\n--- Init ...";
    my $dna;
    @sequences = ();
    for (1 .. $length_dna) {
        $dna .= (qw/A C G T N/)[int(rand 5)];
    }
    print length $dna, ": ", $dna if DEBUG;
    for (1 .. $num) {
        my $length = $min + int( rand( $max - $min + 1 ) );
        my $offset = int( rand( $length_dna - $length + 1 ) );
        # pp $length, $offset;
        push @sequences, substr $dna, $offset, $length;
    }
    print STDERR " ... completed: $num snippets from $min to $max of $length_dna long DNA";
}
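The $p bookkeeping in check_all is the subtle part: because @uniq is sorted shortest-first and joined with chr(0), starting each search just past the element's own slot means any hit must lie inside a later (longer, distinct) string. A stripped-down illustration of that trick, on toy data rather than the generated DNA:

```perl
use strict;
use warnings;

# Assumes duplicates were already removed (uniq), as in the benchmark code.
my @uniq = sort { length $a <=> length $b } qw( TAC CCC GAT ATTAC GATTACA );
my $all  = join chr(0), @uniq;       # "TAC\0CCC\0GAT\0ATTAC\0GATTACA"

my @result;
my $p = 0;
for my $x (@uniq) {
    $p += 1 + length $x;             # step past $x itself and its separator
    next if 1 + index $all, $x, $p;  # found later => substring of a longer string
    push @result, $x;
}
print "@result\n";                   # CCC GATTACA
```

TAC, GAT and ATTAC are dropped because each reoccurs inside GATTACA (or ATTAC) somewhere after its own slot; CCC and GATTACA survive.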

Cheers Rolf

Re^11: list of unique strings, also eliminating matching substrings
by BrowserUk (Patriarch) on Jun 03, 2011 at 14:38 UTC
    hmm ... actually I didn't want to invest time coding for this xy-nonsense.

    Sorry, but no one forced you to. You asked questions, I did my best to answer them. That's all.

    FWIW: Converted to a form that allows it to be used in a realistic and repeatable scenario:

    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];

    $|++;

    sub uniq { my %x; @x{@_} = (); keys %x }

    my $start = time;
    my @uniq  = uniq <>;
    chomp @uniq;
    @uniq = sort { length $b <=> length $a } @uniq;

    my $longest = shift @uniq;
    for my $x ( @uniq ) {
        next if 1 + index $longest, $x;
        print $x;
        $longest .= "\n" . $x;
    }
    printf STDERR "Took %.3f\n", time() - $start;

    Whilst you're ~10% quicker on low numbers of strings:

    c:\test>906020 906020.10e3 > 906020.filtered
    Took 48.854

    c:\test>wc -l 906020.filtered
    5000 906020.filtered

    c:\test>906020-lanx 906020.10e3 > lanx.filtered
    Took 43.122

    c:\test>wc -l lanx.filtered
    4999 lanx.filtered

    c:\test>906020 906020.10e3 > 906020.filtered    (inline version)
    Took 21.744

    c:\test>wc -l 906020.filtered
    5000 906020.filtered

    As your own timings show, as the number of strings increases, the cost of constantly reallocating your accumulator string in order to append the next one starts to dominate. I suspect that by the time you get to the OP's 200,000 strings you're going to be considerably slower. (You also have an out-by-one error somewhere, but that is probably easily fixed.)


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      ATM I can't see any influence of string allocation.

      It's simple: your version's performance is proportional to the input size, mine is proportional to the output size.

      In rare cases where there are almost no strings to be excluded - i.e. the output is nearly as large as the input - my version can be slightly slower.

      In other words, if only 10% of all sequences remain after filtering, my algo is about 10 times faster.
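      The reason the accumulator version scales with the output is that each candidate is matched only against the strings already kept. A minimal sketch of that approach on toy data (not the benchmark input):

```perl
use strict;
use warnings;

# Toy input: most strings are substrings of the longest one.
my @seqs = sort { length $b <=> length $a } qw( GATTACA ATTAC TAC CCC GAT );

my $kept = shift @seqs;              # the longest string seeds the accumulator
for my $x (@seqs) {
    next if 1 + index $kept, $x;     # skip $x if it occurs in anything kept so far
    $kept .= "\n" . $x;              # otherwise it survives: append it
}
my @filtered = split "\n", $kept;
print "@filtered\n";                 # GATTACA CCC
```

      ATTAC, TAC and GAT are all found inside GATTACA and never enlarge the accumulator; only CCC is appended, so the searched text stays close to the size of the final output.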

      BTW: using "\n" instead of chr(0) was a stupid idea; the extra byte is expensive.

      Cheers Rolf

        ATM I can't see any influence of string allocation.

        Try it with 200,000 strings where half are to be excluded, then you'll see the effects of allocating and copying 300 bytes; then allocating and copying 600 bytes & freeing 600; then allocating and copying 900 bytes & freeing 900; ... 99,995 allocs/copies/frees omitted ...; then allocating and copying 29,999,700 bytes & freeing 29.9997MB; then allocating and copying 30,000,000 bytes & freeing 30MB.

        Each time you do $longest .= "\n" . $x; Perl has to allocate a new chunk of memory big enough to accommodate $longest plus $x, then copy both into the newly allocated space, then free the originals. And as each freed allocation is not big enough to accommodate the next iteration of the append, each new allocation (once you get past trivial amounts) requires Perl to go to the OS for a new chunk of virtual memory. And that gets very costly.
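        One way to sidestep the repeated reallocation described above (a sketch of an alternative, not code from this thread) is to keep the survivors in an array and run index against each kept string, so nothing large is ever copied on an append:

```perl
use strict;
use warnings;

my @seqs = sort { length $b <=> length $a } qw( GATTACA ATTAC TAC CCC GAT );

# Survivors live in an array; appending is a cheap push,
# not a reallocation of one ever-growing string.
my @kept = ( shift @seqs );
for my $x (@seqs) {
    next if grep { 1 + index $_, $x } @kept;   # $x is inside some keeper
    push @kept, $x;
}
print "@kept\n";    # GATTACA CCC
```

        Whether this wins depends on the data: the per-keeper index calls add matching overhead that the single joined-string search avoids, so it only pays off when appends (i.e. surviving strings) dominate the runtime.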

