Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Re: list of unique strings, also eliminating matching substrings

by jaredor (Priest)
on May 21, 2011 at 07:58 UTC ( [id://906052]=note: print w/replies, xml ) Need Help??


in reply to list of unique strings, also eliminating matching substrings

Here's something that uses some of the best ideas of wind (proceed by decreasing length) and AnomalousMonk (try using index as a first pass at matching).

#!/usr/bin/env perl use strict; use warnings; my @strings = qw(AGCT AGGT GG AGCT); my %uniques; $uniques{shift @strings}++ while @strings; my @slots; while (my $i = each %uniques) { push @{$slots[length $i]}, $i; delete $uniques{i}; } my $master = join (':', @{pop @slots}); while (@slots) { my @nomatch = grep {index ($master, $_) < 0} @{pop @slots or []}; $master .= ':' . join (':', @nomatch) if @nomatch; } # answer $master =~ s/:/\n/g; print $master;

BrowserUK's questions are excellent. They affected my comment by reminding me to set expectations: My suggested solution will require a machine with enough memory to hold the entire data set. However, I've tried to keep the memory usage not too much more than that.

If I were more of a regexp whiz, I'd try to come up with some way of reducing matching up to a ':' boundary, but I'm not. Besides, I'm a recovering FORTRAN programmer, so the index command is the programming equivalent of comfort food for me.

I used a slightly more robust test set than what is listed here, but I emphasize "slightly". YMMV on real data with more corner cases....

Edit: "across" replaced by "up to" in the penultimate paragraph above. If you want to match a 399 character string within a 400 character string, you only need to check matches starting with the first two characters of the 400 character string, but the master string concatenation of all reference strings defeats any such understanding index may have. The hope is that one index on a long string is faster than N index calls on smaller strings (but I'm too lazy to check this today :-) It is very tempting to try to compile the master string into a savvy regexp (with the "o" flag) anew with each iteration of the last while loop and I'd be interested in seeing any such solution.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://906052]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others exploiting the Monastery: (8)
As of 2024-03-28 12:08 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found