Find number of short words in long word

sedm1000 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Find number of short words in long word by ikegami (Patriarch) on Jul 14, 2009 at 20:19 UTC
'AGTGTAT' would generate >16000 columns for every combination), Much much less if you don't list all the possible combinations. To find the subsequences: `sub scan_seq { local our %sub_seqs; $_[0] =~ /(.{7})(?{ ++$sub_seqs{$1} })(?!)/; return \%sub_seqs; }` [download] All together: use strict; use warnings; sub scan_seq { my $len = shift; local our %sub_seqs; use re 'eval'; $_[0] =~ /(.{$len})(?{ ++$sub_seqs{$1} })(?!)/; return \%sub_seqs; } my @seqs = qw( ATGCTGTACTG ATGATGATCTG ATGCGTATGCA ATGCTAGACTG ); my $sub_seq_len = 7; my $max_len = length('Total'); my %cols; my %grid; for my $seq (@seqs) { $max_len = length($seq) if length($seq) > $max_len; my $row = scan_seq($sub_seq_len, $seq); $grid{$seq} = $row; $cols{$_} += $row->{$_} for keys %$row; } my @cols = sort keys %cols; my $col_sep = ' '; my $fmt_h = "%-${max_len}s" . ("$col_sep%-${sub_seq_len}s" x @cols) . +"\n"; my $fmt_b = "%-${max_len}s" . ("$col_sep%${sub_seq_len}d" x @cols) . +"\n"; printf($fmt_h, '', @cols); for my $seq (@seqs) { my $row = $grid{$seq}; printf($fmt_b, $seq, map $_\|\|0, @$row{@cols}); } printf($fmt_b, 'Total', map $_\|\|0, @cols{@cols}); [download] `ATGATCT ATGATGA ATGCGTA ATGCTAG ATGCTGT CGTATGC CTA +GACT CTGTACT GATGATC GCGTATG GCTAGAC GCTGTAC GTATGCA TAGACTG +TGATCTG TGATGAT TGCGTAT TGCTAGA TGCTGTA TGTACTG ATGCTGTACTG 0 0 0 0 1 0 + 0 1 0 0 0 1 0 0 + 0 0 0 0 1 1 ATGATGATCTG 1 1 0 0 0 0 + 0 0 1 0 0 0 0 0 + 1 1 0 0 0 0 ATGCGTATGCA 0 0 1 0 0 1 + 0 0 0 1 0 0 1 0 + 0 0 1 0 0 0 ATGCTAGACTG 0 0 0 1 0 0 + 1 0 0 0 1 0 0 1 + 0 0 0 1 0 0 Total 1 1 1 1 1 1 + 1 1 1 1 1 1 1 1 + 1 1 1 1 1 1` [download] If you really want to print all 16,384 columns, then replace `my @cols = sort keys %cols;` [download] with `my @cols = glob('{A,C,G,T}' x $sub_seq_len);` [download] It'll still use the minimum memory required. Update: Added complete code. Update: Made subsequence length variable instead of hard-coded. Update: Added change needed to display all columns.	[reply] [d/l] [select]
Re^2: Find number of short words in long word by repellent (Priest) on Jul 15, 2009 at 09:07 UTC
The regex version is nice. This is more explicit: `sub scan_seq { my ($len, $seq) = @_; my %sub_seqs; if ($len > 0) { ++$sub_seqs{ substr($seq, $_, $len) } for 0 .. length($seq) - $len; } return \%sub_seqs; }` [download] Update: Changed `local our` to `my`.	[reply] [d/l] [select]
Re: Find number of short words in long word by ikegami (Patriarch) on Jul 14, 2009 at 21:09 UTC
`qx/echo @{["{$a}" x $k]}/` [download] is better written as `glob("{$a}" x $k);` [download]	[reply] [d/l] [select]
Re^2: Find number of short words in long word by sedm1000 (Initiate) on Jul 14, 2009 at 21:44 UTC
Cool. Thanks, that does the trick. I'm always amazed by how quickly people here can come up with an answer I have spent days working on...	[reply]
Re: Find number of short words in long word by SuicideJunkie (Vicar) on Jul 14, 2009 at 20:20 UTC
The =()= 'operator' should do: `my $count =()= 'abracadabra' =~ /br/g; print "Found $count instances\n" # prints Found 2 instances` [download] Big Update: Based on all the comments below, this might be handy for automatically generating the lookaheads that allow overlapping results based on a search string: `sub countOverlappingMatches { my $stringRef = shift; #Likely to be huge, don't make a copy my $patternRef = shift; my $count =()= $$stringRef =~ /(?=$$patternRef)/g; return $count; } print countOverlappingMatches(\'abrabrabrabra', \'rabra'); # 3` [download] Or extract the one active line that is left in there. `for my $pattern (@listOfPatterns) { my $count =()= $string =~ /(?=$pattern)/g; # Play with $count }` [download]	[reply] [d/l] [select]
Re^2: Find number of short words in long word by ikegami (Patriarch) on Jul 14, 2009 at 20:42 UTC
`my $count =()= 'AAAAA' =~ /AAAA/g; print "Found $count instances\n"` [download] It prints 1, but I believe the OP wants 2. More specifically, from `ATGCTGTACTG`, I believe the OP wants `ACTG: 1 ATGC: 1 CTGT: 1 GCTG: 1 GTAC: 1 TACT: 1 TGCT: 1 TGTA: 1` [download]	[reply] [d/l] [select]
Re^2: Find number of short words in long word by psini (Deacon) on Jul 14, 2009 at 20:43 UTC
I don't know if it is important for the OP, but your code doesn't find overlapping instances: `my $count =()= 'abrabrabrabra' =~ /brabra/g; print "Found $count instances\n" # prints Found 2 instances (not 3)` [download] Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."	[reply] [d/l]
Re^3: Find number of short words in long word by moritz (Cardinal) on Jul 14, 2009 at 20:50 UTC
... and if it's something you need to fix, you can do so easily by packing everything but the first character into a look-ahead: `my $count =()= 'abrabrabrabra' =~ /b(?=rabra)/g;` [download] Of course in this case you know that the first possible overlap starts at the second be, so even `/bra(?=bra)/` works fine. (I haven't benchmarked it, but I suppose that non-look-around literals are a bit faster, due to optimizations regarding the match length).	[reply] [d/l] [select]
Re^4: Find number of short words in long word by AnomalousMonk (Archbishop) on Jul 14, 2009 at 23:37 UTC
Re^3: Find number of short words in long word by sedm1000 (Initiate) on Jul 14, 2009 at 20:54 UTC
Thanks greatly for the start on this... All possible combinations are important - certainly over 20K sequences they'll likely all appear at least a couple of times. Overlapping instances are important - so "AGCTGT" would need to be scored; `AGCT GCTG CTGT TGTA etc. AGCTGT 1 1 1 0` [download] and so on...	[reply] [d/l]
Re^4: Find number of short words in long word by SuicideJunkie (Vicar) on Jul 14, 2009 at 22:50 UTC