OK, now I understand. First off, read about the index function: it gives you positions. I _think_ you want something like the code below. Granted, it only looks for back-to-back repeats of the same pattern directly after each match. You also have not told us whether gaps are allowed, or whether you really have 1s and 0s rather than GATCs. Python still seems the better way to go; just read this: http://codereview.stackexchange.com/questions/12522/simple-dna-sequence-finder-w-mismatch-tolerance
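For reference, here is a minimal sketch of how index reports positions: it returns the zero-based offset of the first match at or after the given start position, or -1 when there is none. The helper name all_positions is made up for this illustration and is not part of the script in this post.

```perl
use strict;
use warnings;

# Hypothetical helper: collect every zero-based offset at which $pattern
# occurs in $text, overlapping matches included.
sub all_positions {
    my ($text, $pattern) = @_;
    return () unless length $pattern;   # index() matches "" everywhere
    my @found;
    my $from = 0;
    while ((my $at = index($text, $pattern, $from)) != -1) {
        push @found, $at;
        $from = $at + 1;    # advance by one so overlapping matches are seen
    }
    return @found;
}

print join(",", all_positions("100100100010010110110101100100000", "100")), "\n";
```

Stepping by one character instead of by the pattern length is a choice made for this sketch; step by length($pattern) if overlapping hits should be skipped.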
Meanwhile, this finds a pattern together with any copies that immediately follow it, without allowing gaps:

use strict; 
use warnings; 
use Term::ANSIColor;

my $X = "100100100010010110110101100100000"; # or use File::Slurp
my $s = "100"; # my pattern $s
my $L = length($s); # length of pattern
my @C; # store colors for later

my $counter = 0;
my $baseposition = 0;
my $newindex = 0;
my $subsequenceposition = 0;

while(($newindex=index($X,$s,$baseposition))>=$baseposition){
  # ok, found something, now checking subsequences
  print "From $baseposition, found '$s' at position $newindex\n";
  push(@C, {pos=>$newindex,length=>$L,color=>'black on_yellow'});
  $subsequenceposition = $newindex + $L;
  print "iterations will start from $subsequenceposition, seeking...\n";
  
  while(substr($X,$subsequenceposition,$L) eq $s){
    $counter++; 
    push(@C, {pos=>$subsequenceposition,length=>$L,color=>'black on_green'});
    print "Found recurrence at $subsequenceposition ($counter recurrences found so far)\n";
    $subsequenceposition += $L;
  }
  
  print colored("Found sequence at $newindex with $counter recurrences", 'blue on_white') . "\n";
  
  # after the last recurrence, keep searching for our $s
  $baseposition = $subsequenceposition; 
  $counter = 0;
  print "Searching for more starting at $baseposition\n";
}

print "DONE\n";

# now print my sequence with colors
for my $p (sort {$b->{pos} <=> $a->{pos} } @C){
  substr($X, $p->{pos}+$p->{length}, 0) = color('reset');
  substr($X, $p->{pos}, 0) = color($p->{color});
  print $X . "\n";
}
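
If you want the runs as data rather than as printed output, the same idea can be wrapped in a subroutine. This is only a sketch: the name find_runs and the [start, copies] return shape are my own assumptions, and unlike $counter above it counts the first match too, so a pattern with no repeat after it is reported as 1 copy.

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of [start, copies] pairs, one per run
# of back-to-back copies of $pattern in $text (no gaps allowed).
sub find_runs {
    my ($text, $pattern) = @_;
    my $len = length $pattern;
    return () unless $len;              # an empty pattern would loop forever
    my @runs;
    my $from = 0;
    while ((my $start = index($text, $pattern, $from)) != -1) {
        my $copies = 1;
        my $next   = $start + $len;
        # keep consuming copies that sit directly behind the previous one
        while (substr($text, $next, $len) eq $pattern) {
            $copies++;
            $next += $len;
        }
        push @runs, [$start, $copies];
        $from = $next;                  # resume searching after this run
    }
    return @runs;
}

printf "start %d: %d copies\n", @$_
    for find_runs("100100100010010110110101100100000", "100");
```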

In reply to Re^5: How to match more than 32766 times in regex? by FreeBeerReekingMonk
in thread How to match more than 32766 times in regex? by rsFalse
