in reply to Exact string matching

Dear Monks,

Thank you very much every one. I didn't expect this many replies.

=>Ram:By Linear time I means O(n), which deals with time complexity.

=>BrowserUK & AnomalousMonk:Thank you very much guys that was a very gud kick start.

So as BrowseUK and AnomalousMonk suggested I have attached the script, which I have modified as per the requirements but the problem with this is it took 3mins to run on a file sized only 7 MB. this is not efficient as I have to deal with file sized over GB's. So now how do I optimize it for file sized over GB's or is there any other efficient way to accomplished this.

Note: I have noticed that using an array slows down the run time.Isn't much a big discovery, but can we use any alternate.
use List::MoreUtils qw(uniq); open(HD,"File") or die ("Cant open"); $text=<HD>; chomp $text; my $n = 6; my $back = $n - 1; my @array = unpack qq{(a$n X$back)*}, $text; @unpacked=uniq @array; for($i=0;$i<=$#unpacked-$back;$i++) { $pattern= $unpacked[$i]; $offset = 0; print "\n$pattern ($n)\n"; print '~' x $n,"\n"; $pos=index $text,$pattern,$offset; while ($pos != -1) { print $pos+1," to ",$pos+$n,"\n"; $offset = $pos + 1; $pos = index($text, $pattern, $offset); } }

Replies are listed 'Best First'.
Re^2: Exact string matching
by GrandFather (Saint) on Oct 17, 2011 at 06:39 UTC

    Why do you want to do this thing? Very likely we can make better suggestions if you give us the bigger picture of what you are trying to achieve.

    Note that "efficient" probably isn't the same as "fast". Note too that nested loops most often imply O(n2). Using a hash can turn that back into O(n).

    True laziness is hard work
      Dear Monk,
      
      Here's my exact problem.
      I want to find out the positions and the frequency of all possible repeated substrings of fixed length from a given file which looks like this (http://www.ncbi.nlm.nih.gov/nuccore/AC245882.2?report=fasta&log$=seqview&format=text).
      
      eg)$text='ATCGCTATCGC'
      
      For substring of length 3
      
      ATC => Occurred 2 times, at 0 and 6
      CGC => Occurred 2 times, at 2 and 8
      
      For substring of length 4
      
      ATCG => Occurred 2 times, at 0 and 6
      TCGC => Occurred 2 times, at 1 and 7
      
      and so on..
      
      If I used my script (which I have posted), it would take ages to complete for the above linked file.
      
      if I am not clear yet, please let me know.
      

        What are the minimum and maximum substring lengths you need to work with for your real data? With what you imply here this is never going to be a fast task! Oh, and you still haven't told us what you are trying to do! The more detail you give us about what you are trying to achieve the better we can help. So far you really haven't given us much.

        You may be interested in Fast common substring matching which attempts to solve a problem that on the face of it may be similar to the problem you are working on.

        True laziness is hard work