Exact string matching

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Exact string matching by eyepopslikeamosquito (Archbishop) on Oct 16, 2011 at 10:52 UTC
I have wrote a script which will find me, how many times a given word has repeated Please show us your code. This is not a code writing service. Please read How do I post a question effectively?	[reply]
Re^2: Exact string matching by Anonymous Monk on Oct 16, 2011 at 11:08 UTC
Dear Monk, I'm so sorry for being very immature, this is my first time posting a question, so plz forgive my immaturity `open(HD,"file") or die ("Cant open"); $text=<HD>; $text=~s/ //g; chomp $text; $pattern="word"; $offset = 0; $pos=index $text,$pattern,$offset; while ($pos != -1) { print "Found $pattern at $pos\n"; $offset = $pos + 1; $pos = index($text, $pattern, $offset); }` [download]	[reply] [d/l]
Re^3: Exact string matching by ramprasad27 (Sexton) on Oct 16, 2011 at 11:41 UTC
Looking at what you are trying to achieve, here is the code `use Data::Dumper; open (HAN,'employee.pm'); my $cont = <HAN>; # assume $cont = 'package Employee df df'; my %hash = (); while ( $cont =~ m/(\w+)/g ) { $hash{$1}++; } print Dumper(\%hash); --------- output $VAR1 = { 'Employee' => 1, 'df' => 2, 'package' => 1 };` [download] it prints how many time each word occured ..	[reply] [d/l]
Re^4: Exact string matching by Anonymous Monk on Oct 16, 2011 at 12:40 UTC
Re^3: Exact string matching by Anonymous Monk on Oct 16, 2011 at 12:52 UTC
(I was bit carried away...sry for my poor formating earlier) Dear Ram, Thank you very much for your kind assistance but this works only if the words in the file is separated with a defined spacer such as a white space, what if the file contains only strings without any spacer (junk of characters or sequence of characters to be precise). That's where I am stuck. I need to find the number of occurrence of all possible substrings, that to in linear time (sry, that I was not clear). example: $text = 'howdoidoit' and the answer should be like For substring of length 3 how - 1 owd - 1 wdo - 1 doi - 2 oid - 1 ido - 1 oit - 1	[reply]
Re^4: Exact string matching by BrowserUk (Patriarch) on Oct 16, 2011 at 20:05 UTC
The fastest way to n-tuple long strings is using unpack: `$text = 'howdoidoit';; print for unpack '(a3X2)', $text;; how owd wdo doi oid ido doi oit it it print for unpack '(a4X3)', $text;; howd owdo wdoi doid oido idoi doit oit oit oit` [download] You have to discard the last n-1 results but that is very quick and simple to do. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l]
Re^5: Exact string matching by AnomalousMonk (Archbishop) on Oct 16, 2011 at 23:29 UTC
Re^4: Exact string matching by ramprasad27 (Sexton) on Oct 16, 2011 at 16:27 UTC
try this `foreach ($cont =~ m/([a-z]{3})/g ){ $hash{$_}++; }` [download] what do you mean by liner time? and lastly you need to modify the pattern depending on what you want, please work on it	[reply] [d/l]
Re^5: Exact string matching by Corion (Patriarch) on Oct 16, 2011 at 16:30 UTC
Re: Exact string matching by Caio (Acolyte) on Oct 17, 2011 at 13:16 UTC
Well, since I'm trying to develop my perl skills to use in bioinformatics problems such as this, I thought I'd give it a shot: use warnings; use strict; use Data::Dumper; my %word_counts = (); my $DNA = "CGTAGATCCAGTCGA"; # set for the test code, actual dna shoul +d be parsed into a single line string with no whitespace my $cur_len = 3; #set curent word length to minimum word length my $max_len = (length $DNA) -1; #set maximum word length, set here to +avoid recalculating $DNA length for every iteration for (;$cur_len <= $max_len; $cur_len++){ #for each word length my $last_pos = (length $DNA) -$cur_len; #again, set to avoid recalc +ulating for every iteration for (my $pos = 0; $pos <= $last_pos; $pos++){ $DNA =~ m/^.{$pos}(.{$cur_len})/; $word_counts{$1}++; } } print Dumper(\%word_counts); exit; [download] The bottleneck here would be the ammount of word lengths you search. You could try tweaking that into fixed ranges for multiple program runs if you need to run it quickly. Or at least that's how I'd do it if it was me. Hope it helps :) PS: Would any of the fellow monks be kind to tell me if there's a way for the code tag not to break and wrap lines so shortly? UPDATE: Just realized that code would probably consider AtC and ATC different words, so when you get your DNA sequence into the variable you should also make sure it's all upper or lower cased. like: `$DNA = "\U$DNA";`	[reply] [d/l] [select]
Re: Exact string matching by baxy77bax (Deacon) on Oct 17, 2011 at 15:04 UTC
Hi, well though the fellow monks have posted some interesting solutions, and that is why this forum is much better than any other, i think that the real answer, the one that meets your requirements, is : "You cannot!" What do i mean by this, well i have been into string bizz for a while now, and dealing with the exact problem that you have stated and those bit more complicated. First, to meet the requirements in terms of ram you will not achieve this in perl. try c, and even then you will have some high constants associated with your data structure. Second, the best linear time, proportional to the size of the pattern, searches, can be achieved by using suffix trees (Ukkonnen). There is a perl module (module) that is extremely good for this and i would suggest buying some extra ram and using it instead writing the tree by yourself in c. It has an O(P) search time capabilities and as far as the "additional constants" associated with perl goes,well you can in fact neglect those since the tree will use much more space(RAM). This was a comment to your statement:"...this is not efficient as I have to deal with file sized over GB's..." Now as GrandFather always keeps reminding me: "Note that "efficient" probably isn't the same as "fast"..." and almost always is right, maybe this isn't what you need. if i were you i would look into bit-vectors -> through them you can compress your DNA by 75%, meaning 75% less memory required and 75% shorter the search space, that on a 2GB data will outperform(in terms of effective efficiency) the suffix tree and O(P) time complexity, though it doesn't run in linear time. However if you are thinking of using it on larger data sets, well welcome to the club and good luck ! Cheers Baxy	[reply]
Re^2: Exact string matching by saranrsm (Acolyte) on Oct 17, 2011 at 18:23 UTC
Dear Monk, Thank you very much for your wonderful and meaningful post. ya i bet, this is a wonderful platform where you get such caring guidance, there's no other place for perl other than this. I have already tried the recommended module that you have mentioned, but the problem with that is, it makes use of an external library written in C and more over that library is near an abandonware (stated by the author, i.e. he will accept the patches but has no idea to do by himself) and I cant modify it as to my needs because i am not good at C (my background is life science). I even tried to implement Suffix array and to my surprise there are no known perl module(if i am not mistaken) but havn't gave up i am trying it still. and ya what was that "bit vectors" all about, it seems interesting (Sry I havn't heard of it) could you shed some light over it or direct me to some digestible article so that i could grasp it (as i am not from computer science background)	[reply]
Re: Exact string matching by saranrsm (Acolyte) on Oct 17, 2011 at 06:16 UTC
Dear Monks, Thank you very much every one. I didn't expect this many replies. =>Ram:By Linear time I means O(n), which deals with time complexity. =>BrowserUK & AnomalousMonk:Thank you very much guys that was a very gud kick start. So as BrowseUK and AnomalousMonk suggested I have attached the script, which I have modified as per the requirements but the problem with this is it took 3mins to run on a file sized only 7 MB. this is not efficient as I have to deal with file sized over GB's. So now how do I optimize it for file sized over GB's or is there any other efficient way to accomplished this. Note: I have noticed that using an array slows down the run time.Isn't much a big discovery, but can we use any alternate. `use List::MoreUtils qw(uniq); open(HD,"File") or die ("Cant open"); $text=<HD>; chomp $text; my $n = 6; my $back = $n - 1; my @array = unpack qq{(a$n X$back)*}, $text; @unpacked=uniq @array; for($i=0;$i<=$#unpacked-$back;$i++) { $pattern= $unpacked[$i]; $offset = 0; print "\n$pattern ($n)\n"; print '~' x $n,"\n"; $pos=index $text,$pattern,$offset; while ($pos != -1) { print $pos+1," to ",$pos+$n,"\n"; $offset = $pos + 1; $pos = index($text, $pattern, $offset); } }` [download]	[reply] [d/l]
Re^2: Exact string matching by GrandFather (Saint) on Oct 17, 2011 at 06:39 UTC
Why do you want to do this thing? Very likely we can make better suggestions if you give us the bigger picture of what you are trying to achieve. Note that "efficient" probably isn't the same as "fast". Note too that nested loops most often imply O(n²). Using a hash can turn that back into O(n). True laziness is hard work	[reply]
Re^3: Exact string matching by saranrsm (Acolyte) on Oct 17, 2011 at 07:10 UTC
Dear Monk, Here's my exact problem. I want to find out the positions and the frequency of all possible repeated substrings of fixed length from a given file which looks like this (http://www.ncbi.nlm.nih.gov/nuccore/AC245882.2?report=fasta&log$=seqview&format=text). eg)$text='ATCGCTATCGC' For substring of length 3 ATC => Occurred 2 times, at 0 and 6 CGC => Occurred 2 times, at 2 and 8 For substring of length 4 ATCG => Occurred 2 times, at 0 and 6 TCGC => Occurred 2 times, at 1 and 7 and so on.. If I used my script (which I have posted), it would take ages to complete for the above linked file. if I am not clear yet, please let me know.	[reply]
Re^4: Exact string matching by GrandFather (Saint) on Oct 17, 2011 at 07:41 UTC
Re^5: Exact string matching by saranrsm (Acolyte) on Oct 17, 2011 at 10:01 UTC
Some notes below your chosen depth have not been shown here