Re: Exact string matching
by eyepopslikeamosquito (Archbishop) on Oct 16, 2011 at 10:52 UTC
|
I have wrote a script which will find me, how many times a given word has repeated
Please show us your code.
This is not a code writing service.
Please read How do I post a question effectively?
| [reply] |
|
|
Dear Monk, I'm so sorry for being very immature, this is my first time posting a question, so plz forgive my immaturity
open(HD,"file") or die ("Cant open");
$text=<HD>;
$text=~s/ //g;
chomp $text;
$pattern="word";
$offset = 0;
$pos=index $text,$pattern,$offset;
while ($pos != -1)
{
print "Found $pattern at $pos\n";
$offset = $pos + 1;
$pos = index($text, $pattern, $offset);
}
| [reply] [d/l] |
|
|
Looking at what you are trying to achieve, here is the code
use Data::Dumper;
open (HAN,'employee.pm');
my $cont = <HAN>; # assume $cont = 'package Employee df df';
my %hash = ();
while ( $cont =~ m/(\w+)/g )
{ $hash{$1}++;
}
print Dumper(\%hash); --------- output
$VAR1 = {
'Employee' => 1,
'df' => 2,
'package' => 1 };
it prints how many time each word occured .. | [reply] [d/l] |
|
|
|
|
(I was bit carried away...sry for my poor formating earlier)
Dear Ram, Thank you very much for your kind assistance but this works only if the words in the file is separated with a defined spacer such as a white space, what if the file contains only strings without any spacer (junk of characters or sequence of characters to be precise). That's where I am stuck. I need to find the number of occurrence of all possible substrings, that to in linear time (sry, that I was not clear).
example:
$text = 'howdoidoit'
and the answer should be like
For substring of length 3
how - 1
owd - 1
wdo - 1
doi - 2
oid - 1
ido - 1
oit - 1
| [reply] |
|
|
$text = 'howdoidoit';;
print for unpack '(a3X2)*', $text;;
how
owd
wdo
doi
oid
ido
doi
oit
it
it
print for unpack '(a4X3)*', $text;;
howd
owdo
wdoi
doid
oido
idoi
doit
oit
oit
oit
You have to discard the last n-1 results but that is very quick and simple to do.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
|
|
|
|
foreach ($cont =~ m/([a-z]{3})/g ){
$hash{$_}++;
}
what do you mean by liner time?
and lastly you need to modify the pattern depending on what you want, please work on it | [reply] [d/l] |
|
|
Re: Exact string matching
by Caio (Acolyte) on Oct 17, 2011 at 13:16 UTC
|
Well, since I'm trying to develop my perl skills to use in bioinformatics problems such as this, I thought I'd give it a shot:
use warnings;
use strict;
use Data::Dumper;
my %word_counts = ();
my $DNA = "CGTAGATCCAGTCGA"; # set for the test code, actual dna shoul
+d be parsed into a single line string with no whitespace
my $cur_len = 3; #set curent word length to minimum word length
my $max_len = (length $DNA) -1; #set maximum word length, set here to
+avoid recalculating $DNA length for every iteration
for (;$cur_len <= $max_len; $cur_len++){ #for each word length
my $last_pos = (length $DNA) -$cur_len; #again, set to avoid recalc
+ulating for every iteration
for (my $pos = 0; $pos <= $last_pos; $pos++){
$DNA =~ m/^.{$pos}(.{$cur_len})/;
$word_counts{$1}++;
}
}
print Dumper(\%word_counts);
exit;
The bottleneck here would be the ammount of word lengths you search. You could try tweaking that into fixed ranges for multiple program runs if you need to run it quickly. Or at least that's how I'd do it if it was me.
Hope it helps :)
PS: Would any of the fellow monks be kind to tell me if there's a way for the code tag not to break and wrap lines so shortly?
UPDATE: Just realized that code would probably consider AtC and ATC different words, so when you get your DNA sequence into the variable you should also make sure it's all upper or lower cased. like:
$DNA = "\U$DNA"; | [reply] [d/l] [select] |
Re: Exact string matching
by baxy77bax (Deacon) on Oct 17, 2011 at 15:04 UTC
|
Hi,
well though the fellow monks have posted some interesting solutions, and that is why this forum is much better than any other, i think that the real answer, the one that meets your requirements, is : "You cannot!" What do i mean by this, well i have been into string bizz for a while now, and dealing with the exact problem that you have stated and those bit more complicated. First, to meet the requirements in terms of ram you will not achieve this in perl. try c, and even then you will have some high constants associated with your data structure. Second, the best linear time, proportional to the size of the pattern, searches, can be achieved by using suffix trees (Ukkonnen). There is a perl module (module) that is extremely good for this and i would suggest buying some extra ram and using it instead writing the tree by yourself in c. It has an O(P) search time capabilities and as far as the "additional constants" associated with perl goes,well you can in fact neglect those since the tree will use much more space(RAM).
This was a comment to your statement:"...this is not efficient as I have to deal with file sized over GB's..."
Now as GrandFather always keeps reminding me:
"Note that "efficient" probably isn't the same as "fast"..."
and almost always is right, maybe this isn't what you need. if i were you i would look into bit-vectors -> through them you can compress your DNA by 75%, meaning 75% less memory required and 75% shorter the search space, that on a 2GB data will outperform(in terms of effective efficiency) the suffix tree and O(P) time complexity, though it doesn't run in linear time. However if you are thinking of using it on larger data sets, well welcome to the club and good luck !
Cheers
Baxy | [reply] |
|
|
Dear Monk,
Thank you very much for your wonderful and meaningful post. ya i bet, this is a wonderful platform where you get such caring guidance, there's no other place for perl other than this. I have already tried the recommended module that you have mentioned, but the problem with that is, it makes use of an external library written in C and more over that library is near an abandonware (stated by the author, i.e. he will accept the patches but has no idea to do by himself) and I cant modify it as to my needs because i am not good at C (my background is life science). I even tried to implement Suffix array and to my surprise there are no known perl module(if i am not mistaken) but havn't gave up i am trying it still. and ya what was that "bit vectors" all about, it seems interesting (Sry I havn't heard of it) could you shed some light over it or direct me to some digestible article so that i could grasp it (as i am not from computer science background)
| [reply] |
Re: Exact string matching
by saranrsm (Acolyte) on Oct 17, 2011 at 06:16 UTC
|
Dear Monks,
Thank you very much every one. I didn't expect this many replies.
=>Ram:By Linear time I means O(n), which deals with time complexity.
=>BrowserUK & AnomalousMonk:Thank you very much guys that was a very gud kick start.
So as BrowseUK and AnomalousMonk suggested I have attached the script, which I have modified as per the requirements but the problem with this is it took 3mins to run on a file sized only 7 MB. this is not efficient as I have to deal with file sized over GB's. So now how do I optimize it for file sized over GB's or is there any other efficient way to accomplished this.
Note: I have noticed that using an array slows down the run time.Isn't much a big discovery, but can we use any alternate.
use List::MoreUtils qw(uniq);
open(HD,"File") or die ("Cant open");
$text=<HD>;
chomp $text;
my $n = 6;
my $back = $n - 1;
my @array = unpack qq{(a$n X$back)*}, $text;
@unpacked=uniq @array;
for($i=0;$i<=$#unpacked-$back;$i++)
{
$pattern= $unpacked[$i];
$offset = 0;
print "\n$pattern ($n)\n";
print '~' x $n,"\n";
$pos=index $text,$pattern,$offset;
while ($pos != -1)
{
print $pos+1," to ",$pos+$n,"\n";
$offset = $pos + 1;
$pos = index($text, $pattern, $offset);
}
}
| [reply] [d/l] |
|
|
Why do you want to do this thing? Very likely we can make better suggestions if you give us the bigger picture of what you are trying to achieve.
Note that "efficient" probably isn't the same as "fast". Note too that nested loops most often imply O(n2). Using a hash can turn that back into O(n).
True laziness is hard work
| [reply] |
|
|
Dear Monk,
Here's my exact problem.
I want to find out the positions and the frequency of all possible repeated substrings of fixed length from a given file which looks like this (http://www.ncbi.nlm.nih.gov/nuccore/AC245882.2?report=fasta&log$=seqview&format=text).
eg)$text='ATCGCTATCGC'
For substring of length 3
ATC => Occurred 2 times, at 0 and 6
CGC => Occurred 2 times, at 2 and 8
For substring of length 4
ATCG => Occurred 2 times, at 0 and 6
TCGC => Occurred 2 times, at 1 and 7
and so on..
If I used my script (which I have posted), it would take ages to complete for the above linked file.
if I am not clear yet, please let me know.
| [reply] |
|
|
|
|
|