nikkk has asked for the wisdom of the Perl Monks concerning the following question:

Hello everybody,

i'm a young biologist and new user of Perl, i updated my old ''homemade blat'' script, now it creates an hash table of oligos where the length of the keys is taken from the length of the query you want to search that you can type from keyboard. The search is made in a sequence that you open with @ARGV. The program then tells how many times and where a certain oligo is located.

use strict; my @bases; my @list; my @List; my %oligo = (); my %pos = (); my $base; my $list; my $oligo; my $f; my $nomefile; my $line; my $piece; my $query; my $len; my $sequenza; my $i; my $rev; my $revcomp; my $length; print("Insert query:\n"); $query = <STDIN>; chomp($query); $len = length($query); $query = uc $query; @bases = qw(A G C T); #bases @list = qw(A G C T); #seed for ($i = 0; $i < ($len - 1); $i++) { foreach $base (@bases) { foreach $list (@list) { $oligo{$base.$list} = 1; } @List = keys %oligo; } @list = keys %oligo; %oligo = (); #print "@list xxx \n"; } %oligo = (); @List = sort @List; $nomefile = $ARGV[0]; open $f, "< $nomefile" or die "cannot open $nomefile: $!"; $line = <$f>; chomp($line); if(substr($line,0,1) ne ">") { print STDERR ("Input file $nomefile not in FASTA format\n"); exit; } $sequenza = <$f>; while($line = <$f>) { chomp($line); if(substr($line, 0, 1) ne ">") { $sequenza = $sequenza.$line; } } $sequenza = uc $sequenza; for($i = 0; $i < length($sequenza) - ($len - 1); $i++) { $piece = substr($sequenza, $i, $len); if(exists $oligo{$piece}) { $oligo{$piece} = $oligo{$piece} + 1; push(@{$pos{$piece}} , $i ); } else { $oligo{$piece} = 1; push(@{$pos{$piece}} , $i ); } } foreach $piece (sort keys %oligo) { print("$piece\t$oligo{$piece}\t@{$pos{$piece}}\n\n"); } $rev = reverse $query; $rev =~ tr/ATCG/TAGC/; $revcomp = $rev; if(exists $oligo{$query} || exists $oligo{$revcomp}) { print("$query\t appear $oligo{$query} times\t in position @{$pos{$ +query}} on positive strand\n"); } else { print("$revcomp\t appear $oligo{$revcomp} times\t in position ($le +ngth - @{$pos{$query}}) on negative strand\n"); }

That's it. I'm searching for a method to calculate how many times an oligo appears on the negative strand and the positions where it appears. Anyone can tell me if it's correct? Any suggestions? Thank you very much!

Replies are listed 'Best First'.
Re: homemade blat: need help with reverse string.
by ww (Archbishop) on Dec 14, 2015 at 22:04 UTC

    perldoc -f index will help with the where. One standard technique for counting instances that you could apply would be to create a $var in last your else clause and increment it each time you find a sequence that satisfies your intent. Since you're working with hashes, other approaches may work better.

    Super Searching for nodes with include the words "count" and "hash" may lead you to numerous answers.

    And please, most of us are NOT bioinformatics scholars -- I have no more than a rough idea from context or what an "oligo" or a "negative stand" is-- and many of us have rather draconian limits on how much time we can spend helping... which means triage of nodes that are "TL,DR" (Too long, didn't read). Your 134 lines is a tad longer than the recommendation (in PerlMonks FAQ particularly what you'll find in Posting on PerlMonks ) that you boil your code down to a self-contained, concise (20-30 Lns or less) example that illustrates your issue.

Re: homemade blat: need help with reverse string.
by Athanasius (Cardinal) on Dec 15, 2015 at 03:07 UTC

    Hello nikkk,

    As ww says, those of us who are not biologists can only guess at what your script is doing. It would help us a lot if you included a (short!) example, showing input data (the contents of $query and $nomefile) together with the desired output. In the meantime, I’ll offer two pieces of advice on your code:

    First, always use warnings; As it happens, this won’t produce any warnings for your existing code. Nevertheless, it’s something you should get in the habit of using for every script.1 Why operate without a safety net when you don’t have to?

    Second, declare variables only when you need to, not all together in a block at the head of your script. This is better style, and it has the advantage that it makes it much easier to see the relation between a variable’s declaration and its use(s). In the present case, the variable $length is declared near the top of the script, and then not used again until the very end:

    my $length; ... if(exists $oligo{$query} || exists $oligo{$revcomp}) { ... } else { print("$revcomp\t appear $oligo{$revcomp} times\t in position ($le +ngth - @{$pos{$query}}) on negative strand\n"); # ^^^ +^^^^ }

    — at which point it is guaranteed to have a value of undef, which converts numerically to zero. I doubt this is what you intended. Declaring $length as late as possible (i.e., immediately before the point at which it is first used) would have alerted you to the fact that it’s not being assigned its correct value.

    Update: Also, the scalar variable $oligo is declared but never used. This fact is somewhat obscured by the various appearances of $oligo{...}, but they reference the hash variable %oligo.

    1Yes, even one-liners. perl -wE is just as easy to type as perl -E. :-)

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: homemade blat: need help with reverse string.
by Cristoforo (Curate) on Dec 15, 2015 at 17:48 UTC
    This seems to be similar to what you asked here Question: homemade blat: insert query from keyboard to find an oligo in an hashtable.

    If I understand your problem, this program would be a solution.

    #!/usr/bin/perl use strict; use warnings; print "Insert query:\n"; chomp(my $query = uc <STDIN>); print "\n"; my $nomefile = $ARGV[0]; open my $f, '<', $nomefile or die "Unable to open $nomefile. $!"; chomp(my $id_line = <$f>); die "Input file $nomefile not in FASTA format\n" if substr($id_line,0,1) ne ">"; my $sequenza; while (my $line = <$f>) { chomp $line; if (substr($line, 0, 1) eq '>') { report_matches($id_line, $sequenza, $query); $sequenza = ''; $id_line = $line; } else { $sequenza .= uc $line; } # necessary to report last record if eof report_matches($id_line, $sequenza, $query) if eof; } close $f or die "Unable to close $nomefile. $!"; sub report_matches { my ($id, $sequenza, $query) = @_; my %pos; my $revcomp = reverse $query =~ tr/ATCG/TAGC/r; my $len_query = length $query; for my $i (0 .. length($sequenza)-$len_query) { my $piece = substr $sequenza, $i, $len_query; if ($piece eq $query) { push @{ $pos{q} }, $i; } elsif ($piece eq $revcomp) { push @{ $pos{rc} }, $i; } } print "For id: $id\n"; if ($pos{q}) { printf "QUERY: $query\tappears %d times\t position @{ $pos{q} +}\n", scalar @{ $pos{q} }; } if ($pos{rc}) { printf "REVCOMP: $revcomp\tappears %d times\tposition @{ $pos{ +rc} }\n", scalar @{ $pos{rc} }; } print "\n"; }
    Some output from a fasta file.
    C:\Old_Data\perlp>perl test3.pl fasta_dat.txt Insert query: ttttt For id: >NR_037701 1 QUERY: TTTTT appears 6 times position 615 680 1114 1374 1428 2322 REVCOMP: AAAAA appears 47 times position 1640 2010 3304 3305 3 +306 3307 3308 3309 3 310 3311 3312 3313 3314 3315 3316 3317 3318 3319 3320 3321 3322 3323 3 +324 3325 3326 3327 3 328 3329 3330 3331 3332 3333 3334 3335 3336 3337 3338 3339 3340 3341 3 +342 3343 3344 3345 3 346 3347 3348 For id: >NM_198399 1 QUERY: TTTTT appears 19 times position 182 908 920 1014 116 +9 1191 1192 1313 131 4 1315 1316 1317 1472 1671 1705 1706 1956 2389 2390 REVCOMP: AAAAA appears 20 times position 591 953 1051 1052 127 +5 2149 2364 2373 237 4 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 For id: >NR_026816 1 QUERY: TTTTT appears 3 times position 371 372 384 REVCOMP: AAAAA appears 3 times position 593 594 595 For id: >NR_027917 1 QUERY: TTTTT appears 1 times position 339 For id: >NR_002777 3 QUERY: TTTTT appears 2 times position 921 922 REVCOMP: AAAAA appears 25 times position 286 577 975 976 977 9 +78 979 980 981 982 9 83 984 985 986 987 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 For id: >NR_033769 1 QUERY: TTTTT appears 7 times position 75 1176 1177 1229 1230 1321 +1322 REVCOMP: AAAAA appears 6 times position 629 1590 1591 1592 1598 1609 For id: >NM_016326 3 QUERY: TTTTT appears 1 times position 368 REVCOMP: AAAAA appears 5 times position 321 337 338 339 340 For id: >NM_181641 2 QUERY: TTTTT appears 4 times position 255 256 324 527 REVCOMP: AAAAA appears 5 times position 480 496 497 498 499 For id: >NM_001144931 1 For id: >NR_029429 1 For id: >NR_026551 1 QUERY: TTTTT appears 1 times position 415 For id: >NM_181640 2 QUERY: TTTTT appears 1 times position 464 REVCOMP: AAAAA appears 5 times position 417 433 434 435 436 For id: >NM_016951 3 QUERY: TTTTT appears 4 times position 255 256 324 623 REVCOMP: AAAAA appears 5 times position 576 592 593 594 595 For id: >NR_002773 1 QUERY: TTTTT appears 1 times position 1377 REVCOMP: AAAAA appears 2 times position 1929 1937 For id: >NR_037806 1 QUERY: TTTTT appears 6 times position 693 837 838 839 840 1114 REVCOMP: AAAAA appears 3 times position 1292 1293 1322
      Hello, Thank you for your kind response! I worked on it and this program does everything i need
      use strict; my @bases; my @list; my @List; my %oligo = (); my %pos = (); my $base; my $list; my $f; my $nomefile; my $line; my $piece; my $query; my $len; my $sequenza; my $i; my $j; my $rev; my $revcomp; my $oligo; my $length = length($sequenza); my $piececomp; print("Insert query:\n"); $query = <STDIN>; chomp($query); $len = length($query); $query = uc $query; @bases = qw(A G C T); #bases @list = qw(A G C T); #seed for ($i = 0; $i < ($len - 1); $i++) { foreach $base (@bases) { foreach $list (@list) { $oligo{$base.$list} = 1; } @List = keys %oligo; } @list = keys %oligo; %oligo = (); #print "@list xxx \n"; } %oligo = (); @List = sort @List; $nomefile = $ARGV[0]; open $f, "< $nomefile" or die "cannot open $nomefile: $!"; $line = <$f>; chomp($line); if(substr($line,0,1) ne ">") { print STDERR ("Input file $nomefile not in FASTA format\n"); exit; } $sequenza = <$f>; while($line = <$f>) { chomp($line); if(substr($line, 0, 1) ne ">") { $sequenza = $sequenza.$line; } } $sequenza = uc $sequenza; for($i = 0; $i < length($sequenza) - ($len - 1); $i++) # nell'ultimo f +rame di lettura non deve contare la len intera se no va 'fuori sequen +za' e non trova l'oligo corrispondente { $piece = substr($sequenza, $i, $len); if(exists $oligo{$piece}) { $oligo{$piece} = $oligo{$piece} + 1; push(@{$pos{$piece}} , $i ); } else { $oligo{$piece} = 1; push(@{$pos{$piece}} , $i ); } } foreach $piece (sort keys %oligo) { print("$piece\t$oligo{$piece}\t@{$pos{$piece}}\n\n"); } $rev= reverse $query; $rev=~ tr/ACTG/TGAC/; $revcomp = $rev; for($j = $length; $j < $length - $len; $j--) { $revcomp = substr($sequenza, $j, $len); if(exists $oligo{$revcomp}) { push(@{$pos{$revcomp}} , ($length - $j) - ($len - 1)); } } if(exists $oligo{$query} && !exists $oligo{$revcomp}) { print("$query\t compare $oligo{$query} volte\t in posizione @{$pos +{$query}} solo sul filamento positivo\n"); } else { if(exists $oligo{$query} && exists $oligo{$revcomp}) { print("$query\t compare $oligo{$query} volte\t in posizione @{ +$pos{$query}} sul filamento positivo\n\ne compare $oligo{$revcomp} vo +lte\t in posizione @{$pos{$revcomp}}\t sul filamento negativo\n"); } else { if(!exists $oligo{$query} && exists $oligo{$revcomp}) { print("$query compare $oligo{$revcomp} volte\t in posizion +e @{$pos{$revcomp}}\t sul filamento negativo\n"); } else { print("$query non compare nella sequenza\n"); } } }
      It tells me if an oligo appears only in positive strand, in positive and negative and only in negative strand by searching in an hash table i created using the length of the query written from keyboard. I'm searching an oligo on the negative strand by searching its reverse complementary on the positive strand. The sequence i'm working on begins with GATCAC and end with ACGATG. The program works fine with these sequence, but when i search their reverse complementary, the position on the negative strand are wrong (if i search the reverse complementary of the first oligo in the positive strand, it should be in the last position of the negative strand, instead the program tells me that it is in the first one). This is the problem! However you've done a great work, and it should help me with my problem!

        Looking at just $length and $len, the code shown has the following content:

        ... my $len; my $sequenza; ... my $j; ... my $length = length($sequenza); ... print "Insert query:\n"; $query = <STDIN>; chomp $query; $len = length($query); ... for($j = $length; $j < $length - $len; $j--) { ... } ...

        As you can see, when $length is initialized, $sequenza is undefined, so $length still has a numerical value of zero (as I pointed out above).

        However, even if $length were to be initialized correctly before the for loop, the loop condition still looks wrong to me. $j is initially assigned the value in $length, and is then decremented on each iteration. The stopping condition is $j < $length - $len, and neither $length nor $len is changed in the loop body. So if the condition is initially true, i.e., if $j is initially less than some fixed value, decrementing $j will never make the condition false (well, not until an integer underflow occurs, anyway), which means you effectively have an infinite loop.

        Hope that helps,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Indeed you're code was very very interersting, but i needed to create that hash table containing the oligos! However i finally came up on how to calculate the right positions of oligos on the negative strand thanks to your program (and looking up on how they were evalueted on the UCSC browser BLAT). Thank you for your kind help!
Re: homemade blat: need help with reverse string.
by GotToBTru (Prior) on Dec 15, 2015 at 13:50 UTC

    I think you are doing a bunch of extra work. You populate %oligo with every possible $query-sized substrand, and match by hash key. It would have to be easier to do basic string matching.

    Instead of building those hashes, simply compare $query and $revcomp to the substring within your for loop, report the position found at that time, and count up how many matches to report at the end. Better still, as suggested, use the index function and let perl do the work of matching the string.

    Dum Spiro Spero
Re: homemade blat: need help with reverse string.
by Anonymous Monk on Dec 15, 2015 at 13:55 UTC

    Be careful that @List is reset each time and held the value only to the last value with Thymine as first base eg. TA, TG ...