PerlMonks  

finding sequence

by yueli711 (Sexton)
on Jun 22, 2018 at 18:05 UTC ( [id://1217240]=perlquestion )

yueli711 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I want to find the sequences in tmp01 that are contained in the sequences in tmp02, then print all of them out. The tmp02 file is tab-separated. Thanks in advance!

tmp01
ATCCCACCGCTGCCACCA
ACCCTGCTCGCTGCGCCA
TCCCCGGCACCTCCACCA
TCCCCGGCATCTCCACCA
ATCCTGCCGACTACGCCA
TCGATTCCCGGCCCATGCACCA
TCGATTCCCGGCCAACGCACCA
GTCCCACCAGAGTCGCCA
ACCCCACTCCTGGTACCA
GTCCCTTCGTGGTCGCCA

tmp02
AACCCCATCCCACCGCTGCCACCA	1
AACCCCATCCTCGTCGCC	1
AACCCCATGAAATAAGAG	2
AACCCCATGATCAGGACAAG	1
AACCCCATTAAAAAATGG	1
AACTGGATTCTCTGAAATCCCACCGCTGCCACCA	1
AACTGGATTGTCTGTTTGT	1
AACTGGCAAGTTCAGGCATG	1
AACTGGCACACACAACC	1
AACTGGCACACACAACCT	1
open(IN1,"tmp01") || die "Cannot open this file";
@lines1 = <IN1>;
open(IN2,"tmp02") || die "Cannot open this file";
@lines2 = <IN2>;
open(OUT,">tmp03") || die "Cannot open this file";
for $item1(@lines1){
    chomp $item1;
    #print OUT $item1,"\t";
    @tmp1=split(/\t+/, $item1);
    for $item2(@lines2){
        chomp $item2;
        @tmp2=split(/\t+/, $item2);
        if ($tmp1[0] =~m/ *$tmp2[0]*){
            print OUT $item1,"\t",$item2;
            #last;
        }
        $i++
    }
    print OUT "\n";
}
close(IN1);
close(IN2);
close(OUT);

Replies are listed 'Best First'.
Re: finding sequence
by haukex (Archbishop) on Jun 22, 2018 at 19:40 UTC

    If I fix your regular expression (see perlretut), then note that $tmp1[0] =~ /\Q$tmp2[0]\E/ means "is the string $tmp2[0] contained in $tmp1[0]?". However, just as an example, $tmp1[0] may be "ATCCCACCGCTGCCACCA", while $tmp2[0] may be "AACCCCATCCCACCGCTGCCACCA" - so as you might be able to tell, you've got your condition reversed. Some better variable naming would really help here!
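The corrected condition described above can be sketched as follows. This is only a minimal illustration: the names `$short_seq`, `$long_seq`, and `contains` are hypothetical stand-ins for a tmp01 entry, a tmp02 entry, and the test itself.

```perl
use strict;
use warnings;

# Hypothetical names: $short_seq stands for a tmp01 entry,
# $long_seq for the first column of a tmp02 line.
my $short_seq = 'ATCCCACCGCTGCCACCA';
my $long_seq  = 'AACCCCATCCCACCGCTGCCACCA';

# Returns 1 if $needle occurs anywhere inside $haystack, else 0.
# \Q...\E quotes any regex metacharacters in the interpolated string.
sub contains {
    my ($haystack, $needle) = @_;
    return $haystack =~ /\Q$needle\E/ ? 1 : 0;
}

# The longer tmp02 string is the haystack, the tmp01 string the needle.
print "tmp02 entry contains tmp01 entry\n" if contains($long_seq, $short_seq);

# The reversed test (as in the original code) cannot match here.
print "reversed test matches\n" if contains($short_seq, $long_seq);
```

With descriptive names it becomes obvious which string should be searched and which one is being searched for.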

    You haven't shown your expected output, so I can't give any advice there other than to have a look at "How do I post a question effectively?", "SSCCE", and "I know what I mean. Why don't you?".

    A few general tips: Use strict and warnings (you've been told this twice before), and use the newer, three-argument form of open and lexical filehandles, as in open(my $infh1, '<', $filename) or die "Cannot open $filename: $!";
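As a rough sketch of those tips, the file reading could look like this. The helper name `read_lines` is hypothetical, and the temporary file only exists so that the example runs without tmp01 being present.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read all lines of a file and return them without line endings.
# Uses a lexical filehandle and the three-argument form of open.
sub read_lines {
    my ($filename) = @_;
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
    chomp(my @lines = <$fh>);
    close($fh);
    return @lines;
}

# Demo on a throwaway file (a stand-in for tmp01).
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "ATCCCACCGCTGCCACCA\nACCCTGCTCGCTGCGCCA\n";
close($tmp_fh) or die "Cannot close $tmp_name: $!";

my @sequences = read_lines($tmp_name);
print scalar(@sequences), " sequences read\n";
```

Including `$!` in the message tells you why the open failed, which "Cannot open this file" does not.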

    Update: By the way, you could build a single regex out of all of the search strings - I wrote a tutorial on this at Building Regex Alternations Dynamically.
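A minimal sketch of that idea, assuming the search strings are already in an array (here a hypothetical `@patterns` holding two entries from tmp01): quote each string, sort longer alternatives first, and join them with `|`.

```perl
use strict;
use warnings;

# Hypothetical sample: two of the search strings from tmp01.
my @patterns = qw(ATCCCACCGCTGCCACCA GTCCCACCAGAGTCGCCA);

# Longest alternatives first, so a longer string is tried before
# any shorter string that is a prefix of it; quotemeta protects
# against regex metacharacters in the data.
my $alternation = join '|',
    map  { quotemeta }
    sort { length($b) <=> length($a) or $a cmp $b } @patterns;
my $regex = qr/($alternation)/;

# Two sample lines from tmp02 (tab-separated).
for my $line ("AACCCCATCCCACCGCTGCCACCA\t1", "AACCCCATGAAATAAGAG\t2") {
    print "$1\t$line\n" if $line =~ $regex;
}
```

The regex is compiled once with `qr//` and then reused for every line, instead of testing each search string separately.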

      It really solved my problem! Thanks!

Re: finding sequence
by hippo (Bishop) on Jun 22, 2018 at 18:58 UTC
            if ($tmp1[0] =~m/ *$tmp2[0]*){

    Your match pattern isn't terminated. Perl tells you this:

    $ perl 1217240.pl
    Search pattern not terminated at 1217240.pl line 29.

    Why the random indentation? It makes your code so hard to read. I'm also mildly intrigued as to why every other line is blank.

Re: finding sequence
by Laurent_R (Canon) on Jun 23, 2018 at 09:03 UTC
    Since you're looking for exact matches, I believe that using the index function might be more efficient than regex matching. In general, index is faster than a regex for exact matches, but in this specific case, building one large pattern from all the search strings, as shown by tybalt89, might actually end up being faster.
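For illustration, an index-based version could look roughly like this. It is a sketch under assumptions: `find_with_index` is a hypothetical helper, and the arrays hold a few sample entries from the question rather than the full files.

```perl
use strict;
use warnings;

# Hypothetical sample data taken from the question.
my @patterns = qw(ATCCCACCGCTGCCACCA GTCCCACCAGAGTCGCCA);
my @lines    = (
    "AACCCCATCCCACCGCTGCCACCA\t1",
    "AACCCCATCCTCGTCGCC\t1",
    "AACTGGATTCTCTGAAATCCCACCGCTGCCACCA\t1",
);

# For every line, report each pattern that occurs in its first column.
# index returns -1 when the substring is not found.
sub find_with_index {
    my ($patterns, $lines) = @_;
    my @hits;
    for my $line (@$lines) {
        my ($seq) = split /\t/, $line;    # sequence column only
        for my $pat (@$patterns) {
            push @hits, "$pat\t$line" if index($seq, $pat) >= 0;
        }
    }
    return @hits;
}

print "$_\n" for find_with_index(\@patterns, \@lines);
```

Whether this beats a single precompiled alternation depends on the number of search strings, so it is worth benchmarking on the real data.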

    You might also consider storing the contents of fh2 in one large string (joined by an appropriate separator); this is also likely to be faster than looping over the content of fh2.
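One possible reading of the large-string idea, sketched with a newline as the separator (an assumption; any character that cannot occur in a sequence would do). The helper `find_in_blob` is hypothetical. It searches the joined string with index and then recovers the enclosing line from the neighbouring separators.

```perl
use strict;
use warnings;

# Hypothetical sample data taken from the question.
my @patterns = qw(ATCCCACCGCTGCCACCA GTCCCACCAGAGTCGCCA);
my @lines    = (
    "AACCCCATCCCACCGCTGCCACCA\t1",
    "AACCCCATCCTCGTCGCC\t1",
    "AACTGGATTCTCTGAAATCCCACCGCTGCCACCA\t1",
);

# Search one joined string instead of looping over the lines.
# Returns a list of [pattern, matching line] pairs.
sub find_in_blob {
    my ($patterns, $lines) = @_;
    my $blob = join "\n", @$lines;
    my @hits;
    for my $pat (@$patterns) {
        next unless length $pat;    # an empty pattern would match everywhere
        my $pos = 0;
        while (($pos = index($blob, $pat, $pos)) >= 0) {
            # Line boundaries around the match position.
            my $start = rindex($blob, "\n", $pos) + 1;
            my $end   = index($blob, "\n", $pos);
            $end = length($blob) if $end < 0;
            push @hits, [$pat, substr($blob, $start, $end - $start)];
            $pos = $end;            # continue with the next line
        }
    }
    return @hits;
}

print join("\t", @$_), "\n" for find_in_blob(\@patterns, \@lines);
```

Each line is reported at most once per pattern, because the search resumes at the end of the line that just matched.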

Re: finding sequence
by tybalt89 (Monsignor) on Jun 22, 2018 at 23:09 UTC
    #!/usr/bin/perl
    # https://perlmonks.org/?node_id=1217240
    use strict;
    use warnings;

    open my $patterns, '<', \<<END;
    ATCCCACCGCTGCCACCA
    ACCCTGCTCGCTGCGCCA
    TCCCCGGCACCTCCACCA
    TCCCCGGCATCTCCACCA
    ATCCTGCCGACTACGCCA
    TCGATTCCCGGCCCATGCACCA
    TCGATTCCCGGCCAACGCACCA
    GTCCCACCAGAGTCGCCA
    ACCCCACTCCTGGTACCA
    GTCCCTTCGTGGTCGCCA
    END

    open my $fh, '<', \<<END;
    AACCCCATCCCACCGCTGCCACCA	1
    AACCCCATCCTCGTCGCC	1
    AACCCCATGAAATAAGAG	2
    AACCCCATGATCAGGACAAG	1
    AACCCCATTAAAAAATGG	1
    AACTGGATTCTCTGAAATCCCACCGCTGCCACCA	1
    AACTGGATTGTCTGTTTGT	1
    AACTGGCAAGTTCAGGCATG	1
    AACTGGCACACACAACC	1
    AACTGGCACACACAACCT	1
    END

    my $seqs = join '|', map tr/ACGT//cdr, <$patterns>;
    my $match = qr/^\w*($seqs)/;
    /$match/ and print "$1\t$_" while <$fh>;

    Outputs:

    ATCCCACCGCTGCCACCA	AACCCCATCCCACCGCTGCCACCA	1
    ATCCCACCGCTGCCACCA	AACTGGATTCTCTGAAATCCCACCGCTGCCACCA	1
