twaddlac has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am having trouble getting a regex match to work. I can't see why it fails, and my head is spinning over it. Could you look at my code and see whether anything looks strange? Thank you very much!


#! usr/bin/perl -w
use warnings;
use strict;
use Bio::Seq;
use Bio::SeqIO;

#print "What is the filepath of the input file?\n";
my $file = $ARGV[0];
#print "What is the name of the output file?\n";
my $output = $ARGV[1];
my $trash = $ARGV[2];

my $seq_in = Bio::SeqIO->new(-file => $file, -format => "fasta");
open OUTPUT, ">$output";
open TRASH, ">$trash";

my $temp_seq_counter = 0;
my $match_counter = 0;
my $seq;
my $tag_id;
my $position;
my $tag_name;

while(my $seq_obj = $seq_in->next_seq){
    $temp_seq_counter += 1;
    my $temp_seq = $seq_obj->seq;
    my $temp_seq_name = $seq_obj->id;
    my $tag_file = Bio::SeqIO->new(-file => "</home/Alan/Desktop/sequence_data/INFLUENZA_01_07_2010/MIDS.fasta", -format => "fasta");
    while(my $tag_obj = $tag_file->next_seq){
        my $tag = $tag_obj->seq;
        print "tag = ",$tag,"\n";
        my $RC_tag = reverse $tag =~ tr/ACTGactg/TGACtgac/;
        if($temp_seq =~ m/$tag/g || $temp_seq =~ m/$RC_tag/g){
            $position = pos($temp_seq);
            my $length = length($temp_seq);
            $seq = $seq_obj->subseq($position,$length);
            $tag_name = $tag_obj->id;
            $match_counter = 1;
        }
    }
    if($match_counter == 1){
        print OUTPUT ">",$temp_seq_name," Tag: ",$tag_name," ending at ",$position,"\n",$seq,"\n";
    }
    else{
        print TRASH ">",$temp_seq_name," ",$seq_obj->desc,"\n",$temp_seq,"\n";
    }
    $match_counter = 0;
}
print $temp_seq_counter," seqeuences were tested and ",$match_counter," seqeuences have a tag.\n";
close OUTPUT;
close TRASH;

Sample input:
>GJVIMO101AUT0H length=45 xy=0234_0223 region=1 run=R_2010_07_01_11_09_50_
ACGACACGTATACGTGCGTGTCGCGTCTCTCAGCACACAGAGTAG
>GJVIMO101ANKZK length=45 xy=0151_1902 region=1 run=R_2010_07_01_11_09_50_
ACGACACGTATCGCGCGCGNGCGCGCGCGCGCGCGCGCGCGCGCG
>GJVIMO101AOIE9 length=41 xy=0162_0179 region=1 run=R_2010_07_01_11_09_50_
ACGACACGTATCTCATTGTGCTCAAGGCCTGAGCACAATGA
>GJVIMO101ALCLG length=100 xy=0126_0114 region=1 run=R_2010_07_01_11_09_50_
ACGACACGTATGCTGCTGGTGCTGCTGTAACAGTTCCTGCTGATGCTGCAAGTGCTGCTG
CTGTAACTGTTGCTGCTGTAATCTCTGCTGCTGCTGCTGT

Pattern to match:
ACGACACGTAT

Re: Regex Match Problem
by umasuresh (Hermit) on Aug 24, 2010 at 17:55 UTC
    I use the following snippet from http://www.perlmonks.org/?abspart=1;displaytype=displaycode;node_id=308179;part=1 for parsing FASTA files. Unfortunately, I can't find the script's author. I modified it slightly to try your problem:

    use strict;
    use Data::Dumper;

    my ($counter, $line_count, @rec);
    {
        local $/ = '>';
        while (<DATA>) {
            s/^>//g;                        # strip out '>' from beginning
            s/>$//g;                        # and end of line
            next if !length($_);            # ignore empty lines
            my ($header_info) = /^(.*)\n/;  # capture the header
            s/^(.*)\n//;                    # and strip it out
            push @rec, $header_info;
            s/\n//mg;                       # join the sequence strings
            $counter++ if $_=~/ACGACACGTAT/;
        }
    }
    $line_count = scalar @rec;
    print "$line_count seqeuences were tested and $counter seqeuences have a tag\n";
    __DATA__
    >GJVIMO101AUT0H length=45 xy=0234_0223 region=1 run=R_2010_07_01_11_09_50_
    ACGACACGTATACGTGCGTGTCGCGTCTCTCAGCACACAGAGTAG
    >GJVIMO101ANKZK length=45 xy=0151_1902 region=1 run=R_2010_07_01_11_09_50_
    ACGACACGTATCGCGCGCGNGCGCGCGCGCGCGCGCGCGCGCGCG
    >GJVIMO101AOIE9 length=41 xy=0162_0179 region=1 run=R_2010_07_01_11_09_50_
    ACGACACGTATCTCATTGTGCTCAAGGCCTGAGCACAATGA
    >GJVIMO101ALCLG length=100 xy=0126_0114 region=1 run=R_2010_07_01_11_09_50_
    ACGACACGTATGCTGCTGGTGCTGCTGTAACAGTTCCTGCTGATGCTGCAAGTGCTGCTG
    CTGTAACTGTTGCTGCTGTAATCTCTGCTGCTGCTGCTGT
    >GJVIMO111ALCLG length=100 xy=0126_0114 region=1 run=R_2010_07_01_11_09_50_
    GCTGCTGGTGCTGCTGTAACAGTTCCTGCTGATGCTGCAAGTGCTGCTG
    CTGTAACTGTTGCTGCTGTAATCTCTGCTGCTGCTGCTGT

Re: Regex Match Problem
by mwah (Hermit) on Aug 24, 2010 at 19:07 UTC

    I think you didn't get the reversal of the tag sequence right.

    I modified this to:

    print "tag \t= $tag\n";
    my $RC_tag = reverse($tag);
    print "RC_tag \t= $RC_tag\n";
    $RC_tag =~ tr/ACTGactg/TGACtgac/;
    print "RC_tag \t= $RC_tag\n";

    This would print:

    tag     = ACGACACGTAT
    RC_tag  = TATGCACAGCA
    RC_tag  = ATACGTGTCGT

    Which one would be correct?

    In your original code,

    my $tag = $tag_obj->seq;
    print "tag = ",$tag,"\n";
    my $RC_tag = reverse $tag =~ tr/ACTGactg/TGACtgac/;

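    $RC_tag would be the number '11', which probably isn't what you want. The =~ binds more tightly than reverse, so tr/// runs first on $tag (complementing it in place) and returns the number of characters it translated; reverse then acts on that count rather than on the sequence.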

    Regards

    mwa
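
The precedence problem described above can be reproduced in isolation. The following is a minimal sketch in plain Perl, without BioPerl. The helper names revcomp and tag_end are hypothetical and do not appear in the thread, and the test sequence is the third record of the sample input. It assumes the goal is to keep whatever follows the first occurrence of the tag or of its reverse complement.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): reverse complement of a DNA string.
sub revcomp {
    my ($seq) = @_;
    my $rc = reverse $seq;           # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;    # complement the reversed copy in place
    return $rc;
}

# Hypothetical helper: 0-based offset just past the first occurrence of the
# tag or of its reverse complement; returns undef if neither is found.
sub tag_end {
    my ($seq, $tag) = @_;
    for my $pattern ($tag, revcomp($tag)) {
        return $+[0] if $seq =~ /\Q$pattern\E/;   # $+[0] = end of the match
    }
    return;
}

my $tag = 'ACGACACGTAT';
my $seq = 'ACGACACGTATCTCATTGTGCTCAAGGCCTGAGCACAATGA';

# The original expression: tr/// runs first, modifies its target in place,
# and returns a count, which is what reverse then receives.
my $copy  = $tag;
my $wrong = reverse $copy =~ tr/ACTGactg/TGACtgac/;
print "wrong   = $wrong\n";               # prints 11
print "copy    = $copy\n";                # prints TGCTGTGCATA (complemented, not reversed)
print "revcomp = ", revcomp($tag), "\n";  # prints ATACGTGTCGT

my $end = tag_end($seq, $tag);
print "rest    = ", substr($seq, $end), "\n" if defined $end;
```

Quoting the tag with \Q...\E is only a precaution in case a tag ever contains regex metacharacters. Reading the match end from $+[0] avoids the /g flag and pos(), so no match position carries over from one tag to the next.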