lomSpace has asked for the wisdom of the Perl Monks concerning the following question:

Hello!
I am trying to process records in a file based on a regex.
I have set my record separator to '>'. The new file will have
the same number of records as the input data. The only difference
will be the substitution and transliteration applied to the records
that the regex identifies.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

=cut
Open fasta with multiple fasta sequences.
Select those sequences based on an identifier and then reverse complement them.
=cut

# Set record separator
$/ = '>';

#open(my $in, "C:\Documents and Settings\mydir\Desktop\rev_comp\13414_fasta");
open(my $out, ">C:/Documents and Settings/mydir/Desktop/rev_comp/13414_fasta_rev_comp");

while(<DATA>){
    if(my $line =~m/LacZ|SD/){
        next $line;
        my $revcom = reverse $line;
        # Next substitute all bases by their complements,
        # A->T, T->A, G->C, C->G
        $revcom =~ s/A/T/g;
        $revcom =~ s/T/A/g;
        $revcom =~ s/G/C/g;
        $revcom =~ s/C/G/g;
        # Make a new copy of the DNA
        $revcom = reverse $line;
        # The Perl translate/transliterate command:
        $revcom =~ tr/ACGTacgt/TGCAtgca/;
        #print Dumper($/, $line);
        #}
    }
}
#close $in;
close $out;
__DATA__
>AM_13414L3_LacZ.SEE.rc_G01_2009-05-01.ab1 1368 0 1368 ABI
TTTTTCCCCCAACAAAGGGGAGGGTGGGCGGCTAGTCTGTTCAGCTGTGT
CACACCGGGATTCTCCCAATCTCTCCTCTGCAGGACCACTGGATCATTTA
AATCGGTACCCATCTTCTTAGTGGGCAGACCCAGCTGGCCTTCAGACTGC
TTGCTGTTCCTGGCCCGGTCTTGCTATTTATACATGTAAGAGGATCAGGA
AGTCCCTGGGGTACAGCTCATAATGCCCTCCTTTGACTACATAACACCCA
ACATGCTAGTTCTAAGAGAGGGAACAGTGTGCAGTGGGAAGTGGAGGGCA
AAGGTGACTTGGGGCTTTCCAAAGTTCAAATTGATTCAGAGAGAGTAAAT
ATTTCCAGAAGGATTTCTCCTTTTATAAAATTCATTCACTCCTTTAGCTC
TGACCACAGGGTGGGAGTGAGGGATCCTTCTAGACCCCTGATGAGAGGTT
AGCTTGGAGGACGCTGGCTTATGCTCATTGACAGCTGACCGACAGATATA
GATTATAAAAGTAAACTTATATGTCTTGCCAGAGATATATAAAATTGTTG
TCAACTCCTTCTTTAATTATTTTTCTTTAATTTTTAAAGATTTATTTTAT
ATCCATGTTTTGCCTGCATGTGTGTATGTCTACCACATACATGCAGTGCT
GTGCAGGTCAGAAGAGGGTGTTAAATTCCCTGGTACTAGAGTTACAGATG
GTTGTGAGCCATCATGTGGATGCTGAGAACTGAAGCCAGCAAGTGTCCTT
AACCGCTGAGCCAACTCTCCAGCCCCTTTAGATATTTTTAATATACTTTA
ACATCAGAGGAAAAAAAAATCTTTAGAACGTCTGTCAGAAGAAACATCTA
AGGCTGGTTGGGGTGGTGTTCACCACTTGGTGTCAGCACTTGGGAGCCAG
AGGCAGGTGTGTGTGTGTGTTTGAGGCCAGTCTGGTCTACACACTCAGTT
ATCCAATCTCCGTGAGTTTGTGAATGTTTGCTGTTCATTTGGGGTTTTAG
TCTGATGTGGTCAAATAGAATAGGAAGAGAGGGCTAAAGACCCACCTTAC
TGGTTTAAAGCACTTGTTGCTTTTTTAAAAAACCAAGTTTAATTCTTTCG
GAGTTTCATTAGCCCTTTTTCTATTAGGGAGGGACCCCTTTTTTCTTGAT
TTATAAAGGACCCCTTTGCTTGGCAATTCTGTTTTTGGGCTGGAGGGTCC
AGGTTTTCCAAACTTTGGGAAATGCCTTTCCACCCTTTCTGTTCCCCTGA
TGGACAATTTCCTGCCCCATGAATTTAATGGGTTTCTCTTTTATGGCTTT
TTAAACATTTTTTTTTTGTTTTTTAAAAACTTTTTTCCTTTTAAACTTTT
TATTTTATAATTTGAAAA
>AM_13414L3_SD_F01_2009-05-01.ab1 1397 0 1397 ABI
AATTTAAAGCATACTGTAAATACTACTAACTAAAGGGCAAAATAGGGCAT
CAGTTTTCTTTGGAATTGGAATTATAGATAGTTTGAGCTGCCATCTAAGT
GGGAATTGAACCCAGGTCCTCTGGAAGAGCAGCAGGTGCTCTTAACCACC
AAGCCATCTCTCCAGACCTTGCCCATTTATCTCAATCAAATATTATGTGT
AGTCATTGAGGTCAGCTTCAGACCTTCCAGGCATCTGAGTTTTCAGATGA
CTGGGGTTGGCACAGACAAGTTTCCCCTCTGTGACAAAGCCAGATATGCC
ACTTTAAAGTGGAACAGAAAAAAAAATGTTTATATACCTATAAAAATAAA
CACTTAGAGCCACTTAGGTGGTCACTGGGGAAGACCAAAGAAAGTAGCTG
GCAGTTCACACCCTTCTCTGCTAGCATAACTTCGTATAGCATACATTATA
CGAAGTTATCTAGGGGCTGCAGGTCGAGGTCTGATGGAATTAGAACTTGG
CAAAACAATACTGAGAATGAAGTGTATGTGGAACAGAGGCTGCTGATCTC
GTTCTTCAGGCTATGAAACTGACACATTTGGAAACCACAGTACTTAGAAC
CACAAAGTGGGAATCAAGAGAAAAACAATGATCCCACGAGAGATCTATAG
ATCTATAGATCATGAGTGGGAGGAATGAGCTGGCCCTTAATTTGGTTTTG
CTTGTTTAAATTATGATATCCAACTATGAAACATTATCATAAAGCAATAG
TAAAGAGCCTTCAGTAAAGAGCAGGCATTTATCTAATCCCACCCCACCCC
CACCCCCGTAGCTCCAATCCTTCCATTCAAAATGTAGGTACTCTGTTCTC
ACCCTTCTTAACAAAGTATGACAGGAAAAACTTCCATTTTAGTGGACATC
TTTATTGTTTAATAGATCATCAATTTCTGCAGACTTACAGCGGATCCCCT
CAGAAGAACTCGTCAAAGAAGCGATAGAAGGCGATGCGCTGCGAATCGGG
AGCGGCGATACCCGTAAGCACGAGGAAACGGTCAGCCCATTCGCCGCCAA
GCTCTTCAGCAATATCACGGGTAGCCAACGCTATGTTCTGATAGCGGTCC
CCCACACCCAGCCGGCCACAGTCGATGAATCCAGAAAAACGGGCCTTTTT
CACCCTGAATATCGGCAAGCAGGCATTCGCCTGGGGTAACGACGAGTTCC
TTCGCCGTCGGGCATGCCCGCCCTTGAGCCCGGGCGAACAGTTTCGGCTG
GCCCCGAGCCCCCTGATGCTTCTTTCTTCCAAATTCATCCTGGTTCAAAC
AGAACCCGGCTTTCCCATCCCCAATAACCTGGCCTTCCTTTCGGATGCGG
AATGTTTTTCCCTTTGGGGGGGTCAAAAAGGGGGCACGGGGAGCCCN
>AM_13414L3_SU_E01_2009-05-01.ab1 1447 0 1447 ABI
CTCCAGCCTACCCTCTATCCAGGGGNTCTAGAGGATCCCTCACTCCCACC
CTGTGGTCAGAGCTAAAGGAGTGAATGAATTTTATAAAAGGAGAAATCCT
TCTGGAAATATTTACTCTCTCTGAATCAATTTGAACTTTGGAAAGCCCCA
AGTCACCTTTGCCCTCCACTTCCCACTGCACACTGTTCCCTCTCTTAGAA
CTAGCATGTTGGGTGTTATGTAGTCAAAGGAGGGCATTATGAGCTGTACC
CCAGGGACTTCCTGATCCTCTTACATGTATAAATAGCAAGACCGGGCCAG
GAACAGCAAGCAGTCTGAAGGCCAGCTGGGTCTGCCCACTAAGAAGATGG
GTACCGATTTAAATGATCCAGTGGTCCTGCAGAGGAGAGATTGGGAGAAT
CCCGGTGTGACACAGCTGAACAGACTAGCCGCCCACCCTCCCTTTGCTTC
TTGGAGAAACAGTGAGGAAGCTAGGACAGACAGACCAAGCCAGCAACTCA
GATCTTTGAACGGGGAGTGGAGATTTGCCTGGTTTCCGGCACCAGAAGCG
GTGCCGGAAAGCTGGCTGGAGTGCGATCTTCCTGAGGCCGATACTGTCGT
CGTCCCCTCAAACTGGCAGATGCACGGTTACGATGCGCCCATCTACACCA
ACGTGACCTATCCCATTACGGTCAATCCGCCGTTTGTTCCCACGGAGAAT
CCGACGGGTTGTTACTCGCTCACATTTAATGTTGATGAAAGCTGGCTACA
GGAAGGCCAGACGCGAATTATTTTTGATGGCGTTAACTCGGCGTTTCATC
TGTGGTGCAACGGGCGCTGGGTCGGTTACGGCCAGGACAGTCGTTTGCCG
TCTGAATTTGACCTGAGCGCATTTTTACGCGCCCGGAGAAAACCGCCCTG
CGGTGATGGTGCTGCGCTGGAGTGACGGGCGTTATCTGGAAGATCAGGAT
ATGTGGCGGATGAGCGGCATTTTTCCGTGACGTCTTGTTGCTGCATAAAC
CGACTACCCAAATCAAACGATTTCCATGTTGCCACTCGCTTTAAATGATG
ATTTTCACCCCGCCTGTACTGGAGGCTGAAATTTCAAAATGGCGGGGAGT
TGCGGGACTACCCTCCGGGTAAACAGTTTCTTTTATGGCAGGGGTGAAAA
CCCAAGGCCGCCCACCGGCCCCGCGGCCCTTTTCGGCCGGGGAAAATTAT
CCGATGAAGCGGGGTGGTTTATTGCCCAATCCGCGTCCAACCTACCTTCT
GAAAAGGCCCAAAAACCCCGAAAACTGGTGGAGCCCCCCAAAAATTCCCC
AAAATTTTTTTTTCTTTGGGGGGGGGGGTTGAAACCTGCACCCCCCCCCC
CCCAACGGGCACCCCTTTTTATTTTGAAAAAACCAAAAAACCCCTGCCCG
ACTGCTCCCCGGGTTTTTTCCCCCGCGGGAGGAGGGGGCCGGAGAAA
>AM_13414L3_pgK.Neo.2fw_H01_2009-05-01.ab1 1387 0 1387 ABI
AAGTTCTAATTCATCGNANCTCGCCTGCAGCCCCTAGATAACTTCGTATA
ATGTATGCTATACGAAGTTATGCTAGCAGAGAAGGGTGTGAACTGCCAGC
TACTTTCTTTGGTCTTCCCCAGTGACCACCTAAGTGGCTCTAAGTGTTTA
TTTTTATAGGTATATAAACATTTTTTTTTCTGTTCCACTTTAAAGTGGCA
TATCTGGCTTTGTCACAGAGGGGAAACTTGTCTGTGCCAACCCCAGTCAT
CTGAAAACTCAGATGCCTGGAAGGTCTGAAGCTGACCTCAATGACTACAC
ATAATATTTGATTGAGATAAATGGGCAAGGTCTGGAGAGATGGCTTGGTG
GTTAAGAGCACCTGCTGCTCTTCCAGAGGACCTGGGTTCAATTCCCACTT
AGATGGCAGCTCAAACTATCTATAATTCCAATTCCAAAGAAAACTGATGC
CCTATTTTGCCCTTTAGTTAGTAGTATTTACAGTATTCTTTATAAATTCA
CCTTGACATGACCATCTTGAGCTACAGCCATCCTAACTGCCTCAGAATCA
CTCAAGTTCTTCCACTCGGTTTCCCAGCGGATTATAAGTGGATAAACTGT
GAGAGTGGTCTGTGGGACTTTGGAATGTGTCTGGTTCTGATAGTCACTTA
TGGCAACCCGGGTACATTCAACTAGGATGAAATAAATTCTGCCTTAGCCC
AGTAGTATGTCTGTGTTTGTAAGGACCCAGCTGATTTTCCCACCACCCCT
CCATCAGTAAGCCACTAATAAAGTGCATCTATGCAGCCACAGGTCTGTCT
GCCTCTTTTGCTTCAGTTTCCTAGGACTATGGGCTGAAATTGGGCTGTTA
GGGAGAAAGCATCTCACTCGCTTTTATTGAATCTGCAGTGGAAAAGAAAC
AGAGGGAGTCAGGTAACTTTGAATATTTTCTTCAAAACAAAAGATATCAT
GGTACAATTTTTTTTAAATTTTTTGTTTGTTTGTTTTTGTTTTTCGAGAC
AGGGTTTCTCTGTGTAGCCCTGGCTGTCCTGGAACTCACTCTGTAGACCA
AGTTGGCCTCCAACTCAGAAATCCGCCTGCCTCTGCCTCCTGAGTGCTGG
GATTAAAGGCGTGCGCCCCCACCCCCCCGCCCCATGGTCAATTTTTAAAT
TTTCCCAAAAATTATTTTTTCCCAAGGTAGACTTCTTTTTAAAGGTGGTT
TTTTTACCCCCTTTTGAAAAGAAAACATTAAAGGGGATTCTTCCAAAATT
TTGTGAAAGTTTTCCCCGTTTCGAATAAAAAACCCCCCTTTTCCTTTTCC
GGGGATCTCCACCCTGGGTGACACTTGGTTTTTTTTACCCCCCCCCCCCT
GGCCGGTTTTTTTTTTACCTGGGGGGCCTTGGGTTTA
Any direction would be of great help!

Replies are listed 'Best First'.
Re: processeing records in a file
by ig (Vicar) on Jun 04, 2009 at 22:15 UTC

    I am not confident I understand what you are trying to do, but perhaps the following helps.

    Your REs seem redundant with your transliteration (tr), and they don't do what I guess you want them to do. In particular, you change all A's to T's, but then you change all T's to A's, after which there will be an A everywhere there was either an A or a T. The transliteration will only substitute each letter once, which I guess is what you want. I also don't understand why you reverse the string twice, which restores the original order.
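
    To make the points above concrete, here is a minimal sketch and not a drop-in fix. It assumes each record is a header line followed by sequence lines, that records matching the pattern should be reverse complemented while the rest pass through unchanged, and that output is wrapped at 50 bases per line as in the posted data. The sub names (revcomp, chained_subst, revcomp_matching) are made up for illustration.

```perl
use strict;
use warnings;

# Reverse complement: reverse once, then swap every base in a single tr/// pass.
sub revcomp {
    my ($dna) = @_;
    my $rc = reverse $dna;              # scalar context reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;       # each character is mapped exactly once
    return $rc;
}

# The chained substitutions from the question, kept only for comparison:
# the second s/// undoes the first, and the fourth undoes the third.
sub chained_subst {
    my ($dna) = @_;
    $dna =~ s/A/T/g;
    $dna =~ s/T/A/g;
    $dna =~ s/G/C/g;
    $dna =~ s/C/G/g;
    return $dna;
}

# Read '>'-separated records from a filehandle and return the rewritten FASTA.
# Only records whose header matches $pattern are reverse complemented.
sub revcomp_matching {
    my ($fh, $pattern) = @_;
    local $/ = '>';                     # one FASTA record per read
    my $out = '';
    while (my $record = <$fh>) {
        chomp $record;                  # removes the trailing '>' separator
        next unless length $record;     # text before the first '>' is empty
        my ($header, @seq_lines) = split /\n/, $record;
        my $seq = join '', @seq_lines;
        $seq =~ s/\s+//g;               # drop stray whitespace inside the sequence
        $seq = revcomp($seq) if $header =~ $pattern;
        $out .= ">$header\n";
        $out .= "$1\n" while $seq =~ /(.{1,50})/g;   # wrap at 50 bases (assumed)
    }
    return $out;
}

# Small demonstration on made-up records read from an in-memory filehandle.
my $fasta = ">seq1_SD\nAACG\n>seq2_SU\nAACG\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory handle: $!";
print revcomp_matching($fh, qr/LacZ|SD/);
close $fh;

print chained_subst('AACG'), "\n";      # prints AAGG, not a complement
print revcomp('AACG'), "\n";            # prints CGTT
```

    Note that the match is made against the header only, so a sequence that happens to contain the letters SD cannot trigger it. To run this against the posted data, pass \*DATA (or a handle opened on the input file) instead of the in-memory handle and print the result to $out.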

Re: processeing records in a file
by dHarry (Abbot) on Jun 05, 2009 at 10:02 UTC

    Can't you use one of the existing CPAN modules to handle the FASTA stuff? (BioPerlTutorial seems like a good starter.) Selecting certain sequences and reverse complementing them sounds like standard functionality i.e. a method call.
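
    As a rough sketch of what that could look like with BioPerl's Bio::SeqIO: this assumes BioPerl is installed, that the selection is done on the sequence ID, and that non-matching records should be written out unchanged. The sub name revcomp_selected is hypothetical; next_seq, display_id, revcom and write_seq are the BioPerl methods being relied on.

```perl
use strict;
use warnings;
use Bio::SeqIO;    # part of the BioPerl distribution on CPAN (assumed installed)

# Hypothetical helper: copy FASTA records from $in_fh to $out_fh,
# reverse complementing those whose ID matches $pattern.
sub revcomp_selected {
    my ($in_fh, $out_fh, $pattern) = @_;
    my $in  = Bio::SeqIO->new(-fh => $in_fh,  -format => 'fasta');
    my $out = Bio::SeqIO->new(-fh => $out_fh, -format => 'fasta');
    while (my $seq = $in->next_seq) {
        # revcom returns a new sequence object; the original is left untouched
        $seq = $seq->revcom if $seq->display_id =~ $pattern;
        $out->write_seq($seq);
    }
    return;
}
```

    Called with a handle opened on the input file, a handle opened for writing, and qr/LacZ|SD/, this would cover the selection and the reverse complement without a custom record separator.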