perl_n00b has asked for the wisdom of the Perl Monks concerning the following question:

Hey guys I'm back with another question. I'm trying to split a fasta into individual .seq files. So far I have this...
use strict;
use warnings;

$/ = ">";
my $fastafile = 'j:\summer\begomo_genomes2.fasta';
my ($OUT, $IN);
print "Input: $fastafile\n";
open my $ifh, "<", $fastafile or die "cannot open $fastafile: $!\n";
while (my $chunk = <$ifh>){
    $chunk = lc $chunk;
    my ($accession) = $chunk =~ /gi\|(\d+)/;
    my ($acronym) = $chunk =~ /\|\s+(.*),/;
    $acronym =~ s/\[|\]|\:|\d+|\-\///g;
    $acronym =~ s/\s/_/g;
    $acronym =~ s/dnaa/dna_a/g;
    $acronym =~ s/dnab/dna_b/g;
    my ($sequence) = $chunk =~ /([a|c|t|g]+)\s+/;
    my $outfile = "j:\\summer\\begomo\\${accession}_${acronym}.seq";
    open my $ofh, ">", $outfile or die "cannot open $outfile: $!\n";
    print "Creating $outfile\n";
    print $ofh "$accession $acronym\n^^\n$sequence";
    close $ofh;
}
One of my problems is this line: my ($sequence) = $chunk =~ /([a|c|t|g]+)\s+/; I know it isn't right, but I have tried a lot of different syntaxes to no avail. What's the best way to parse the sequence out?

My other problem is that $chunk is only the first line. Why is it doing this?

Here is a sample of the file I am working with...
>gi|9626081|ref|NC_001359.1| Pepper huasteco yellow vein virus DNA A, +complete sequence GGCCATCCGTTATAATATTACCGGATGGCCGACCGCTTACCTTATCTATCCGTACTGCTTTATTTGAATT AAAGATGTTACTTTTATGCTATCCAATGAAGCGTAGCGTCTGGGAAGCTTAGTTATCAGTTCCAGACGTG GGGACCAAGTAGTGTATGACCACTTTATTGACTGTCAGCTTTATAAATTGAAATTAAAACATAAGTGGTC CATGTACCTTTAATTCAAAATGCCTAAGCGTGATGCTCCTTGGCGATTAACGGCGGGGACCGCCAAGATT AGCCGAACTGGCAATAATTCACGGGCTCTTATCATGGGCCCGAGTACTAGCAGGGCCTCAGCTTGGGTTA ATCGCCCAATGTACAGGAAGCCCCGGATTTATCGTATGTACAGAACTCCGGATGTGCCGAAAGGTTGTGA AGGTCCCTGTAAGGTTCAATCGTTTGAACAACGACATGACGTCTCTCATGTTGGTAAGGTTATTTGTATA TCCGACGTAACTCGTGGTAATGGTATTACCCATCGTGTTGGCAAACGATTCTGCGTTAAGTCTGTCTATA TTCTGGGCAAAATCTGGATGGATGAAAATATTAAGTTGAAGAACCATACCAACAGTGTCATGTTTTGGTT GGTTAGGGATAGGAGACCCTACGGTACGCCTATGGATTTTGGCCAAGTCTTTAACATGTATGACAACGAG CCCAGTACCGCTACTGTGAAGAACGATCTTCGGGATCGTTATCAAGTTATGCATAGATTCTATGCTAAGG TCACTGGTGGGCAATATGCAAGCAACGAGCAAGCCTTGGTTAGGCGTTTCTGGAAGGTGAACAACCATGT TGTGTATAACCATCAAGAAGCTGGGAAATATGAGAACCACACGGAGAATGCGCTGTTATTGTATATGGCA TGTACTCATGCATCTAATCCCGTGTATGCAACACTCAAAATTCGGGTCTATTTTTATGACTCGATAATGA ATTAATAAAGTTTGTATATTATTTCATGATTCTCAAGTACAGCATTGACATAACGTTTGTCTGTAGCAAA CGAAACAGCCCTAATTACATTGTTAACTGAAATAAGACCTAAGTTATCTAGATAAAACATGACAAGCAAT TTAAATCTATTTAAGTAAATCTGCCCAGAAATCGTCGTCAACGTCGTCCAGACTTGGAAGTTGAAGTAGG CTTTGTGGAGACCCAACGCTGTCCTCATGTTGTGGTTTGCTCTGACTTGAATGTGAAATACCCGGCTGCG TGTGTACATTGGCGTCTCCACTAGCCGTATTTTGAAATATAGGGGATTTCGAAGCTCCCAGATAAAAACG CCATTCGCTGCTTGAGCTGCAGTGATGGGTACCCCGGTGCGTAAATCCATTGTTAACACAGTTAATATGT ATATAAATTGAACAGCCGCAAGCGAGATCAATCCTTCTACGTCGTATCTGTCTCTTTGCAAATCTATGGC GAAGTTTGACTTCCGGTGGTGAAGATAGCTTCTTCGATGGTGACGTAGATGGCGTTTTTTTGGACCCAGT CATTGAGGCTCCTATTTTTCTCTTCGCTGAGGTAGTCTTTATAGGAAGACTGGGGGCCCGGATTGCAAAG AAAGATAGTGGGTATCCCACCTTTAATTTGAATTGGTTTGCCGTATTTGCAGTTTGATTGCCAATCCCGT TGGGCCCCCATAAACTCTTTAAAGTGCTTAACATAATGCGGAGGGATGTCATCAATGACGTTATACCATG CATTATTGGAGTAGATTTTTGGGCTGAGATCCATATGACCACATATGTAATTGTGTGGGCCGAGACTTCG 
GGCCCATAATGTTTTGCCTGTCCGTGAAGGACCCTCGACCACTAATGACAATGGTCTCATTGGCCGCGCA GCGGCATCACATACATTATCAGACACCCATTGTGTCATTATTGCAGGCACATTATTAAAGGACGCCTGTT GAAATGGAGGAACCCACGGTTCCGGGGGAGTTTGGAATATCCGATTAGCGTTTGACACAATGTTATGAAA TTGGAGGAAGAAATGCTGAGGTTGTTCTTCCTTTATGATCTGCAGAGCTTCTTCTGCAGATGCTGAATTT AACGCCTTAGCATATGTGTCATTAGCAGACTGCTGTCCTCCTCTAGCAGATCTGCCGTCTATTTGGAATT CTCCCCATTCTACGGTATCGCCGTCTTTGTCGATGTACGTCTTGACGTCGGAGCTTGATTTAGCTCCCTG AATGTTCGGATGGAAATGTGCTGATCTGGTAGAGGATACGAGGTCAAAGAATCGGTTGTTCGTGCATTGG TATTTTCCTTCGAACTGAATAAGCACGTGCAGATGAGGTTGCCCATCTTCATGAGATTCTTTGCAAATTT TGATGTACTTCTTGTTTACCGGCGTCGAGAGGTTTTGTAGTTGAGCGAGACGCTCTTCTTTGGAAATGGA ACATTGTGGATAGGTGAGGAAATAATTCTTGGCATTTAAACGAAATCGTTTAGGTAATGGCATATTTGTA ATAAGAGAGGTGTACACCGATTGGAGCTCTTTAACCTGGGCTTATTGTATCGGTGTATTGGTAGCCAATA TATAGTATATGGGAGTTATCTAGGATCTTCGTACACGTGAG >gi|9626131|ref|NC_001369.1| Pepper huasteco yellow vein virus DNA B, +complete sequence GGCCATCCGTTATAATATTACCGGATGGCCGACCGCTTCCACTCTCTTTCCTTTGGGACAGCTGGCGCGC ACTATGTATTATGTTTACGTGGCATCATGTGGGTCGTTGGATGAATTCAATCGCGCGCCTTCATTTCAAA TTAAAGTGTGTGTCCATACATCGAGAAATGTGTAATGACGTGGAGCGTTCTCCACCATTCCTGAATCGTT AGATAATTGTTTGACCAGGACCACAGCTGTCATTTGGGACCACACGTCCTTTGGGACCACCACTATAATG ATAATGTTTCCTGTTATTGCGGTCCACGTGGTCCAATTAAATTGCACCTCGCGAGTCTACATATCCACAA TTTTGAATATCCTATTCTATAAAATGGCTTCCATTTTTATATTCAAAATTATATTCACATCTCTTTTAAT ATATATTTATCTTTAAGCAATTTAATATGTATTCTACTAGATTTAGACGTGGGTTATCCTATGTTCCACG GCGTTATAATCCACGTAATTATGGTTTTAAACGTACATTCGTCGTTAAACGTGGTGATGCTAAACGACGT CAGACTCAAGTGAAGAAACTAACAGAAGATGTTAAAATGTCATCACAACGCATCCATGAAAATCAATATG GTCCAGAATTTGTCATGGCGCATAATACAGCAATATCTACATTCATCAATTATCCCCAACTGTGTAAGAC TCAGCCCAATCGTAGTAGGTCATATATTAAGTTAAAATCGTTACATTTTAAGGGAACCTTAAAGATCGAA CGTGTTGGGTCTGAGGTAAATATGGCTGGGTTAAATCCGAAGATTGAGGGTGTGTTTACTGTGGTTTTAG TTGTTGACCGTAAGCCACATTTGAATCCTACTGGTAACTTGCTACAGTTTGACGAGTTATTTGGTGCAAG AATTCACAGTCTAGGGAACTTAGCCGTTACCCCGGCGTTGAAAGAACGGTTCTACATACTGCATGTGTTG AAGCGAGTTATCTCCGTTGAGAAGGATAGTATGATGCTGGACCTAGAAGGATCCACTTGTCTCTCTAGTC 
GGCGTTATAATTGTTGGTCTACATTTAAGGACCTTGATCCTTCGTCATGTAACGGCGTCTATGATAATAT AAGCAAAAACGCCATATTAGTTTATTATTGTTGGATGTCGGATGCTATGTCTAAGGCATCCACATTTGTA TCATTTGATTTGGACTATTTTGGTTAAGAAATAATTGACTTGCGTAGTTTGCTCATATTTGTATTTTGTC ACAAAATAAAATATTATTATCTTAGCGACTTCGGTTGTGTCGGATTACAATTACTGTTAATACATTCATG GACCGTAGTCCTTACAAGCTCATTCAACTGGGCCAAGGACATAGTTATATTTGATTGAGAGCGTGTTAGA CCCACTTGTGATGCTGAATCACCTGGGTCCAAAACACTTCCGCCTAACTGATGAAGATCTTTATACGGAT GTAATGCGCTATGTCCTTGGTTGTCAGCATCTGTGTGAGTGGTTCCTATGGTGCTTCTACAAGCCCAGGA TTCACCTGGTTTTAATTCAATTGGGCCTGTAATGCCGAACCTTGACATGGATGCTGACCTCAATGGTTTT CTCTCCCACCTGCCGTAGTCCACATGTGTAAAGTCCACATCGTTATGGGTGAACTGTTTCGATAAAATCT TCACCGTCGGAGCCCGGAAAGGTATATCCACGGAGTGTTTAGCTGTGGACAACTTCAATTTCCCTTTGAA CTTGGCAAAATGGGTGTTCTGATGTACGTTAGTATCGGAGACTCTGTAATATAGCTTCCAGGGTATGGGG TCCTTCAAGGAGAAGAAGGATGCTGAGAAATAATGGAGATCGATGTTACATCTTAGTGGAAATGTCCAAG AAGCTTGTAATGATTCATTGTCTGTCATTCGTTTGTCATGGATTTCCACTATGACCGACCCAGTGGCGTT TATCGGAACTTGCTGTCTATACTCGATAACGCAATGGTCAATTTTCATACAGCTACGACTAAGTCTGGCA GCGTACTGCGACGCCGTTGACGGAAATTGAAGTATTATCTCCGTTAAGTCATGAGAGAGCTGATATTCAT CTCTATGTGACTCTATATAATTGAATGCGCTAGGAGGATTCGCCAACCATGAATCCATATATGAAAATTT GGCAGCGCACGTGAAGGCTTACGGAGTCTGAATCTGGTAATAAGAAGCTATACCTAACAATGTTAATGGT AATGAAAATGACAAATTACTATTTGCTGAAAGAGTTCAAAAATAAATGCTTACTTAGTTATTAAGATATT GCTATTAGCAGCAACAATATATGAGGAAACCGGTGAGGATGAAAGCAAAAGCGTCTTCAGAAGACAGAGC AGAAAGAATTGGTATGAATAATTAAATGAACAGGCAGTGTCGTTATATAGAAGATCATTGTGTTTTAGAG AGAGAAAATTTTGCAGTGGCATTTGTGTAATATGGAGGGGTACACCGATTGGAGCTCTTTAACCTGGGCT TATTGTATCGGTGTATTGGTAGCCAATATATAGTATATGGGAGTTATCTAGGATCTTCGTACACGTGGA

Replies are listed 'Best First'.
Re: FASTA Splitter
by John M. Dlugosz (Monsignor) on Jun 01, 2009 at 19:38 UTC
    my ($sequence) = $chunk =~ /([a|c|t|g]+)\s+/;
    Well, I think you are confusing the syntax of character classes and alternation. The stuff in the brackets is a set, so you don't use the OR character also.
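
    To illustrate the difference, here is a small sketch (the test string is made up for the example). Inside brackets, '|' is just another literal character, so [a|c|t|g] also matches a pipe, while alternation needs a group:

```perl
use strict;
use warnings;

# Made-up test string containing a pipe between two runs of bases.
my $with_pipe = "ac|tg";

# Inside a character class, '|' is a literal pipe, not "or".
my ($class_match) = $with_pipe =~ /([a|c|t|g]+)/;   # pipe is part of the match

# A plain set of bases stops at the pipe.
my ($set_match) = $with_pipe =~ /([actg]+)/;

# Alternation belongs in a group; here it behaves like the plain set.
my ($alt_match) = $with_pipe =~ /((?:a|c|t|g)+)/;

print "class with pipes: $class_match\n";
print "plain set:        $set_match\n";
print "alternation:      $alt_match\n";
```

    For single characters the plain set [actg] is the usual choice; alternation is more useful when the alternatives are longer than one character.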

    I think your problem is that although you read the whole thing as one record (splitting at the '>'), you forgot that there are still newlines in the string. So you'll match only the first line (as I recall, \n is considered whitespace).
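
    A minimal sketch of one way to deal with the embedded newlines, assuming each record looks like the posted sample (one header line followed by sequence lines). The record string here is a shortened, made-up example, not data from the file:

```perl
use strict;
use warnings;

# Made-up record as it might look after reading with $/ = ">":
# header line, sequence lines, and a trailing ">" left by the separator.
my $chunk = "gi|123|ref|XX_000001.1| Example virus DNA A, complete sequence\n"
          . "GGCCATCCGT\nTATAATATTA\n>";

$chunk =~ s/>\z//;                              # drop the trailing separator
my ($header, $body) = split /\n/, $chunk, 2;    # first line vs. everything else
(my $sequence = $body) =~ s/\s+//g;             # remove newlines and spaces
$sequence = lc $sequence;

print "$header\n$sequence\n";
```

    Splitting the header off first means the sequence can be cleaned with a single substitution instead of a regex that has to skip over the header.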

    Also you are looking for lower-case and the file contains upper case.

    If you're trying to parse out each big block as you showed it, here's what I'd do: Forget the record-separator stuff. I never touch it, and most people don't.

    Read the first line (loop and try again if the line is blank). Use the split function to separate it at the '|' chars.

    Read lines until you hit a blank line. Those are the big block of data.

    Write it out in the new format.
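
    The steps above can be sketched roughly as follows. This is a sketch under assumptions, not a drop-in replacement: it treats a line starting with '>' as the start of a new record rather than relying on blank lines (the posted sample shows no blank lines between records), split_fasta is a hypothetical helper name, the data is made up, and the records are kept in memory instead of being written to .seq files:

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA text line by line from a filehandle and
# return a list of hashes, each with the header fields and the sequence.
sub split_fasta {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;              # skip blank lines
        if ($line =~ /^>(.*)/) {
            # Header line: separate it at the '|' characters.
            push @records, { fields => [ split /\|/, $1 ], sequence => '' };
        }
        elsif (@records) {
            $line =~ s/\s+//g;                 # drop stray whitespace
            $records[-1]{sequence} .= $line;   # append to the current record
        }
    }
    return @records;
}

# Usage with an in-memory filehandle and made-up data.
my $data = ">gi|123|ref|XX_000001.1| Example virus DNA A\nGGCC\nATTA\n\n"
         . ">gi|456|ref|XX_000002.1| Example virus DNA B\nTTGG\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my @records = split_fasta($fh);
close $fh;

for my $rec (@records) {
    my $accession = $rec->{fields}[1];         # the field right after "gi"
    print "$accession: $rec->{sequence}\n";
}
```

    To write the new format, the print in the final loop would be replaced by opening an output file per record and printing the accession, acronym, and sequence to it.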

Re: FASTA Splitter
by citromatik (Curate) on Jun 01, 2009 at 22:39 UTC

    You can also take a look at this module I wrote some time ago. It lets you access a FASTA file through a Perl array, so the task could be solved with something like:

    use strict;
    use warnings;
    use Tie::File::AnyData::Bio::Fasta;
    use Fcntl;

    my $fastafile = 'j:\summer\begomo_genomes2.fasta';

    # Each element of @fastaFile is one complete FASTA record.
    tie my @fastaFile, 'Tie::File::AnyData::Bio::Fasta', $fastafile
        or die $!;

    for my $fastaRec (@fastaFile) {
        # The first line is the header; the rest is the sequence.
        my ($header, $rest) = split /\n/, $fastaRec, 2;
        my ($accession) = $header =~ /gi\|(\d+)/;

        # Write the record to its own file, named after the accession.
        tie my @outFasta, 'Tie::File::AnyData::Bio::Fasta', "${accession}.fasta",
            mode => O_RDWR | O_CREAT
            or die $!;
        @outFasta = ($fastaRec);
        untie @outFasta;
    }
    untie @fastaFile;

    citromatik

Re: FASTA Splitter
by lamprecht (Friar) on Jun 01, 2009 at 21:32 UTC
    Hi,

    Maybe something like BioPerl could help you?

    Cheers, Christoph
Re: FASTA Splitter
by perliff (Monk) on Jun 02, 2009 at 08:55 UTC
    Try this. Something like this should split your FASTA file (called bigfasta here) into several small FASTA files based on the sequence display ID (I assume your sequences have nice-looking identifiers). Learn to use BioPerl to your advantage for reading and writing biological sequence files. It has already been done by the excellent BioPerl project, and you don't want to reinvent the wheel every time.
    use strict; # always...
    use Bio::SeqIO;

    my $bigfasta = "bigfasta.faa";
    my $seqin = Bio::SeqIO->new(-file => $bigfasta, -format => "fasta");

    # Write each sequence to its own file, named after its display ID.
    while (my $inseq = $seqin->next_seq) {
        my $id = $inseq->display_id;
        my $outfile = "$id.fasta";
        my $seqout = Bio::SeqIO->new(-file => ">$outfile", -format => "fasta");
        $seqout->write_seq($inseq);
    }
    ----------------------

    "with perl on my side"

    "If you look at the code too long, the code also looks back at you"