Can you make this code shorter and/or quicker as well?

by Anonymous Monk
on Feb 25, 2014 at 00:06 UTC ( [id://1076056]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello all!
I have this code to create one-line FASTA files for some protein sequences.
"One-line" means joining all the lines of each entry that do not start with ">" into a single line. Here are sample entries and the code I have.
SAMPLE ENTRIES

>gi|321257144|ref|XP_003193485.1| flap endonuclease [Cryptococcus gattii WM276]
MGIKGLTGLLSENAPKCMKDHEMKTLFGRKVAIDASMSIYQFLIAVRQQDGQMLMNESGDVTSHLMGFFY
RTIRMVDHGIKPCYIFDGKPPELKGSVLAKRFARREEAKEGEEEAKETGTAEDVDKLARRQVRVTREHNE
ECKKLLSLMGIPVVTAPGEAEAQCAELARAGKVYAAGSEDMDTLTFHSPILLRHLTFSEAKKMPISEIHL
DVALRDLEMSMDQFIELCILLGCDYLEPCKGIGPKTALKLMREHGTLGKVVEHIRGKMAEKAEEIKAAAD
EEAEAEAEAEKYDSDPENEEGGETMINSDGEEVPAPSKPKSPKKKAPAKKKKIASSGMQIPEFWPWEEAK
QLFLKPDVVNGDDLVLEWKQPDTEGLVEFLCRDKGFNEDRVRAGAAKLSKMLAAKQQGRLDGFFTVKPKE
PAAKDAGKGKGKDTKGEKRKAEEKGAAKKKTKK
>gi|321473340|gb|EFX84308.1| hypothetical protein DAPPUDRAFT_47502 [Daphnia pulex]
MGIKGLTQVIGDTAPTAIKENEIKNYFGRKVAIDASMSIYQFLIAVRSEGAMLTSADGETTSHLMGIFYR
TIRMVDNGIKPVYVFDGKPPDMKGGELTKRAEKREEASKQLVLATDAGDAVEMEKMNKRLVKVNKGHTDE
CKQLLTLMGIPYVEAPCEAEAQCAALVKAGKVYATATEDMDSLTFGSNVLLRYLTYSEAKKMPIKEFHLD
KILDGLSYTMDEFIDLCIMLGCDYCDTIKGIGAKRAKELIDKHRCIEKVIENLDTKKYTVPENWPYQEAR
RLFKTPDVADAETLDLKWTQPDEEGLVKFMCGDKNFNEERIRSGAKKLCKAKTGQTQGRLDSFFKVLPSS
KPSTPSTPASKRKVGCIIYLFLYF

The code as it is now:
if(@ARGV!=1) { die "correct usage:\n\t perl oneline.pl <fasta_file> "};
open(FASTA,$ARGV[0]) || die "can't open fasta file";

$line="";
while($line !~ /^>/){
    $line=<FASTA>;
}
$line =~ s/[\r]//g; # remove carriage return
print $line;
$prev="";
while (1){
    $line=<FASTA>;
    $line=~ s/[\r]//g; # remove carriage return
    while ($line !~ /^>/){
        chomp $line;
        $prev=$prev.$line;
        $prev =~ s/[\r]//g; # remove carriage return
        $line=<FASTA>;
        if(!(defined $line)) { print $prev."\n"; close(FASTA); exit(1);}
    }
    print $prev."\n";
    $line =~ s/[\r]//g; # remove carriage return
    print $line;
    $prev="";
}

Can you help me make it faster and smaller if possible?

Replies are listed 'Best First'.
Re: Can you make this code shorter and/or quicker as well?
by Kenosis (Priest) on Feb 25, 2014 at 00:20 UTC

    Perhaps the following will be helpful:

    use strict;
    use warnings;

    local $/ = '>';

    while (<>) {
        chomp;
        my ( $id, $seq ) = /(.+?\n)(.+)/s or next;
        $seq =~ s/\s+//g;
        print ">$id$seq\n";
    }

    Command-line usage: perl script.pl fastaIn [>fastaOut]

    The last, optional parameter directs output to a file.

      How can I put it in a subroutine that I could call inline in my script?
      Something like:
      sub extract_query_data($file);
        Sorry, I meant:
        extract_query_data($file);
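
One possible way to wrap the record-separator approach above in a subroutine is sketched below. The name extract_query_data comes from the question; taking a filename and returning a list of [header, sequence] pairs are assumptions made for illustration, since the thread does not say what the sub should return.

```perl
use strict;
use warnings;

# Hypothetical helper (interface assumed, not specified in the thread):
# reads a FASTA file and returns a list of [header, sequence] pairs,
# with each sequence joined onto a single line.
sub extract_query_data {
    my ($file) = @_;
    open my $fh, '<', $file or die "Can't open '$file': $!";

    local $/ = '>';    # read one FASTA record per iteration
    my @records;
    while ( my $record = <$fh> ) {
        chomp $record;    # with $/ set to '>', chomp strips the trailing '>'
        # First line is the header, everything after it is sequence;
        # the empty chunk before the first '>' fails to match and is skipped.
        my ( $id, $seq ) = $record =~ /(.+?)\n(.+)/s or next;
        $id  =~ s/\r\z//;    # tolerate a CRLF-terminated header line
        $seq =~ s/\s+//g;    # drop newlines (and any carriage returns)
        push @records, [ ">$id", $seq ];
    }
    close $fh;
    return @records;
}
```

It could then be called inline as `my @records = extract_query_data($file);` followed by `print "$_->[0]\n$_->[1]\n" for @records;`. Because records are split on every ">", a header line that itself contains ">" would be split incorrectly.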
Re: Can you make this code shorter and/or quicker as well?
by kcott (Archbishop) on Feb 25, 2014 at 01:54 UTC

    This is shorter and I'd be surprised if it's not also faster. I've cut down the data for demo purposes.

    Input file:

    $ cat pm_1076056_in.fasta
    >gi|321257144|ref|XP_003193485.1| flap ...
    MGIKGLTG
    RTIRMVDH
    ECK
    >gi|321473340|gb|EFX84308.1| hypothetical ...
    MGIKGLTQ
    TIRMVDNG
    CKQ

    Script:

    $ cat oneline.pl
    use strict;
    use warnings;

    while (<>) {
        /^>/ ? ($. > 1 && print "\n") : chomp;
        print;
    }
    print "\n";

    Sample run:

    $ perl oneline.pl pm_1076056_in.fasta > pm_1076056_out.fasta

    Output file:

    $ cat pm_1076056_out.fasta
    >gi|321257144|ref|XP_003193485.1| flap ...
    MGIKGLTGRTIRMVDHECK
    >gi|321473340|gb|EFX84308.1| hypothetical ...
    MGIKGLTQTIRMVDNGCKQ

    -- Ken

      Excellent solution! But it still needs a tr/\r//d in there somewhere :)
        "Excellent solution!"

        Thank you.

        "But it still needs a tr/\r//d in there somewhere :)"

        I'm not sure that it does. If the input file was created on an MSWin system and then subsequently processed on a *nix system, then some additional handling of line-endings may be appropriate.

        The four instances of "s/[\r]//g" suggest multiple embedded carriage-returns, although that might just be poor coding. The single chomp doesn't add a lot of clarity either.

        If the OP can show where carriage-returns are embedded, I can certainly add a transliteration like you suggest.

        -- Ken
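
If the input does turn out to have Windows line endings, one possible adjustment to the loop above is sketched here. It assumes carriage returns occur only at line ends, which the thread does not confirm; the sub name fasta_to_oneline and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: copies FASTA from $in to $out, joining each
# entry's sequence lines into one line and accepting LF or CRLF input.
sub fasta_to_oneline {
    my ( $in, $out ) = @_;
    my $open_seq = 0;    # true while a sequence line has no newline yet
    while ( my $line = <$in> ) {
        $line =~ s/\r?\n\z//;    # strip an LF or CRLF line ending
        if ( $line =~ /^>/ ) {
            print {$out} "\n" if $open_seq;    # finish the previous sequence
            print {$out} "$line\n";
            $open_seq = 0;
        }
        elsif ( length $line ) {
            print {$out} $line;
            $open_seq = 1;
        }
    }
    print {$out} "\n" if $open_seq;
    return;
}

# Demo on an in-memory sample with CRLF endings (made-up data).
my $sample = ">seq1 demo\r\nMGIK\r\nGLTG\r\n>seq2 demo\r\nTIRM\r\n";
open my $demo_in, '<', \$sample or die "Can't read sample: $!";
my $result = '';
open my $demo_out, '>', \$result or die "Can't write result: $!";
fasta_to_oneline( $demo_in, $demo_out );
close $demo_out;
print $result;
```

Stripping the ending with a single anchored substitution replaces both the chomp and the repeated carriage-return removals; carriage returns embedded mid-line would still need something like tr/\r//d.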
