
Input record separator

by travisbickle34 (Beadle)
on Mar 18, 2005 at 09:46 UTC ( [id://440650]=perlquestion )

travisbickle34 has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,
I have an input file containing MANY records in this format:

>Record 1
AGTCTAGTCAT
CATCATAAGAT
CATCAATCACA
>Other Record
ATGAACAGCAG
ATGAAGAATGG
ATAG

I want to read them in from the file one at a time. However, setting the input record separator to '>' leads to problems: the first record is read in as '>' and the second record is read in as:
Record 1
AGTCTAGTCAT
CATCATAAGAT
CATCAATCACA
>


etc....

Can you suggest a way around this?
For example is it possible to set the input record separator to a newline ONLY when followed by a '>'? In this case the separator should not actually include the '>'.

Your help would be sincerely appreciated.

Regards,

TB34

Replies are listed 'Best First'.
Re: Input record separator
by Zaxo (Archbishop) on Mar 18, 2005 at 10:01 UTC

    Your wish won't work, because the input record separator cannot be treated as a regular expression.

    You have a pretty common situation, where records are marked by a prefix instead of a postfix separator. I think the easiest way to deal with that is to set local $/ to "\n>" and take the first record as a special case. Call chomp to remove the "\n>" from the end of the records.
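
A minimal sketch of that suggestion, for illustration only. The helper name read_records and the in-memory sample are assumptions, not part of the original reply; the sample is the data from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: reads FASTA-style records from an open filehandle
# and returns a list of [ header, sequence ] pairs.
sub read_records {
    my ($fh) = @_;
    local $/ = "\n>";               # a record ends at a newline followed by '>'
    my @records;
    while ( my $rec = <$fh> ) {
        chomp $rec;                 # strips a trailing "\n>" (the value of $/)
        $rec =~ s/\A>// if $. == 1; # special case: only the first record keeps its '>'
        my ( $header, @lines ) = split /\n/, $rec;
        push @records, [ $header, join( '', @lines ) ];
    }
    return @records;
}

# Demo on an in-memory filehandle holding the sample data from the question.
my $sample = ">Record 1\nAGTCTAGTCAT\nCATCATAAGAT\nCATCAATCACA\n"
           . ">Other Record\nATGAACAGCAG\nATGAAGAATGG\nATAG\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my @records = read_records($fh);
close $fh;
printf "%s => %s\n", @$_ for @records;
```

The last record has no "\n>" after it, so chomp leaves it alone; split then drops the trailing empty field left by the final newline.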

    After Compline,
    Zaxo

      Thanks a lot guys - both very valid and helpful suggestions.
      Being a newbie, I was under the misconception that chomp ONLY removed newlines.
        No, chomp removes whatever currently is in $/, or what it matches (= multiple newlines) in the only case that actually acts like a regexp: paragraph mode.
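
A small sketch of the three cases mentioned here (custom separator, default separator, paragraph mode). The variable names and strings are made up for the demonstration.

```perl
use strict;
use warnings;

# chomp strips one trailing copy of whatever $/ currently holds.
my $custom = "Record 1\nAGTC\n>";
{
    local $/ = "\n>";
    chomp $custom;              # now "Record 1\nAGTC"
}

# With the default $/ ("\n"), only a single trailing newline is removed.
my $default = "AGTC\n\n";
chomp $default;                 # now "AGTC\n"

# Paragraph mode ($/ = ""): chomp removes all trailing newlines.
my $para = "AGTC\n\n\n";
{
    local $/ = "";
    chomp $para;                # now "AGTC"
}

print "custom ok\n"  if $custom  eq "Record 1\nAGTC";
print "default ok\n" if $default eq "AGTC\n";
print "para ok\n"    if $para    eq "AGTC";
```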
Re: Input record separator
by monkey_boy (Priest) on Mar 18, 2005 at 09:57 UTC

    Why not just do it the simple way, as chomp will remove the '>' (if $INPUT_RECORD_SEPARATOR = '>'), then skip empty records & re-add the '>',
    something like:

        use English;
        local $INPUT_RECORD_SEPARATOR = '>';
        while (my $buf = <$fh>) {
            # remove '>' & skip empty records:
            chomp($buf);
            next if ($buf =~ m/^\s*?$/);
            my $fasta = ">$buf\n";
        };

    *UPDATE*: added use English;


    This is not a Signature...
Re: Input record separator
by BrowserUk (Patriarch) on Mar 18, 2005 at 10:31 UTC

    Try this:

        #! perl -slw
        use strict;

        local $/ = "\n>";
        while( <DATA> ) {
            chomp;
            my( $label, $data ) = m[\A>?(.*?)\n(.*)$]sm;
            $data =~ tr[\n][]d;
            printf "%20s : %s\n", $label, $data;
        }

        =Output

        P:\test>junk3
                    Record 1 : AGTCTAGTCATCATCATAAGATCATCAATCACA
              Other Record 1 : ATGAACAGCAGATGAAGAATGGATAG
                    Record 2 : AGTCTAGTCATCATCATAAGATCATCAATCACA
              Other Record 2 : ATGAACAGCAGATGAAGAATGGATAG

        =cut

        __DATA__
        >Record 1
        AGTCTAGTCAT
        CATCATAAGAT
        CATCAATCACA
        >Other Record 1
        ATGAACAGCAG
        ATGAAGAATGG
        ATAG
        >Record 2
        AGTCTAGTCAT
        CATCATAAGAT
        CATCAATCACA
        >Other Record 2
        ATGAACAGCAG
        ATGAAGAATGG
        ATAG

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco.
    Rule 1 has a caveat! -- Who broke the cabal?
Re: Input record separator
by MadraghRua (Vicar) on Mar 18, 2005 at 21:16 UTC
    Go check out BioPerl. They already have modules to deal with the problem and tutorials to help you figure it out. It saves reinventing the wheel.

    An alternative is to check out the O'Reilly book by Tisdale, Beginning Perl for Bioinformatics; again, there are many good examples of this type of problem in there.

    MadraghRua
    yet another biologist hacking perl....

      Even better.... check out the EMBOSS suite of programs (but you need a system administrator to install it on Linux).
Re: Input record separator
by tlm (Prior) on Mar 18, 2005 at 15:02 UTC

    I have an input file containing MANY records in this format:

        >Record 1
        AGTCTAGTCAT
        CATCATAAGAT
        CATCAATCACA
        >Other Record
        ATGAACAGCAG
        ATGAAGAATGG
        ATAG

    Ah, yes. The ol' FASTA. If you know that you have no repeats in the id/description info after the '>', the following "one liner" gets you a long way with a standard FASTA file:

        % perl -lne 's/>//?$s=$_:$s{$s}.=$_;\
        > END{ <do smthng w/ %s> }' \
        > huge.fasta massive.fasta humongo.fasta

    The uniqueness condition mentioned above is crucial, otherwise this scriptlet will mess with your mind.
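
To make the uniqueness condition concrete, here is a hedged, longhand sketch of the same idea that refuses to continue when an id repeats, rather than silently appending to the earlier sequence. The helper name load_fasta and the sample string are assumptions for illustration, not part of the one-liner.

```perl
use strict;
use warnings;

# Hypothetical helper: builds { id => sequence } from a filehandle and
# dies on a repeated header instead of silently merging sequences.
sub load_fasta {
    my ($fh) = @_;
    my ( %seq, $id );
    while ( my $line = <$fh> ) {
        $line =~ s/\r?\n\z//;              # strip the line ending
        if ( $line =~ s/^>// ) {           # header line: start a new record
            $id = $line;
            die "duplicate id '$id' at line $.\n" if exists $seq{$id};
            $seq{$id} = '';
        }
        elsif ( defined $id ) {            # sequence line: append to current record
            $seq{$id} .= $line;
        }
    }
    return \%seq;
}

my $sample = ">Record 1\nAGTCTAGTCAT\nCATCATAAGAT\n>Other Record\nATAG\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my $seqs = load_fasta($fh);
close $fh;
print "$_ => $seqs->{$_}\n" for sort keys %$seqs;
```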

    A generically useful specialization of the above is

        % perl -MStorable -lne 's/>//?$s=$_:$s{$s}.=$_;\
        > END{ store \%s, "for_later" }' \
        > huge.fasta massive.fasta humongo.fasta

    Then you can read the-hash-formerly-known-as-%s from any script whenever you please. See Storable. Keep in mind, however, that, if left unattended, Storable::store clobbers without remorse.
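
A rough sketch of the round trip, with a guard against the clobbering mentioned above. The wrapper store_once is a hypothetical helper, and a temporary directory stands in for the real location of "for_later".

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);

# Hypothetical wrapper: refuse to overwrite an existing file.
sub store_once {
    my ( $data, $path ) = @_;
    die "refusing to overwrite '$path'\n" if -e $path;
    return store( $data, $path );
}

my %s = ( 'Record 1' => 'AGTCTAGTCAT', 'Other Record' => 'ATGAACAGCAG' );

# Use a scratch directory so the demo touches nothing else.
my $dir  = tempdir( CLEANUP => 1 );
my $path = "$dir/for_later";

store_once( \%s, $path );            # write the hash to disk

# Later, possibly from a different script:
my $restored = retrieve($path);
print "$_ => $restored->{$_}\n" for sort keys %$restored;
```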

    the lowliest monk

Re: Input record separator
by fletcher_the_dog (Friar) on Mar 18, 2005 at 16:17 UTC
    If your data is consistent, just chuck the first one:

        use strict;
        open FILE,"data.txt" or die "could not open data.txt: $!\n";
        # set separator to ">"
        local $/ = ">";
        # chuck first one
        my $junk = <FILE>;
        while(my $record = <FILE>){
            chomp $record;
            # do important stuff
        }
Re: Input record separator
by TedPride (Priest) on Mar 18, 2005 at 18:33 UTC
    You haven't stated what you want to do with the data once you have it. So I've written the following for both single item lookup and output in order, using the same set of data storage:
        use strict;
        use warnings;

        my ($key, %hash, @order);
        while (<DATA>) {
            $key = substr($_,1,length($_)-2);
            $hash{$key} = <DATA>.<DATA>.<DATA>;
            push @order, $key;
        }

        # To look up a specific record:
        $key = 'Record 3';
        print "$key =>\n$hash{$key}\n";

        # To output the records in order:
        print "$_ =>\n$hash{$_}\n" for @order;

        __DATA__
        >Record 1
        AGTCTAGTCAT
        CATCATAAGAT
        CATCAATCACA
        >Record 2
        ATGAACAGCAG
        ATGAAGAATGG
        ATAG
        >Record 3
        AGTCTAGTCAT
        CATCATAAGAT
        CATCAATCACA
        >Record 4
        ATGAACAGCAG
        ATGAAGAATGG
        ATAG
regex record separator is certainly possible
by inq123 (Sexton) on Mar 20, 2005 at 01:33 UTC
    but certainly not recommended. :) Using File::Stream one could set $/ to regex, but this approach suffers from several caveats and I wouldn't recommend it.

    Aside from this, the suggestions above are quite good and cover most of what I would suggest. But just to add a bit of value to the discussion, purely IMHO, the best approach (as already suggested) might be to use bioperl, 'cause who doesn't want to have somebody else taking care of any format change and dealing with potential problems therein? :)

    Another thing is that setting $/ = "\n>" is a correct approach, but ">" is not, since the FASTA format does not demand that the seq description be free of '>'. I would also certainly set performance as the highest priority in dealing with FASTA format (if I choose not to use Bioperl for some reason), thus code like the following would be an OK alternative to using bioperl:

        $/ = "\n>";
        while (<DATA>) {
            chomp;
            my $seq = /^>/ ? "$_\n" : ">$_\n";
            print "seq is:\n$seq";
        }
        __DATA__
        >Record 1
        AGTCTAGTCAT
        CATCATAAGAT
        CATCAATCACA
        >Record 2
        ATGAACAGCAG
        ATGAAGAATGG
        ATAG
        >Record 3
        AGTCTAGTCAT
        CATCATAAGAT
        CATCAATCACA
        >Record 4
        ATGAACAGCAG
        ATGAAGAATGG
        ATAG

    Now the above solution is all good, until you consider using it on a FASTA file generated on a Mac. So maybe we need File::Stream after all? But it is not a solution for huge files. So maybe we just use Bioperl and hope they have dealt with this issue?
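
One possible way around the line-ending issue without a regex separator, offered as a sketch only: peek at the start of the file, guess the line ending, and build $/ from it. The helpers detect_eol and read_records are hypothetical names, the handle is assumed to be seekable, and the guess can be wrong if the first line ending falls exactly on the 4096-byte peek boundary.

```perl
use strict;
use warnings;

# Hypothetical helper: peeks at the start of a seekable filehandle and
# returns the first line ending it finds ("\r\n", "\r" or "\n").
sub detect_eol {
    my ($fh) = @_;
    my $pos = tell $fh;
    read $fh, my $chunk, 4096;
    seek $fh, $pos, 0;                  # rewind to where we started
    return $1 if defined $chunk && $chunk =~ /(\r\n|\r|\n)/;
    return "\n";                        # fallback when no line ending is seen
}

# Hypothetical helper: returns one array ref per record,
# [ header, sequence line, sequence line, ... ].
sub read_records {
    my ($fh) = @_;
    my $eol = detect_eol($fh);
    local $/ = "$eol>";                 # record separator built from the detected ending
    my @records;
    while ( my $rec = <$fh> ) {
        chomp $rec;                     # strips a trailing "$eol>"
        $rec =~ s/\A>// if $. == 1;     # only the first record keeps its '>'
        push @records, [ split /\Q$eol\E/, $rec ];  # trailing empty field is dropped
    }
    return @records;
}

# Demo: sample data with classic Mac ("\r") line endings.
my $mac = ">Record 1\rAGTC\rCATC\r>Record 2\rATAG\r";
open my $fh, '<', \$mac or die "cannot open in-memory file: $!";
my @records = read_records($fh);
close $fh;
print join( ' | ', @$_ ), "\n" for @records;
```

This still reads one record at a time, so memory use stays proportional to the largest record rather than to the file.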

    Or maybe I'm just making this simple issue sound more and more complicated? Now that's my only gift. :)

    Still, hope it helps. ;)

    Updated
