andyBio has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'd would appreciate some help with formatting a fasta file with perl. My input file looks like this:
>dfbdbgf_356dfbdf ATGGCTATCGATT >sdgthhr_478364df ATGGCTATCGATT
I want to have an output file that looks like this:
ATGGCTATCGATT 4 TGCATGCGCTACG 7
Basically, I want to:

-delete the header line and also delete reads that are NOT between 15-30 bp long

-Count the number of times that a read appears and put the value in column 2.

-Delete non-unique reads.

Thanks

Replies are listed 'Best First'.
Re: How to format fasta file
by AnomalousMonk (Archbishop) on Apr 09, 2016 at 23:49 UTC

    Hi andyBio! I'm a PerlMonk, not a BioMonk: you will have to lead me by the hand.

    ... I want to: ... delete reads ... [c]ount ... read[s] ... [d]elete non-unique reads.

    What's a "read"? Also, the sequences in both records of your input example are 'ATGGCTATCGATT', but while the first example output record is 'ATGGCTATCGATT' (with four reads in it), the second output record (from the same input data) is 'TGCATGCGCTACG' with seven (!) reads. More confusion — at least for me. Please see I know what I mean. Why don't you?

    Update: A Wikipedia link ([wp://...]) or suchlike to a discussion of what a "read" is might be helpful. Please see What shortcuts can I use for linking to other information?


    Give a man a fish:  <%-{-{-{-<

      Indeed.

      How should we know how many characters one BP (base pair) is? In this case: 1 byte (a single A,T,G,C letter represents a basepair)

Re: How to format fasta file
by Laurent_R (Canon) on Apr 10, 2016 at 07:58 UTC
    Hi,

    First, your DNA sequences are less than 15 character long. I'll assume it is just a mistake in your example, but this may need further clarification.

    Assuming that you want to:

    - omit lines starting with ">" (headers);

    - remove lines more than 30 or less than 15 characters, and

    - count the number of occurrences of each individual sequence,

    you could do something like this:

    use strict; use warnings; my %count_seq; while (<DATA>) { chomp; next if /^>/; # discard headers next if length($_) > 30 or length($_) < 15; # discard unwanted siz +es $count_seq{$_}++; # count occurrences } print "$_\t$count_seq{$_}\n" for keys %count_seq; __DATA__ >dfbdbgf_356dfbdf ATGGCTGGATATCGATT >sdgthhr_478364df ATGGCTATGGATCAGATT >dfbdbgf_356dfbdf ATGGCTATCGATT >dfbdbgf_356dfbdg ATGGCTGGATATCGATT >sdgthhr_478364df ATGGCTATGGATCGATT >dfbdbgf_356dfbdg ATGGCTGGATATCGATT >sdgthhr_478364df ATGGCTATGGATCGATT TGCATGCGCTATTAGCG ATGGCTATGGATCGATT TGCATGCGCTATTAGCG ATGGCTATGGATCGATT TGCATGCCCTATTAGCG
    This gives me the following result:
    $ perl dna_seq.pl TGCATGCGCTATTAGCG 2 ATGGCTATGGATCAGATT 1 ATGGCTGGATATCGATT 3 ATGGCTATGGATCGATT 4 TGCATGCCCTATTAGCG 1
      Thanks for the response. I tried the code but got an error:
      Use of uninitialized value $_ in scalar chomp at test.pl line 37. Use of uninitialized value $_ in pattern match (m//) at test.pl line 3 +8. Use of uninitialized value $_ in numeric gt (>) at test.pl line 39. Use of uninitialized value $_ in numeric lt (<) at test.pl line 39.
      Lines 37 - 39 are:
      chomp; next if /^>/; # discard headers next if length($_) > 30 or length($_) < 15; # discard unwanted + sizes
      Kindly assist. Thanks

        Assisting: both errors appear to refer to a lack of content in $_ the default variable set by various operations. $_ is explicitly mentioned in your line 3 and implicit in your line 2.

        Inserting a print $_\n before the chomp will probably confirm or rebut my hypothesis. Then, if I'm on target, you can read up to find what you expected to set the default.

        Andy, better change that '<DATA>' into '<STDIN>', then you can either:

        Windows:

        type fasta.txt | perl laurent_example.pl

        Unix:

        cat fasta.txt | perl laurent_example.pl

        And to get rid of your current error, just add this line after the chomp;

        next if($_=~/^\s*$/);

        Look Perlmonks is not a code writing service, we expect you to show efforts to learn programming.

        Hint: check on how to open a file instead of reading appended data like in Laurent's demonstration

        update

        ... and/or avoid empty lines after __DATA__ if you are REALLY only running the demo code!

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

        Hi andyBio,

        the program I posted did not have any of the errors or warnings that you report. This most probably means that the error is somewhere in the changes you made to the code.

        Although I have some ideas on what might be wrong in your program (along the hints supplied by other monks), we can't fix a program that we don't see. The only way we can help you is if you show the code you're now using, with the changes you made.