How to format fasta file

andyBio has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: How to format fasta file by AnomalousMonk (Archbishop) on Apr 09, 2016 at 23:49 UTC
Hi andyBio! I'm a PerlMonk, not a BioMonk: you will have to lead me by the hand. ... I want to: ... delete reads ... [c]ount ... read[s] ... [d]elete non-unique reads. What's a "read"? Also, the sequences in both records of your input example are `'ATGGCTATCGATT'`, but while the first example output record is `'ATGGCTATCGATT'` (with four reads in it), the second output record (from the same input data) is `'TGCATGCGCTACG'` with seven (!) reads. More confusion — at least for me. Please see I know what I mean. Why don't you? Update: A Wikipedia link (`[wp://...]`) or suchlike to a discussion of what a "read" is might be helpful. Please see What shortcuts can I use for linking to other information? Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re^2: How to format fasta file by FreeBeerReekingMonk (Deacon) on Apr 10, 2016 at 23:18 UTC
Indeed. How should we know how many characters one BP (base pair) is? In this case: 1 byte (a single A,T,G,C letter represents a basepair)	[reply]
Re: How to format fasta file by Laurent_R (Canon) on Apr 10, 2016 at 07:58 UTC
Hi, First, your DNA sequences are less than 15 character long. I'll assume it is just a mistake in your example, but this may need further clarification. Assuming that you want to: - omit lines starting with ">" (headers); - remove lines more than 30 or less than 15 characters, and - count the number of occurrences of each individual sequence, you could do something like this: use strict; use warnings; my %count_seq; while (<DATA>) { chomp; next if /^>/; # discard headers next if length($_) > 30 or length($_) < 15; # discard unwanted siz +es $count_seq{$_}++; # count occurrences } print "$_\t$count_seq{$_}\n" for keys %count_seq; __DATA__ >dfbdbgf_356dfbdf ATGGCTGGATATCGATT >sdgthhr_478364df ATGGCTATGGATCAGATT >dfbdbgf_356dfbdf ATGGCTATCGATT >dfbdbgf_356dfbdg ATGGCTGGATATCGATT >sdgthhr_478364df ATGGCTATGGATCGATT >dfbdbgf_356dfbdg ATGGCTGGATATCGATT >sdgthhr_478364df ATGGCTATGGATCGATT TGCATGCGCTATTAGCG ATGGCTATGGATCGATT TGCATGCGCTATTAGCG ATGGCTATGGATCGATT TGCATGCCCTATTAGCG [download] This gives me the following result: `$ perl dna_seq.pl TGCATGCGCTATTAGCG 2 ATGGCTATGGATCAGATT 1 ATGGCTGGATATCGATT 3 ATGGCTATGGATCGATT 4 TGCATGCCCTATTAGCG 1` [download]	[reply] [d/l] [select]
Re^2: How to format fasta file by andyBio (Novice) on Apr 10, 2016 at 20:58 UTC
Thanks for the response. I tried the code but got an error: `Use of uninitialized value $_ in scalar chomp at test.pl line 37. Use of uninitialized value $_ in pattern match (m//) at test.pl line 3 +8. Use of uninitialized value $_ in numeric gt (>) at test.pl line 39. Use of uninitialized value $_ in numeric lt (<) at test.pl line 39.` [download] Lines 37 - 39 are: `chomp; next if /^>/; # discard headers next if length($_) > 30 or length($_) < 15; # discard unwanted + sizes` [download] Kindly assist. Thanks	[reply] [d/l] [select]
Re^3: How to format fasta file by ww (Archbishop) on Apr 10, 2016 at 21:36 UTC
Assisting: both errors appear to refer to a lack of content in `$_` the default variable set by various operations. `$_` is explicitly mentioned in your line 3 and implicit in your line 2. Inserting a `print $_\n` before the `chomp` will probably confirm or rebut my hypothesis. Then, if I'm on target, you can read up to find what you expected to set the default. Spirit of the Monastery	[reply] [d/l] [select]
Re^3: How to format fasta file by FreeBeerReekingMonk (Deacon) on Apr 10, 2016 at 23:24 UTC
Andy, better change that `'<DATA>'` into `'<STDIN>'`, then you can either: Windows: `type fasta.txt \| perl laurent_example.pl` Unix: `cat fasta.txt \| perl laurent_example.pl` And to get rid of your current error, just add this line after the chomp; `next if($_=~/^\s*$/);`	[reply] [d/l] [select]
Re^3: How to format fasta file by LanX (Saint) on Apr 10, 2016 at 21:34 UTC
Look Perlmonks is not a code writing service, we expect you to show efforts to learn programming. Hint: check on how to open a file instead of reading appended data like in Laurent's demonstration update ... and/or avoid empty lines after `__DATA__` if you are REALLY only running the demo code! Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!}	[reply] [d/l]
Re^3: How to format fasta file by Laurent_R (Canon) on Apr 11, 2016 at 06:18 UTC
Hi andyBio, the program I posted did not have any of the errors or warnings that you report. This most probably means that the error is somewhere in the changes you made to the code. Although I have some ideas on what might be wrong in your program (along the hints supplied by other monks), we can't fix a program that we don't see. The only way we can help you is if you show the code you're now using, with the changes you made.	[reply]
Re^4: How to format fasta file by andyBio (Novice) on Apr 11, 2016 at 18:17 UTC
Re^5: How to format fasta file by Laurent_R (Canon) on Apr 11, 2016 at 18:52 UTC
Some notes below your chosen depth have not been shown here

update