in reply to How to format fasta file
First, your DNA sequences are less than 15 characters long. I'll assume that is just a mistake in your example, but this may need further clarification.
Assuming that you want to:
- omit lines starting with ">" (headers);
- remove lines longer than 30 or shorter than 15 characters, and
- count the number of occurrences of each individual sequence,
you could do something like this:
    use strict;
    use warnings;

    my %count_seq;
    while (<DATA>) {
        chomp;
        next if /^>/;                                # discard headers
        next if length($_) > 30 or length($_) < 15;  # discard unwanted sizes
        $count_seq{$_}++;                            # count occurrences
    }
    print "$_\t$count_seq{$_}\n" for keys %count_seq;

    __DATA__
    >dfbdbgf_356dfbdf
    ATGGCTGGATATCGATT
    >sdgthhr_478364df
    ATGGCTATGGATCAGATT
    >dfbdbgf_356dfbdf
    ATGGCTATCGATT
    >dfbdbgf_356dfbdg
    ATGGCTGGATATCGATT
    >sdgthhr_478364df
    ATGGCTATGGATCGATT
    >dfbdbgf_356dfbdg
    ATGGCTGGATATCGATT
    >sdgthhr_478364df
    ATGGCTATGGATCGATT
    TGCATGCGCTATTAGCG
    ATGGCTATGGATCGATT
    TGCATGCGCTATTAGCG
    ATGGCTATGGATCGATT
    TGCATGCCCTATTAGCG

This gives me the following result:

    $ perl dna_seq.pl
    TGCATGCGCTATTAGCG	2
    ATGGCTATGGATCAGATT	1
    ATGGCTGGATATCGATT	3
    ATGGCTATGGATCGATT	4
    TGCATGCCCTATTAGCG	1
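Note that `keys` returns the sequences in no particular order, so the listing can change from one run to the next. If you would rather see the most frequent sequences first, here is a rough sketch of the same idea wrapped in two subroutines. The names `count_sequences` and `print_counts` are my own invention, the 15 and 30 limits are passed in as parameters, and the stripping of a trailing carriage return is an assumption in case your file comes from Windows. The sample is read from an in-memory string so the snippet runs as is.

```perl
use strict;
use warnings;

# Hypothetical helper: count sequence lines read from a filehandle,
# keeping only those whose length is between $min_len and $max_len.
sub count_sequences {
    my ($fh, $min_len, $max_len) = @_;
    my %count;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\r$//;                 # assumption: tolerate Windows line endings
        next if $line =~ /^>/;            # discard headers
        my $len = length $line;
        next if $len < $min_len or $len > $max_len;   # discard unwanted sizes
        $count{$line}++;                  # count occurrences
    }
    return \%count;
}

# Hypothetical helper: print counts, most frequent first,
# with ties broken alphabetically so the order is stable.
sub print_counts {
    my ($count) = @_;
    my @sorted = sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count;
    print "$_\t$count->{$_}\n" for @sorted;
}

# A few records taken from the example data above.
my $sample = <<'END';
>dfbdbgf_356dfbdf
ATGGCTGGATATCGATT
>sdgthhr_478364df
ATGGCTATGGATCAGATT
>dfbdbgf_356dfbdf
ATGGCTATCGATT
>dfbdbgf_356dfbdg
ATGGCTGGATATCGATT
END

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
print_counts(count_sequences($fh, 15, 30));
close $fh;
```

To run this on a real file, replace the in-memory open with something like `open my $fh, '<', $ARGV[0] or die ...` and pass the file name on the command line.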