How to count the length of a sequence of alphabets and number of occurence of a particular alphabet in the sequence?

davi54 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a very basic question. I have a list of protein sequences, where each entry has a header followed by the actual sequence (all uppercase letters) and is separated from the next entry by a blank line (as shown in the example below). Is there a way for a script to read each uppercase sequence and output the length of the sequence and the number of times the letter A occurs in it? For the two example sequences below:

>sp|O24310|EFTU_PEA Elongation factor Tu, chloroplastic OS=Pisum sativum OX=3888 GN=TUFA PE=2 SV=1

MALSSTAATTSSKLKLSNPPSLSHTFTASASASVSNSTSFR

>sp|Q43467|EFTU1_SOYBN Elongation factor Tu, chloroplastic OS=Glycine max OX=3847 GN=TUFA PE=3 SV=1

MAVSSATASSKLILLPHASSSSSLNSTPFRSSTTNTHKLTPADSTHNIKL

I want the output to look like:

Sequence: MALSSTAATTSSKLKLSNPPSLSHTFTASASASVSNSTSFR

Length: 41

A: 6

Sequence: MAVSSATASSKLILLPHASSSSSLNSTPFRSSTTNTHKLTPADSTHNIKL

Length: 50

A: 5

Any help would be appreciated. Thank you so much.


Re: How to count the length of a sequence of alphabets and number of occurence of a particular alphabet in the sequence? by rjt (Curate) on Oct 07, 2019 at 17:36 UTC
You can count the number of occurrences of a particular character in a string with tr:

    use 5.010;

    my $str1 = 'MALSSTAATTSSKLKLSNPPSLSHTFTASASASVSNSTSFR';
    my $a_count = $str1 =~ tr/A/A/; # 6

length will give you the length of the entire string.

Now to actually pull out the uppercase sequence from your sample input, are you reading lines from a file? Something like this would probably work:

    #!/usr/bin/env perl
    use 5.010;

    for (<>) {
        chomp;    # Drop the trailing newline so length is not off by one
        if (/^>/) {
            # Header
        }
        elsif (/^[A-Z]+$/) {
            # Protein
            my $a = tr/A/A/;
            say "A: $a, length: " . length;
        }
    }

Then simply run it with `script.pl < protein.txt`. Modify the `say ...` line to taste, or more likely, replace it with the rest of your logic. You can also choose to parse the header if needed, in the `# Header` section.

You could of course modify this to actually open the file in your script with open instead, if that is more desirable:

    open my $fh, '<', $filename or die "Couldn't open $filename: $!";
    for (<$fh>) {

`use strict; use warnings;` omitted for brevity.
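For reference, the pieces above can be assembled into one script that prints the layout requested in the question. This is only a sketch under a few assumptions: the helper names `seq_stats` and `format_record` are made up for illustration, the sample records are held in memory rather than read from a file, and each sequence is assumed to sit on a single line (real FASTA files often wrap sequences across several lines, which this does not handle).

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: returns (length, number of times $letter occurs).
sub seq_stats {
    my ($seq, $letter) = @_;
    # Assigning the list of matches to () in scalar context yields the match count.
    my $count = () = $seq =~ /\Q$letter\E/g;
    return (length($seq), $count);
}

# Hypothetical helper: formats one record in the layout asked for in the question.
sub format_record {
    my ($seq, $letter) = @_;
    my ($len, $count) = seq_stats($seq, $letter);
    return "Sequence: $seq\nLength: $len\n${letter}: $count\n";
}

# Sample input from the question, kept in memory so the sketch is self-contained.
my $fasta = <<'END';
>sp|O24310|EFTU_PEA Elongation factor Tu, chloroplastic OS=Pisum sativum OX=3888 GN=TUFA PE=2 SV=1
MALSSTAATTSSKLKLSNPPSLSHTFTASASASVSNSTSFR

>sp|Q43467|EFTU1_SOYBN Elongation factor Tu, chloroplastic OS=Glycine max OX=3847 GN=TUFA PE=3 SV=1
MAVSSATASSKLILLPHASSSSSLNSTPFRSSTTNTHKLTPADSTHNIKL
END

open my $in, '<', \$fasta or die "Couldn't open in-memory input: $!";
while (my $line = <$in>) {
    chomp $line;
    next unless $line =~ /^[A-Z]+$/;    # Skip headers and blank lines
    print format_record($line, 'A'), "\n";
}
close $in;
```

The regex count is used here instead of tr only because tr does not interpolate variables, so it cannot take the letter as a parameter without a string eval. For a fixed letter, `tr/A/A/` as shown above is the simpler choice.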
Re^2: How to count the length of a sequence of alphabets and number of occurence of a particular alphabet in the sequence? by davi54 (Sexton) on Oct 07, 2019 at 18:08 UTC
Hey, this worked perfectly! I have a follow-up question. Is there a way to get the output in a file instead of the terminal? And is there a way to bin the outputs for different string lengths? Thanks a ton.
Re^3: How to count the length of a sequence of alphabets and number of occurence of a particular alphabet in the sequence? by rjt (Curate) on Oct 07, 2019 at 18:30 UTC
These are indeed basic questions, as pointed out by stevieb. There are several ways to output to a file. Your operating system itself can probably do it with output redirection, or from within Perl, the open command can write to files as well as read them. The documentation for each has more information and examples.

To put the outputs into bins for different string lengths, hashes are an excellent way to do that. I would use a hash of array refs. The perldata page is another great documentation resource that will introduce you to the required concepts.

The basic algorithm in your case would be to do everything you're already doing, but instead of displaying the "A" count and length with `say`, you would store it in a hash. The hash key would be the `length`, and the hash value would be an array ref. Untested:

    my $len = length;
    $bins{$len} //= [ ];        # Set to blank array ref if not already set.
    push @{$bins{$len}}, $a;    # Add $a to the array

When it's all over, `%bins` (which you will of course need to declare before the main loop) has your A counts. See sort for an explanation of how to sort your data:

    for my $len (sort { $a <=> $b } keys %bins) {
        say "Length $len:";
        say for @{$bins{$len}};
    }

Some assembly and individual research required. :-)

`use strict; use warnings;` omitted for brevity.
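A self-contained sketch of the hash-of-array-refs idea might look like the following. The helper name `bin_by_length` and the three short toy strings are made up for illustration and are not real protein data. It also relies on autovivification, so the explicit `//= [ ]` step is not strictly needed.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: groups the A counts of the given sequences by sequence length.
sub bin_by_length {
    my @seqs = @_;
    my %bins;
    for my $seq (@seqs) {
        my $count = ($seq =~ tr/A//);    # Empty replacement list: counts without modifying
        # push autovivifies the array ref the first time a length is seen.
        push @{ $bins{ length($seq) } }, $count;
    }
    return \%bins;
}

# Toy input, for illustration only.
my $bins = bin_by_length('MAAK', 'MKLA', 'MA');

for my $len (sort { $a <=> $b } keys %$bins) {
    say "Length $len: ", join(' ', @{ $bins->{$len} });
}
# prints:
# Length 2: 1
# Length 4: 2 1
```

Returning a hash ref keeps the binning separate from the printing, so the same data could later be written to a file or summarised differently.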
Re^3: How to count the length of a sequence of alphabets and number of occurence of a particular alphabet in the sequence? by stevieb (Canon) on Oct 07, 2019 at 18:15 UTC
Welcome davi54! This site is about getting help for coding problems. Opening files, which is what you just asked about, is one of the most trivial and widely used pieces of functionality, and one that any Perl programmer must learn early. Hint: once you get your file handle opened successfully, supply it to the print function before the information you want to print to it: `print $fh "what I want to print to file\n";`. Please do some of your own homework, and give this part a try instead of having others write all of your code for you. See open. For the latter question, when you say "bin", do you mean trash bin? I don't quite understand what "bin the outputs" means.
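As a rough illustration of the hint above, here is one way the open-then-print pattern could look. The helper name `write_report` and the file name `protein_stats.txt` are assumptions made for this sketch, and the file is written into a temporary directory (via the core File::Temp module) only so that the example cleans up after itself.

```perl
use strict;
use warnings;
use 5.010;
use File::Temp qw(tempdir);

# Hypothetical helper: writes each line to $path and returns how many lines were written.
sub write_report {
    my ($path, @lines) = @_;
    # '>' opens for writing and truncates any existing file; '>>' would append instead.
    open my $out, '>', $path or die "Couldn't open $path for writing: $!";
    print {$out} "$_\n" for @lines;    # The file handle comes before the data, with no comma
    close $out or die "Couldn't close $path: $!";
    return scalar @lines;
}

# Assumed output location: a throwaway directory removed when the script exits.
my $dir     = tempdir(CLEANUP => 1);
my $outfile = "$dir/protein_stats.txt";

write_report($outfile, 'Length: 41', 'A: 6');
say "Wrote report to $outfile";
```

In a real script you would pass an ordinary path instead of the temporary one. Checking the return value of close is optional but catches errors such as a full disk.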
Re^4: How to count the length of a sequence of alphabets and number of occurence of a particular alphabet in the sequence? by davi54 (Sexton) on Oct 11, 2019 at 15:05 UTC
Re^5: How to write to a file? by hippo (Archbishop) on Oct 11, 2019 at 15:47 UTC
Re: How to count the length of a sequence of alphabets and number of occurence of a particular alphabet in the sequence? by 1nickt (Canon) on Oct 07, 2019 at 17:53 UTC
Hi, welcome to Perl, the One True Religion. You have chosen the right language for your task; Perl excels at processing large amounts of data. Some people have been working in the field for quite some time; have you seen BioPerl? Hope this helps! The way forward always starts with a minimal test.
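A minimal test for this task could be sketched with the core Test::More module, using the two sequences and the expected numbers from the original question. The helper name `count_letter` is made up for this example; it is not part of BioPerl or of any reply above.

```perl
use strict;
use warnings;
use Test::More tests => 4;

# Hypothetical helper: number of times $letter occurs in $seq.
sub count_letter {
    my ($seq, $letter) = @_;
    my $n = () = $seq =~ /\Q$letter\E/g;    # Match list forced into a count
    return $n;
}

my $pea = 'MALSSTAATTSSKLKLSNPPSLSHTFTASASASVSNSTSFR';
my $soy = 'MAVSSATASSKLILLPHASSSSSLNSTPFRSSTTNTHKLTPADSTHNIKL';

# Expected values are the ones given in the question.
is(length($pea),            41, 'EFTU_PEA length');
is(count_letter($pea, 'A'), 6,  'EFTU_PEA A count');
is(length($soy),            50, 'EFTU1_SOYBN length');
is(count_letter($soy, 'A'), 5,  'EFTU1_SOYBN A count');
```

Starting from known answers like these makes it easier to notice an off-by-one error, for example a newline being counted as part of the sequence.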