Getting unique line counts between lines starting with '>'

james.v has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I am trying to alter this script:

perl -lne '/>/ && do {print $c if defined $c; $c = 0; print} || $c++; END {print $c}' input_file > output_file

I want it to count only the unique lines between lines starting with '>'.

Currently, the script correctly counts all lines between lines that begin with '>', duplicates included.

Example input_file:

>05143_African_trypanosomiasis
TRINITY_DN26760_c1_g1
18169
42987
42987
>05145_Toxoplasmosis
43736
38319
38320
38320
TRINITY_DN24151_c3_g1
TRINITY_DN25493_c0_g1

Example of output_file:

>05143_African_trypanosomiasis
4
>05145_Toxoplasmosis
6

Example of desired output:

>05143_African_trypanosomiasis
3
>05145_Toxoplasmosis
5

I'm also very new to Perl, so an explanation of the tweaked code would be much appreciated.

-James


Re: Getting unique line counts between lines starting with '>' by choroba (Cardinal) on Oct 19, 2017 at 21:34 UTC
Use a hash of the lines; the keys of a hash are always unique.

`perl -lne 'sub out {print $h, "\n", scalar keys %c if %c } />/ and do { out(); %c =(); $h = $_ } or $c{$_}++; END { out() }' < input-file`

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
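Since an explanation was requested, here is a rough long-form sketch of the same hash idea, not choroba's actual code. The names `unique_counts`, `%seen`, `$header`, and `$flush` are made up for illustration; they play the roles of `%c`, `$h`, and `out()` in the one-liner. It also assumes headers start with '>' (`/^>/`), which is slightly stricter than the one-liner's `/>/`.

```perl
use strict;
use warnings;

# Hypothetical helper: takes chomped input lines and returns the output
# lines, i.e. each '>' header followed by its count of distinct lines.
sub unique_counts {
    my @lines = @_;
    my (@out, %seen);
    my $header = '';    # most recent '>' line ($h in the one-liner)

    # Plays the role of out(): emit the finished section, then reset.
    # Like `if %c`, it emits nothing for a section with no lines.
    my $flush = sub {
        push @out, $header, scalar(keys %seen) if %seen;
        %seen = ();
    };

    for my $line (@lines) {
        if ($line =~ /^>/) {    # a new header closes the previous section
            $flush->();
            $header = $line;
        }
        else {
            $seen{$line}++;     # a repeated line reuses the same hash key
        }
    }
    $flush->();                 # same job as END { out() }
    return @out;
}

# The example input from the question.
my @sample = qw(
    >05143_African_trypanosomiasis
    TRINITY_DN26760_c1_g1 18169 42987 42987
    >05145_Toxoplasmosis
    43736 38319 38320 38320 TRINITY_DN24151_c3_g1 TRINITY_DN25493_c0_g1
);
print "$_\n" for unique_counts(@sample);
```

Because a hash holds each key only once, counting the keys (`scalar keys`) gives the number of distinct lines, however often each one appeared. On the sample above this should give 3 and 5, matching the desired output.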
Re^2: Getting unique line counts between lines starting with '>' by james.v (Initiate) on Oct 20, 2017 at 00:17 UTC
A quick question: is there a way to print the keys of the %c hash as a comma-delimited list? For example:

>05143_African_trypanosomiasis
3
TRINITY_DN26760_c1_g1, 18169, 42987
>05145_Toxoplasmosis
5
43736, 38319, 38320, TRINITY_DN24151_c3_g1, TRINITY_DN25493_c0_g1

Best,
James
Re^3: Getting unique line counts between lines starting with '>' by NetWallah (Canon) on Oct 20, 2017 at 02:53 UTC
Update: Corrected for comma-separated numbers.

`$ perl -lne 'sub prt{@c && print scalar @c,"\n",join ", ",@c;@c=();print} />/?prt:push @c,$_}{prt' TheFileName`

Output:

>05143_African_trypanosomiasis
4
TRINITY_DN26760_c1_g1, 18169, 42987, 42987
>05145_Toxoplasmosis
7
43736, 38319, 38320, 38320, TRINITY_DN24151_c3_g1, TRINITY_DN25493_c0_g1

For info on "}{", see "Eskimo greeting" in perlsecret.

All power corrupts, but we need electricity.
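The one-liner above collects every line in an array, so repeated values such as 42987 stay in the list. If the goal is the de-duplicated list from the question, one possible sketch pairs a hash (to detect repeats) with an array (to remember first-seen order, since `keys` returns hash keys in no particular order). The names `unique_lists`, `@order`, `%seen`, and `$flush` are hypothetical, and the `/^>/` header test is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: for each '>' section, return the header, the number
# of distinct lines, and those lines joined with ", " in first-seen order.
sub unique_lists {
    my @lines = @_;
    my (@out, @order, %seen);
    my $header = '';

    # Emit the finished section (if it has any lines), then reset.
    my $flush = sub {
        push @out, $header, scalar(@order), join(', ', @order) if @order;
        @order = ();
        %seen  = ();
    };

    for my $line (@lines) {
        if ($line =~ /^>/) {
            $flush->();
            $header = $line;
        }
        elsif (!$seen{$line}++) {   # true only the first time a line is seen
            push @order, $line;
        }
    }
    $flush->();
    return @out;
}

# The example input from the question.
my @input = qw(
    >05143_African_trypanosomiasis
    TRINITY_DN26760_c1_g1 18169 42987 42987
    >05145_Toxoplasmosis
    43736 38319 38320 38320 TRINITY_DN24151_c3_g1 TRINITY_DN25493_c0_g1
);
print "$_\n" for unique_lists(@input);
```

The `!$seen{$line}++` test is a common Perl idiom: the post-increment returns the old count, which is 0 (false) only on the first occurrence, so each distinct line is pushed exactly once.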
Re^4: Getting unique line counts between lines starting with '>' by Anonymous Monk on Oct 20, 2017 at 04:45 UTC
Re^2: Getting unique line counts between lines starting with '>' by james.v (Initiate) on Oct 19, 2017 at 21:51 UTC
Thank you for the help, Choroba. Works like a charm!

Best,
James