count the occurrence of second line of a paragraph in a file

umaykulsum has asked for the wisdom of the Perl Monks concerning the following question:

I have multiple files like data.txt:

@NS500278:42:HC7M3AFXX:1:11101:16723:1045 1:N:0:AACGTGAT
AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTCTGCTTGAAAA
+AAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+GGGGGGGGGGGG
+
AAAAA#EEEEEEEEEEEEEEEE6EEEEEAEEEAE/AEEEEEEEAE<EEEEA<A/AE<EE/EEAEEAEEAE
+EEEA///EEEEEEEEEAEEEEEEEEEEEEEEEEEEEE/EEEAEEEAEEEEEEEEEAEAEEEEEEEEEEE
+EAEEEEEAEEAA

@NS500278:42:HC7M3AFXX:1:11101:20279:1046 1:N:0:AACGTGAT
TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGGATGGCGCTGT
+TAATCGCAGCAATGGTGTATCCGCAGGGGATTTTTCCGGTACTGGCAGCGTCCGGCGTTTGGGTAGAGA
+TCGGAAGAGCAC
+
AAAAA#EEEEEEEEAEEEEEEEEEEEEEEEEEEEEAEEEEEEEE/EEEAE6AE<EAEEAEAAEEAEEEEE
+EEAE/EEAEEAEEE6EEEEEAE6A/E<EEEEEEEEAE<EEEEEA/AEEAAEEEEEE//AEE/<<<EEAE
+<66/</AE<<A6

@NS500278:42:HC7M3AFXX:1:11101:18609:1046 1:N:0:AACGTGAT
TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGGATGGCGCTGT
+TAATCGCAGCAATGGTGTATCCGCAGGGGATTTTTCCGGTACTGGCAGCGTCCGGCGTTTGGGTAGAGA
+TCGGAAGAGCAC
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEAEEEAEEEEAEEAEEEE
+AEEEA//EEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
+EEAAAEAEEEEA

I want to print the whole paragraph (four lines) followed by the number of times its second line occurs in the file. The output for the above file should be:

@NS500278:42:HC7M3AFXX:1:11101:16723:1045 1:N:0:AACGTGAT
AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTCTGCTTGAAAA
+AAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+GGGGGGGGGGGG
+
AAAAA#EEEEEEEEEEEEEEEE6EEEEEAEEEAE/AEEEEEEEAE<EEEEA<A/AE<EE/EEAEEAEEAE
+EEEA///EEEEEEEEEAEEEEEEEEEEEEEEEEEEEE/EEEAEEEAEEEEEEEEEAEAEEEEEEEEEEE
+EAEEEEEAEEAA 1

@NS500278:42:HC7M3AFXX:1:11101:20279:1046 1:N:0:AACGTGAT
TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGGATGGCGCTGT
+TAATCGCAGCAATGGTGTATCCGCAGGGGATTTTTCCGGTACTGGCAGCGTCCGGCGTTTGGGTAGAGA
+TCGGAAGAGCAC
+
AAAAA#EEEEEEEEAEEEEEEEEEEEEEEEEEEEEAEEEEEEEE/EEEAE6AE<EAEEAEAAEEAEEEEE
+EEAE/EEAEEAEEE6EEEEEAE6A/E<EEEEEEEEAE<EEEEEA/AEEAAEEEEEE//AEE/<<<EEAE
+<66/</AE<<A6 2

The code which I have written is:

use strict;
use warnings;

my @files = ('data.txt');
for my $input_file (@files) {
    my $output_file = $input_file . ".out";

    process_file($input_file, $output_file);
}

sub process_file {
    my ($input_file, $output_file) = @_;
    my %count;

    my $file = shift or die "Usage: $0 FILE\n";
    open my $fh, '<', $file or die "Could not open '$file' $!";
    open my $fa, '>', $output_file;
    $/ = "";
    while (my $line = <$fh>) {
        foreach my $str ($line) {
            chomp $line;
            $count{$str}++;
        }
    }

    foreach my $str (sort keys %count) {
        printf $fa "%-s %s", $str."\t", $count{$str};
        print $fa ":".$input_file."\n";
    }
}

This counts occurrences of the whole paragraph instead of the second line, because it matches on the whole paragraph. I want to print the whole paragraph but count only its second line.


Replies are listed 'Best First'.
Re: count the occurrence of second line of a paragraph in a file by choroba (Cardinal) on Apr 29, 2016 at 07:12 UTC
By setting `$/` to `""`, you're reading whole blocks, not lines, from the input. The name of the `$line` variable is therefore misleading. Moreover, this doesn't do what you wanted:

    foreach my $str ($line)

`$line` is a single thing (a scalar); you need to split it into lines manually. I'd create a hash of hashes keyed by the counted string, each inner hash containing the whole block to print and the count:

    while (my $block = <$fh>) {
        chomp $block;
        my @lines = split /\n/, $block;
        unless ($seen{ $lines[1] }{count}++) {
            $seen{ $lines[1] }{block} = $block;
        }
    }
    for my $str (sort keys %seen) {
        printf {$fa} "%-s %s", @{ $seen{$str} }{qw{ block count }};
        print {$fa} ":".$input_file."\n";
    }
    }

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
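A minimal, self-contained sketch of the hash-of-hashes idea from this reply. It is an illustration under assumptions, not the poster's exact code: the data is made up and simplified (`@read1` to `@read3`, `AAAA`/`ABAA`), an in-memory filehandle stands in for `data.txt`, and the sub name `count_second_lines` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical simplified input: blank-line-separated blocks of four lines,
# where the second line of each block is the string to count.
my $data = <<'END';
@read1
AAAA
+
EEEE

@read2
ABAA
+
EEEE

@read3
ABAA
+
EEAE
END

# Returns a reference to a hash of hashes keyed by the second line:
# { block => first block seen with that second line, count => occurrences }.
sub count_second_lines {
    my ($fh) = @_;
    local $/ = "";    # paragraph mode, restored when the sub returns
    my %seen;
    while (my $block = <$fh>) {
        chomp $block;    # in paragraph mode this strips all trailing newlines
        my @lines = split /\n/, $block;
        next unless defined $lines[1];    # skip blocks without a second line
        # Remember only the first block per second line, count every one.
        $seen{ $lines[1] }{block} = $block unless $seen{ $lines[1] }{count}++;
    }
    return \%seen;
}

open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $seen = count_second_lines($fh);
close $fh or die "Cannot close in-memory file: $!";

# Print each remembered block followed by the count of its second line.
for my $str (sort keys %$seen) {
    print "$seen->{$str}{block} $seen->{$str}{count}\n\n";
}
```

With this sample data the `@read1` block is printed with a count of 1 and the `@read2` block with a count of 2; the `@read3` block only contributes to the count. Using `local $/` keeps the change to the record separator from leaking into the rest of the program.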
Re: count the occurrence of second line of a paragraph in a file by Discipulus (Canon) on Apr 29, 2016 at 07:02 UTC
Welcome to the monastery, umaykulsum. You put some effort into formatting your question and showing your attempt, but if I can ask for more, try to show some simplified data; in fact, having `AAAA` and `ABAA` instead of 160-character lines is the same problem.

That said, I notice a first error: for each file you read, you also reopen the output file using the mode `>`, which will overwrite the file each time. Put the opening of the output file outside of the loop. Also, Perl is smart enough to close filehandles for you (when they go out of scope), but it is better to check `open` and to `close` filehandles explicitly.

Second, you are setting `$/` to the empty string, which enables the so-called paragraph mode (slurp mode would be `undef`). Doing so changes what Perl sees as a line, which becomes different from what you call a line.

Finally, perhaps I don't understand your question clearly: why do you want four lines to be printed? Why must the paragraphs whose second line starts with `TACAG` be associated with the (header?) line that contains `20279` and not with the one containing `18609`?

L*

There are no rules, there are no thumbs.. Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
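For reference, a small sketch of how the value of `$/` changes what a "line" is. The sample data and the helper name `records_with` are made up for illustration; the opens and closes are checked explicitly, as suggested above.

```perl
use strict;
use warnings;

# Hypothetical sample: two blocks of two lines, separated by a blank line.
my $data = "a1\na2\n\nb1\nb2\n";

# Read every record from an in-memory copy of $data using the given
# input record separator, checking both open and close.
sub records_with {
    my ($sep) = @_;
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    local $/ = $sep;    # restored automatically when the sub returns
    my @records = <$fh>;
    close $fh or die "Cannot close in-memory file: $!";
    return @records;
}

my @lines      = records_with("\n");     # default: one record per line
my @paragraphs = records_with("");       # paragraph mode: one record per block
my @slurped    = records_with(undef);    # slurp mode: the whole input at once

printf "line mode: %d, paragraph mode: %d, slurp mode: %d\n",
    scalar @lines, scalar @paragraphs, scalar @slurped;
# prints "line mode: 5, paragraph mode: 2, slurp mode: 1"
```

The blank separator line counts as a record of its own in line mode, which is why five records are read there.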
Re^2: count the occurrence of second line of a paragraph in a file by choroba (Cardinal) on Apr 29, 2016 at 07:15 UTC
> which will overwrite the file each time

It's a different file every time (and, as the code is written now, one single time).
Re^3: count the occurrence of second line of a paragraph in a file by Discipulus (Canon) on Apr 29, 2016 at 07:24 UTC
Ah, right! You are correct!

L*