umaykulsum has asked for the wisdom of the Perl Monks concerning the following question:

I have multiple files like data.txt:
@NS500278:42:HC7M3AFXX:1:11101:16723:1045 1:N:0:AACGTGAT AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTCTGCTTGAAAA +AAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG +GGGGGGGGGGGG + AAAAA#EEEEEEEEEEEEEEEE6EEEEEAEEEAE/AEEEEEEEAE<EEEEA<A/AE<EE/EEAEEAEEAE +EEEA///EEEEEEEEEAEEEEEEEEEEEEEEEEEEEE/EEEAEEEAEEEEEEEEEAEAEEEEEEEEEEE +EAEEEEEAEEAA @NS500278:42:HC7M3AFXX:1:11101:20279:1046 1:N:0:AACGTGAT TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGGATGGCGCTGT +TAATCGCAGCAATGGTGTATCCGCAGGGGATTTTTCCGGTACTGGCAGCGTCCGGCGTTTGGGTAGAGA +TCGGAAGAGCAC + AAAAA#EEEEEEEEAEEEEEEEEEEEEEEEEEEEEAEEEEEEEE/EEEAE6AE<EAEEAEAAEEAEEEEE +EEAE/EEAEEAEEE6EEEEEAE6A/E<EEEEEEEEAE<EEEEEA/AEEAAEEEEEE//AEE/<<<EEAE +<66/</AE<<A6 @NS500278:42:HC7M3AFXX:1:11101:18609:1046 1:N:0:AACGTGAT TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGGATGGCGCTGT +TAATCGCAGCAATGGTGTATCCGCAGGGGATTTTTCCGGTACTGGCAGCGTCCGGCGTTTGGGTAGAGA +TCGGAAGAGCAC + AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEAEEEAEEEEAEEAEEEE +AEEEA//EEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE +EEAAAEAEEEEA
I want to print each whole paragraph (four lines) once, followed by the number of times its second line occurs in the file. The output for the above file should be:
@NS500278:42:HC7M3AFXX:1:11101:16723:1045 1:N:0:AACGTGAT AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTCTGCTTGAAAA +AAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG +GGGGGGGGGGGG + AAAAA#EEEEEEEEEEEEEEEE6EEEEEAEEEAE/AEEEEEEEAE<EEEEA<A/AE<EE/EEAEEAEEAE +EEEA///EEEEEEEEEAEEEEEEEEEEEEEEEEEEEE/EEEAEEEAEEEEEEEEEAEAEEEEEEEEEEE +EAEEEEEAEEAA 1 @NS500278:42:HC7M3AFXX:1:11101:20279:1046 1:N:0:AACGTGAT TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGGATGGCGCTGT +TAATCGCAGCAATGGTGTATCCGCAGGGGATTTTTCCGGTACTGGCAGCGTCCGGCGTTTGGGTAGAGA +TCGGAAGAGCAC + AAAAA#EEEEEEEEAEEEEEEEEEEEEEEEEEEEEAEEEEEEEE/EEEAE6AE<EAEEAEAAEEAEEEEE +EEAE/EEAEEAEEE6EEEEEAE6A/E<EEEEEEEEAE<EEEEEA/AEEAAEEEEEE//AEE/<<<EEAE +<66/</AE<<A6 2
The code which I have written is:
use strict;
use warnings;

my @files=('data.txt');
for my $input_file (@files) {
    my $output_file = $input_file.".out";
    process_file($input_file, $output_file);
}

sub process_file {
    my($input_file, $output_file) = @_;
    my %count;
    my $file = shift or die "Usage: $0 FILE\n";
    open my $fh, '<', $file or die "Could not open '$file' $!";
    open my $fa, '>', $output_file;
    $/="";
    while (my $line = <$fh>) {
        foreach my $str ($line) {
            chomp $line;
            $count{$str}++;
        }
    }
    foreach my $str (sort keys %count) {
        printf $fa "%-s %s", $str."\t", $count{$str};
        print $fa ":".$input_file."\n";
    }
}
This counts occurrences of the whole paragraph instead of the second line. I want to print the whole paragraph, but count only its second line.

Re: count the occurrence of second line of a paragraph in a file
by choroba (Cardinal) on Apr 29, 2016 at 07:12 UTC
    By setting $/ to "" , you're reading whole blocks, not lines from the input. The name of the $line variable is therefore misleading. Moreover, this doesn't do what you wanted:
    foreach my $str ($line)

    $line is a single thing (a scalar), so the loop runs exactly once per block; you need to split the block into lines manually.
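    As a rough illustration of that splitting step, here is a minimal sketch. The helper name block_lines and the shortened sample data are made up for the example; they are not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: return the individual lines of one paragraph-mode record.
sub block_lines {
    my ($block) = @_;
    $block =~ s/\n+\z//;          # drop the trailing newline(s) left on the record
    return split /\n/, $block;
}

# Made-up sample data: two four-line records separated by a blank line.
my $data = "\@r1\nAAAA\n+\nEEEE\n\n\@r2\nABAA\n+\nEEEE\n";

open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
{
    local $/ = "";                # paragraph mode, restored when this block ends
    while (my $block = <$fh>) {
        my @lines = block_lines($block);
        print "second line: $lines[1]\n";
    }
}
close $fh;
```

    Each value read from the filehandle is still one scalar; only after the split does $lines[1] refer to the second line of the record.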

    I'd create a hash of hashes keyed by the counted string, each inner hash containing the whole block to print and the count:

    while (my $block = <$fh>) {
        chomp $block;
        my @lines = split /\n/, $block;
        unless ($seen{ $lines[1] }{count}++) {
            $seen{ $lines[1] }{block} = $block;
        }
    }
    for my $str (sort keys %seen) {
        printf {$fa} "%-s %s", @{ $seen{$str} }{qw{ block count }};
        print {$fa} ":".$input_file."\n";
    }
    }
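    The fragment above relies on variables declared elsewhere in the sub. A self-contained sketch of the same hash-of-hashes idea might look like the following. The helper name count_second_lines and the shortened sample records are assumptions made for illustration, and the // operator requires Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: read blank-line-separated records from $fh and return
# a hash keyed by each record's second line. Each inner hash holds the first
# block seen with that second line and the number of records sharing it.
sub count_second_lines {
    my ($fh) = @_;
    my %seen;
    local $/ = "";                       # paragraph mode, for this sub only
    while (my $block = <$fh>) {
        chomp $block;                    # in paragraph mode this strips all trailing newlines
        my @lines = split /\n/, $block;
        next if @lines < 2;              # skip records without a second line
        my $entry = $seen{ $lines[1] } //= { block => $block, count => 0 };
        $entry->{count}++;
    }
    return \%seen;
}

# Made-up sample data: the second and third records share the second line ABAA.
my $data = "\@r1\nAAAA\n+\nEEEE\n\n\@r2\nABAA\n+\nEEEA\n\n\@r3\nABAA\n+\nEEEE\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $seen = count_second_lines($fh);
close $fh;

for my $key (sort keys %$seen) {
    print "$seen->{$key}{block}\n$seen->{$key}{count}\n\n";
}
```

    With this sample, the record headed by @r2 is kept as the representative block for ABAA with a count of 2, and @r3 is only counted, which mirrors the expected output in the question.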

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: count the occurrence of second line of a paragraph in a file
by Discipulus (Canon) on Apr 29, 2016 at 07:02 UTC
    welcome to the monastery umaykulsum,

    You put some effort into formatting your question and showing your attempt, but if I can ask for more, try to show some simplified data; in fact, having AAAA and ABAA instead of 160-character lines poses the same problem.

    That said, I notice a first error: for each file you read, you are also reopening the output file using the mode > which will overwrite the file each time. Put the opening of the output file outside of the loop. Also, Perl is smart enough to close filehandles for you (when they go out of scope), but it is better to check open and close on filehandles explicitly. Second, you are setting $/ to the empty string, which enables the so-called paragraph mode (slurp mode is what you get when $/ is undefined). Doing so changes what Perl sees as a line, which then differs from what you call a line.
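    To make the effect of $/ concrete, here is a small sketch that counts how many records Perl reads from the same text under different settings, with open and close checked explicitly. The helper name count_records and the sample text are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many records <$fh> yields for a given
# value of $/ when reading $text through an in-memory filehandle.
sub count_records {
    my ($text, $separator) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = $separator;        # localized, so the caller's $/ is untouched
    my $n = 0;
    while (defined(my $record = <$fh>)) {
        $n++;
    }
    close $fh or die "Cannot close in-memory file: $!";
    return $n;
}

# Made-up sample text: two paragraphs of two lines each.
my $text = "a\nb\n\nc\nd\n";
printf "newline separator: %d\n", count_records($text, "\n");   # one record per line
printf "paragraph mode:    %d\n", count_records($text, "");     # one record per paragraph
printf "slurp mode:        %d\n", count_records($text, undef);  # the whole input at once
```

    Using local keeps the change to $/ confined to the sub, so code elsewhere in the program still reads line by line.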

    Finally, perhaps I don't understand your question clearly: why do you want four lines to be printed? Why must the paragraphs whose second line starts with TACAG be associated with the (header?) line that contains 20279 and not with the one containing 18609?

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      > which will overwrite the file each time

      It's a different file every time (and, as the code is written now, one single time).

        ah, right! you are correct!
