Reaped: count total number of occurrence in all files

NodeReaper has asked for the wisdom of the Perl Monks concerning the following question:

Re: count total number of occurrence in all files by Discipulus (Canon) on May 09, 2016 at 06:58 UTC
Hello umaykulsum. As you do not ask for anything specific nor point to particular errors, I can only guess. First, I cannot understand how your data sample can be split using the `\t\s` pattern (ah, OK, you are setting `$/` for this). Second, I do not see where you are opening the files (maybe the script is run under -nl?). If each file contains just 2 lines, you can simply do:

```perl
my $key   = <$filehandle>;    # first line
my $value = <$filehandle>;    # second line
```

You also have no need to iterate over an array to get the count: `@array` in scalar context returns the number of elements. Also, `$#array` contains the index of the last element (so `$#array + 1 == scalar @array`). You also want the name of the file to be preserved somehow: you can read about `$ARGV`, or you can write a sub that, given the filename as an argument, opens the file, processes it, and stores the filename along with the results.

A prayer to you and to ALL BIOINFORMATICIANS around here: can sample data be `AGATC` instead of a line of hundreds of chars? Having readable data as a sample helps my eyes a lot!

L*
There are no rules, there are no thumbs.. Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
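To make the point about counting concrete, here is a minimal sketch. The array contents are made up for illustration; only the scalar-context and `$#array` behaviour comes from the post above.

```perl
use strict;
use warnings;

# Hypothetical sample data, kept short on purpose.
my @array = qw(AGATC ACCTA GGGTA);

my $count      = @array;     # array in scalar context: number of elements
my $last_index = $#array;    # index of the last element

print "count: $count\n";
print "last index: $last_index\n";
print "consistent\n" if $#array + 1 == scalar @array;
```

No loop is needed to get the count; assigning the array to a scalar is enough.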
Re^2: count total number of occurrence in all files by Eily (Monsignor) on May 09, 2016 at 09:35 UTC
Hello Discipulus :). Actually there is no need for -n, because that's what `while (<>) { }` does (-n just adds that loop around your code, as can be seen with -MO=Deparse). It opens each file passed as an argument on the filehandle ARGV, or reads from STDIN if @ARGV == 0.
Re^3: count total number of occurrence in all files by Discipulus (Canon) on May 09, 2016 at 10:00 UTC
Ah! I forgot this part, thanks Eily for the comment.

L*
There are no rules, there are no thumbs.. Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re^2: count total number of occurrence in all files by AnomalousMonk (Archbishop) on May 09, 2016 at 13:08 UTC
> A prayer to you and to ALL BIOINFORMATICIANS around here: can sample data be `AGATC` instead of a line of hundreds of chars? Having readable data as a sample helps my eyes a lot!

Yea and amen to that, gentle brother Discipulus!

Give a man a fish: `<%-{-{-{-<`
Re: count total number of occurrence in all files by Eily (Monsignor) on May 09, 2016 at 09:31 UTC
Hello umaykulsum, one problem I see with your code is that `while (<>)` shifts (removes) the arguments from @ARGV. So when you are done reading, the size of @ARGV is 0. You can save the count before reading: `my $file_count = @ARGV;`

Another issue is that single quotes prevent interpolation. Try `print '\t\s';` and you'll see that you don't get a tab and a space, but the string `\t\s`. ~~You want to use double quotes instead.~~ You either want to split on a literal string, like "\t ", or use a pattern in a regex, like /\t\s/.

You can optimize your code by doing more work in one loop (untested):

```perl
while (<>) {
    chomp;
    my ( $key, $value ) = split /\t\s/, $_;    # Edited thanks to AnomalousMonk
    $compare{$key}{count}++;
    $compare{$key}{sum} += $value;
    # Once a key has been seen in every file, its sum is complete.
    if ( $file_count == $compare{$key}{count} ) {
        print "$key: ", $compare{$key}{sum}, "\n";
    }
}
```

Edit: "\t\s" doesn't work, see AnomalousMonk's answer below :)
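The two points above (save the file count before `<>` empties @ARGV, then sum in a single loop) can be tried out with a self-contained sketch. The sample data and the "key, tab, space, value" line layout are assumptions for illustration, since the original data files are not shown; the temporary files stand in for real command-line arguments.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample data: one "key<TAB><SPACE>value" record per line.
my @samples = (
    "AGATC\t 1\nACCTA\t 3\n",
    "AGATC\t 2\nACCTA\t 5\n",
);

# Write each sample to a temporary file and queue it in @ARGV.
@ARGV = ();
for my $content (@samples) {
    my ( $fh, $name ) = tempfile( UNLINK => 1 );
    print {$fh} $content;
    close $fh or die "close failed: $!";
    push @ARGV, $name;
}

my $file_count = @ARGV;    # save the count before <> shifts @ARGV empty

my ( %compare, @report );
while (<>) {
    chomp;
    my ( $key, $value ) = split /\t\s/, $_;
    $compare{$key}{count}++;
    $compare{$key}{sum} += $value;

    # A key seen once per file is complete; record its total.
    push @report, "$key: $compare{$key}{sum}"
        if $compare{$key}{count} == $file_count;
}

print "$_\n" for @report;
print "files left in \@ARGV: ", scalar(@ARGV), "\n";
```

Note that this counts occurrences, not files: a key that appears twice in one file would reach `$file_count` too early, so the sketch assumes each key occurs at most once per file.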
Re^2: count total number of occurrence in all files by AnomalousMonk (Archbishop) on May 09, 2016 at 13:45 UTC
> Another issue is that single quote prevent interpolation. ... You want to use double quotes instead.

I don't understand the point you're making here. Using double-quote interpolation produces the string `"\ts"`, which doesn't seem to be what the OPer wants to split on at all and which earns you an `"Unrecognized escape ..."` warning into the bargain. Indeed, in your example code, you use `'\t\s'` (a single tab character followed by any single whitespace character) as the `split` pattern. (Personally, I prefer to use `qr//` or `m//` instead. Potayto, potahto.)

```
c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
"my $line = qq{foo\t bar\t\tbaz\t\nboff};
 print qq{>>$line<<};
 dd $line;
 ;;
 my @fields = split '\t\s', $line;
 dd \@fields;
"
>>foo bar baz
boff<<
"foo\t bar\t\tbaz\t\nboff"
["foo", "bar", "baz", "boff"]

c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
"my $line = qq{foo\t bar\t\tbaz\t\nboff};
 print qq{>>$line<<};
 dd $line;
 ;;
 my @fields = split qq{\t\s}, $line;
 dd \@fields;
"
Unrecognized escape \s passed through at -e line 1.
>>foo bar baz
boff<<
"foo\t bar\t\tbaz\t\nboff"
["foo\t bar\t\tbaz\t\nboff"]
```

Give a man a fish: `<%-{-{-{-<`
Re^3: count total number of occurrence in all files by Eily (Monsignor) on May 09, 2016 at 15:35 UTC
You're right, \s doesn't make sense outside a regex. I read it as a single space character, which it actually is not. The fact that I split on '\t\s' is just a copy/paste gone wrong, because I did mean to write "\t\s", which would have been incorrect anyway. I'll edit my post, thanks!
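For anyone comparing the two working alternatives (a regex versus a literal tab plus space), here is a small sketch; the sample line is made up for illustration.

```perl
use strict;
use warnings;

my $line = "foo\t bar\t\tbaz";

# Single quotes do not interpolate: this is backslash, t, backslash, s.
my $raw = '\t\s';
print "length of single-quoted string: ", length($raw), "\n";

# Regex: a tab followed by any one whitespace character.
my @by_regex = split /\t\s/, $line;

# Literal tab followed by a literal space (the string is used as the pattern).
my @by_literal = split "\t ", $line;

print "regex fields:   ", scalar(@by_regex),   "\n";
print "literal fields: ", scalar(@by_literal), "\n";
```

The regex also splits on the double tab between `bar` and `baz`, while the literal tab plus space does not, so the two are only interchangeable when the separator is always exactly a tab and a space.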
Re: count total number of occurrence in all files by BillKSmith (Monsignor) on May 09, 2016 at 15:22 UTC
When I download your data files, I find a title and blank line at the start of each. I removed them. Your processing requires a blank line after every record. This includes the last record of every file. I inserted a blank line at the end of each file which did not have one. With these changes, your code created the hash correctly. Your code does not extract the numeric field from the hash value. With the following change, your code produced the output that you expect.

```perl
# $tot += $val;
$tot += ( split /:/, $val )[0];
```

Bill
Re^2: count total number of occurrence in all files by Anonymous Monk on May 10, 2016 at 05:59 UTC
Thank you everyone. Now the problem is that the script works perfectly well with small files as shown in the example, but when I run it on larger files of 4GB, it does not give the total count; it gives the count of only the first file. Why is this happening?
Re^2: count total number of occurrence in all files by Anonymous Monk on May 10, 2016 at 07:35 UTC
Sorry friends, the problem is that each record in my file consists of 4 lines, but only the second line should be matched to give the total count. For example, take the following two files.

data.txt:

```
@gi
AGATC
+
E/AA# 1

@gi1
ACCTA
+
/66AE 3
```

data1.txt:

```
@gi
AGATC
+
//AA# 2

@gi1
ACCTA
+
#66AE 5
```

The output should be:

```
@gi
AGATC
+
E/AA# 3

@gi1
ACCTA
+
/66AE 8
```

It should sum the second column of both files only if the second line matches. But it is giving this output:

```
@gi
AGATC
+
E/AA# 1

@gi1
ACCTA
+
/66AE 3

@gi
AGATC
+
//AA# 2

@gi1
ACCTA
+
#66AE 5
```

It is comparing all four lines. My script is the same:

```perl
my %compare;
$/ = "";
while (<>) {
    chomp;
    my ( $key, $value ) = split( '\t\s', $_ );
    push( @{ $compare{$key} }, $value );
}
foreach my $key ( sort keys %compare ) {
    my $tot        = 0;
    my $file_count = @ARGV;
    for my $val ( @{ $compare{$key} } ) {
        $tot += ( split /:/, $val )[0];
    }
    if ( @{ $compare{$key} } >= $file_count ) {
        print join( "\t", $key, $tot, @{ $compare{$key} } ), "\n\n";
    }
}
```
Re^3: count total number of occurrence in all files by BillKSmith (Monsignor) on May 10, 2016 at 11:58 UTC
Thank you for accepting Discipulus's advice about posting readable examples. Your new question implies that it is possible for two records to have matching sequences (second line) but different IDs (first line). Neither your example nor your code tells us what output you expect in this case. (If this is not possible, you should match only on the much shorter first line.) I see that your code uses my suggestion for parsing the fourth line. Unfortunately, this does not work for your new data files (there is no colon to split on). It is not likely that we can tell you what is wrong with your code until you post code and data that allow us to reproduce your results.

Bill
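Pending that clarification, one possible reading of the request can be sketched as follows. Everything here is an assumption rather than the poster's confirmed format: records are taken to be exactly four lines with no blank lines between them, the count is taken to be the last whitespace-separated field of the fourth line, records are keyed on the second line only, and when the same sequence appears with a different ID or quality string the first one seen is kept. The in-memory strings stand in for data.txt and data1.txt.

```perl
use strict;
use warnings;

# Hypothetical file contents: four lines per record, count at the end of line four.
my @files = (
    "\@gi\nAGATC\n+\nE/AA# 1\n\@gi1\nACCTA\n+\n/66AE 3\n",
    "\@gi\nAGATC\n+\n//AA# 2\n\@gi1\nACCTA\n+\n#66AE 5\n",
);

my ( %record, @order );
for my $content (@files) {
    open my $fh, '<', \$content or die "cannot open in-memory file: $!";
    while ( defined( my $id = <$fh> ) ) {
        my $seq  = <$fh>;
        my $plus = <$fh>;
        my $qual = <$fh>;
        last unless defined $qual;    # ignore a truncated trailing record
        chomp( $id, $seq, $plus, $qual );

        # Assumption: the count is the final field of the fourth line.
        my ( $quality, $count ) = $qual =~ /^(.*?)\s+(\d+)\s*$/
            or die "cannot parse count from '$qual'";

        # Key on the sequence only; the first ID and quality seen are kept.
        if ( !exists $record{$seq} ) {
            push @order, $seq;
            $record{$seq} = { id => $id, plus => $plus, quality => $quality, sum => 0 };
        }
        $record{$seq}{sum} += $count;
    }
    close $fh;
}

for my $seq (@order) {
    my $r = $record{$seq};
    print join( "\n", $r->{id}, $seq, $r->{plus}, "$r->{quality} $r->{sum}" ), "\n";
}
```

With real files the outer loop would open each name from @ARGV instead of a string reference. For inputs of several gigabytes, the hash holds one entry per distinct sequence, so memory use depends on how many distinct sequences there are.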