Computing the percentage of certain characters in a file

james28909 has asked for the wisdom of the Perl Monks concerning the following question:

Dear esteemed Monks, I have a problem which I cannot figure out. This is also in regard to the whole program/GUI that I am making. (Yes, I know it's taking me forever, but what a learning experience :) )

Anyway, I have made a script that "should" return the percentage of given characters in a file. The file I am trying to count characters in is a binary file that is unpacked with 'H*' for its hex representation; I then search this hex for characters. Here is the code:

use File::Slurp;

my $file = read_file("file.bin");
my $data = unpack( 'H*', $file );

my $count =()= $data =~ /ff/g;
my $size = (stat("file.bin"))[7];

my $dec = $count/$size;
my $percentage = ($dec*100);

print "$count\n";
print "$size\n";
print $percentage;

It is returning an almost correct value of 10.65, but that is a little higher than what the HxD hex editor reports in its statistics, which is 10.43. Also, I am pretty sure my math is right, but I'm not the best mathemagician, lol.

Any help would be appreciated :)

Replies are listed 'Best First'.
Re: Computing the percentage of certain characters in a file by AnomalousMonk (Archbishop) on Aug 04, 2014 at 23:38 UTC
How many `'ff'` character pair sequences are in the string `qq{\x0f\xf0\xff}` after you've unpack-ed the string?

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $file = qq{\x0f\xf0\xff};
     my $data = unpack( 'H*', $file );
     my $count =()= $data =~ /ff/g;
     print $count;
    "
    2

If you want to count the number of `0xff` characters in the raw file, maybe better to concentrate on `0xff`:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $file = qq{\x0f\xf0\xff};
     my $count =()= $file =~ /\xff/g;
     print $count;
    "
    1

Or perhaps better with `tr///` (update: see Quote-Like Operators in perlop):

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $file = qq{\x0f\xf0\xff};
     my $count = $file =~ tr/\xff//;
     print $count;
    "
    1

Update: I haven't checked this, but if you're running under Windows, there may be a problem arising from the fact that Windows uses a `\x0d\x0a` character pair to represent a newline in a file, and this pair may be translated into a single `\n` (newline) character when the file is read, depending on the read mode being used (see binmode). stat will report the number of bytes the operating system sees, i.e., the number before any file-read translation, and this may throw your calculation off a bit versus what the HxD hex editor (whatever that is and however it works) reports.
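To make the point above concrete, here is a minimal sketch (not the original poster's code) that contrasts the hex-string match with a raw byte count, and uses the length of the data actually read as the denominator. The helper names `slurp_raw` and `ff_percentage` are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file as raw bytes.
# binmode stops Windows from translating \x0d\x0a into \n on read.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    local $/;                      # slurp mode
    my $data = <$fh>;
    close $fh;
    return defined $data ? $data : '';
}

# Hypothetical helper: percentage of 0xff bytes in a raw byte string.
sub ff_percentage {
    my ($data) = @_;
    my $size = length $data;       # bytes actually read, not the stat size
    return 0 unless $size;
    my $count = ($data =~ tr/\xff//);   # empty replacement list: count only
    return 100 * $count / $size;
}

my $sample = "\x0f\xf0\xff";
my $hex    = unpack 'H*', $sample;       # "0ff0ff"
my $naive  = () = $hex =~ /ff/g;         # also matches an "ff" that straddles two bytes
my $exact  = ($sample =~ tr/\xff//);     # counts real 0xff bytes only

printf "hex matches: %d, 0xff bytes: %d, percentage: %.2f\n",
    $naive, $exact, ff_percentage($sample);
```

For a real file, something like `ff_percentage( slurp_raw('file.bin') )` should give a figure comparable to a hex editor's statistics, assuming the editor counts raw bytes.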
Re^2: Computing the percentage of certain characters in a file by james28909 (Deacon) on Aug 05, 2014 at 02:24 UTC
Yes, it works great, and actually, it seems HxD rounds to the nearest hundredth. Now I am computing 10.4279696941376 % while HxD is computing 10.43 %. Also, the tr/\xff// method is a lot faster, so thank you for sharing that little bit of info :)
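If the goal is to print the same figure the hex editor shows, rounding for display can be done with `sprintf`. This small sketch uses the value quoted above; that HxD displays two decimal places is an assumption based on the 10.43 it reports.

```perl
use strict;
use warnings;

# Value quoted in the post above.
my $percentage = 10.4279696941376;

# Round to two decimal places for display only; keep the full value for math.
my $rounded = sprintf '%.2f', $percentage;

print "$rounded %\n";
```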
Re^3: Computing the percentage of certain characters in a file by james28909 (Deacon) on Aug 05, 2014 at 05:03 UTC
And also, I want to thank you guys and gals for helping me along my way. I have learned a good bit, but I am far from where I want to be. Thanks again for everyone's help :)
Re^2: Computing the percentage of certain characters in a file by james28909 (Deacon) on Aug 06, 2014 at 00:02 UTC
tr/// works very quickly and is what I need, but after some searching around, I found that tr/// does not interpolate variables, as in "tr/\$_//". So in order to get statistics for the whole file, I am going to have to write out =~ tr/\x00// all the way through =~ tr/\xff//, lol, 256 different instances. Not a big deal, but I could have made what I have now into a subroutine and passed each element of an array (x00 - xff) to tr///.
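For what it's worth, perlop notes that the tr/// table is built at compile time, which is why variables are not interpolated, and that a string `eval` is the way around it. A minimal sketch under that assumption follows; `count_byte` is a hypothetical helper name, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: count one byte value (0..255) in $data.
# The tr/// is compiled at run time by a string eval, so the byte can vary.
sub count_byte {
    my ($data, $byte) = @_;
    die "byte value out of range\n" if $byte < 0 || $byte > 255;
    my $escape = sprintf '\\x%02x', $byte;          # builds text like \xff
    my $count  = eval "\$data =~ tr/$escape//";     # e.g. $data =~ tr/\xff//
    die $@ if $@;
    return $count;
}

my $data = "\x0f\xf0\xff\xff";
printf "0x%02x occurs %d time(s)\n", $_, count_byte($data, $_)
    for 0x0f, 0xf0, 0xff;
```

Each call compiles a fresh tr///, so this trades some of tr's speed for flexibility. Because the byte is always written as a `\xNN` escape, characters such as `/`, `-` or `\` cause no quoting trouble.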
Re^3: Computing the percentage of certain characters in a file by AnomalousMonk (Archbishop) on Aug 06, 2014 at 08:39 UTC
It sounds like this might have been an XY Problem: "How do I count the occurrences of each character in a string/file/etc?" (Caution: The following solution only works for byte characters (i.e., 1 byte == 1 character), not Unicode characters.)

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $s = 'A man, a plan, a canal: Panama!';
     ;;
     my %freq;
     ++$freq{ substr $s, $_, 1 } for 0 .. length($s) - 1;
     ;;
     printf qq{'$_' (0x%02x) == $freq{$_} (%6.3f%%) \n},
       ord, $freq{$_} / length($s) * 100
       for sort { ord($a) <=> ord($b) } keys %freq;
    "
    ' ' (0x20) == 6 (19.355%)
    '!' (0x21) == 1 ( 3.226%)
    ',' (0x2c) == 2 ( 6.452%)
    ':' (0x3a) == 1 ( 3.226%)
    'A' (0x41) == 1 ( 3.226%)
    'P' (0x50) == 1 ( 3.226%)
    'a' (0x61) == 9 (29.032%)
    'c' (0x63) == 1 ( 3.226%)
    'l' (0x6c) == 2 ( 6.452%)
    'm' (0x6d) == 2 ( 6.452%)
    'n' (0x6e) == 4 (12.903%)
    'p' (0x70) == 1 ( 3.226%)
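The same idea can be sketched for binary data by indexing an array by byte value instead of using a hash keyed by character, which also makes it easy to report all 256 values in one pass. This is an alternative formulation, not the code from the reply; `byte_histogram` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an array ref of 256 counts, indexed by byte value.
sub byte_histogram {
    my ($data) = @_;
    my @freq = (0) x 256;
    $freq[$_]++ for unpack 'C*', $data;   # 'C*' yields each byte as an integer 0..255
    return \@freq;
}

my $s    = 'A man, a plan, a canal: Panama!';
my $freq = byte_histogram($s);

# Print only the byte values that actually occur.
for my $byte (grep { $freq->[$_] } 0 .. 255) {
    printf "0x%02x == %d (%6.3f%%)\n",
        $byte, $freq->[$byte], 100 * $freq->[$byte] / length($s);
}
```

Note that `unpack 'C*'` builds a list with one element per byte, so for very large files it may be worth processing the data in chunks.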
Re^4: Computing the percentage of certain characters in a file by james28909 (Deacon) on Aug 07, 2014 at 04:19 UTC
Re^5: Computing the percentage of certain characters in a file by AnomalousMonk (Archbishop) on Aug 07, 2014 at 09:02 UTC