james28909 has asked for the wisdom of the Perl Monks concerning the following question:

Dear esteemed Monks, I have a problem which I cannot figure out. This is also in regards to the whole program/GUI that I am making. (Yes, I know it's taking me forever, but what a learning experience :) )

Anyways, I have made a script that "should" return the percentage of a given character in a file. The file I am trying to count characters in is a binary file that's unpacked with 'H*' to get its hex representation; I then search this hex string for the characters. Here is the code:
use File::Slurp;
my $file = read_file("file.bin");
my $data = unpack( 'H*', $file );
my $count =()= $data =~ /ff/g;
my $size = (stat("file.bin"))[7];
my $dec = $count/$size;
my $percentage = ($dec*100);
print "$count\n";
print "$size\n";
print $percentage;
It is returning an almost correct value of 10.65, but that is a little higher than what the HxD hex editor reports in its statistics, which is 10.43. Also, I am pretty sure my math is right, but I'm not the best mathamagician, lol.

Any help would be appreciated :)

Re: Computing the percentage of certain characters in a file
by AnomalousMonk (Archbishop) on Aug 04, 2014 at 23:38 UTC

    How many  'ff' character pair sequences are in the string  qq{\x0f\xf0\xff} after you've unpack-ed the string?

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $file = qq{\x0f\xf0\xff};
     my $data = unpack( 'H*', $file );
     my $count =()= $data =~ /ff/g;
     print $count;
    "
    2
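
    The extra match comes from an 'ff' that straddles two adjacent bytes (the low nibble of 0x0f plus the high nibble of 0xf0). If one did want to keep working on the hex string, the search would have to be byte-aligned. A minimal sketch of the difference, using the same three bytes as above (the variable names are only illustrative):

```perl
use strict;
use warnings;

my $bytes = "\x0f\xf0\xff";        # same three bytes as in the one-liner above
my $hex   = unpack 'H*', $bytes;   # the string '0ff0ff'

# Unaligned search: also matches the 'ff' that straddles bytes 1 and 2.
my $naive = () = $hex =~ /ff/g;

# Byte-aligned search: \G makes each attempt start where the previous one
# ended, so the hex string is consumed exactly two digits (one byte) at a time.
my $aligned = 0;
while ( $hex =~ /\G(..)/g ) {
    $aligned++ if $1 eq 'ff';
}

print "naive=$naive aligned=$aligned\n";
```

    Only the aligned count corresponds to the number of 0xff bytes in the original data.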

    If you want to count the number of  0xff characters in the raw file, maybe better to concentrate on 0xff:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $file = qq{\x0f\xf0\xff};
     my $count =()= $file =~ /\xff/g;
     print $count;
    "
    1
    Or perhaps better with  tr/// (update: see Quote-Like Operators in perlop):
    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $file = qq{\x0f\xf0\xff};
     my $count = $file =~ tr/\xff//;
     print $count;
    "
    1

    Update: I haven't checked this, but if you're running under Windows, there may be a problem arising from the fact that Windows uses a  \x0d\x0a character pair to represent a newline in a file, and this pair may be translated into a single  \n (newline) character when the file is read, depending on the read mode being used (see binmode). stat will report the number of bytes the operating system sees, i.e., the number before any file-read translation, and this may throw your calculation off a bit versus what the HxD hex editor (whatever that is and however it works) reports.
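
    To rule out newline translation, the file can be read through a raw-mode handle so that the number of bytes read matches what the operating system reports. The following is only a sketch under that assumption: slurp_raw and ff_percentage are hypothetical helper names, not part of the original script, and the path is taken from the command line.

```perl
use strict;
use warnings;

# Hypothetical helper: slurp a file as raw bytes, with no newline translation.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open '$path': $!";
    local $/;                      # undef record separator: read everything at once
    my $bytes = <$fh>;
    close $fh;
    return $bytes;
}

# Hypothetical helper: percentage of 0xff bytes in a byte string.
sub ff_percentage {
    my ($bytes) = @_;
    return 0 unless length $bytes;
    my $count = ( $bytes =~ tr/\xff// );   # tr with an empty replacement only counts
    return 100 * $count / length($bytes);
}

my $path = shift @ARGV;            # e.g. file.bin
if ( defined $path ) {
    my $bytes = slurp_raw($path);
    # With a raw-mode read these two numbers should agree.
    printf "read %d bytes, -s reports %d\n", length($bytes), -s $path;
    printf "%.2f%%\n", ff_percentage($bytes);
}
```

    If the two byte counts differ, some translation layer is still active on the read.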

      Yes, it works great, and it seems HxD simply rounds to the nearest hundredth: I am now computing 10.4279696941376 % while HxD reports 10.43 %.

      Also, the tr/\xff// method is a lot faster, so thank you for sharing that little bit of info :)
        And I also want to thank you guys and gals for helping me along the way. I have learned a good bit, but I am far from where I want to be. Thanks again for everyone's help :)
      tr/// works very quickly and is what I need, but after some searching around, I found it is not possible to use a variable directly with tr///, as in "tr/\$_//". So in order to get statistics for the whole file, I am going to have to write out =~ tr/\x00// all the way through =~ tr/\xff//, lol, 256 different instances. Not a big deal, but I could have made what I have now into a subroutine and passed each element of an array (x00 - xff) to tr///.
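
      For reference, perlop notes that the tr/// table is built at compile time, so variables are not interpolated, and that a string eval is the way around this. A minimal sketch of that workaround (count_byte is a hypothetical helper name; it expects a byte value from 0 to 255):

```perl
use strict;
use warnings;

# Hypothetical helper: count one byte value (0 .. 255) in a byte string
# by compiling a tr/// for that value at run time via string eval.
sub count_byte {
    my ( $bytes, $value ) = @_;
    my $hex   = sprintf '%02x', $value;
    my $count = eval "\$bytes =~ tr/\\x$hex//";   # e.g. $bytes =~ tr/\xff//
    die $@ if $@;
    return $count;
}

my $data = "\x0f\xf0\xff";
printf "0x%02x occurs %d time(s)\n", $_, count_byte( $data, $_ ) for 0x0f, 0xff;
```

      Calling this 256 times recompiles and rescans for every value, so a single pass that tallies all byte values at once is likely the better fit for whole-file statistics.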

        It sounds like this might have been an XY Problem: "How do I count the occurrences of each character in a string/file/etc?" (Caution: The following solution only works for byte characters (i.e., 1 byte == 1 character), not Unicode characters.)

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $s = 'A man, a plan, a canal: Panama!';
         ;;
         my %freq;
         ++$freq{ substr $s, $_, 1 } for 0 .. length($s) - 1;
         ;;
         printf qq{'$_' (0x%02x) == $freq{$_} (%6.3f%%) \n},
           ord, $freq{$_} / length($s) * 100
           for sort { ord($a) <=> ord($b) } keys %freq;
        "
        ' ' (0x20) == 6 (19.355%)
        '!' (0x21) == 1 ( 3.226%)
        ',' (0x2c) == 2 ( 6.452%)
        ':' (0x3a) == 1 ( 3.226%)
        'A' (0x41) == 1 ( 3.226%)
        'P' (0x50) == 1 ( 3.226%)
        'a' (0x61) == 9 (29.032%)
        'c' (0x63) == 1 ( 3.226%)
        'l' (0x6c) == 2 ( 6.452%)
        'm' (0x6d) == 2 ( 6.452%)
        'n' (0x6e) == 4 (12.903%)
        'p' (0x70) == 1 ( 3.226%)
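
        The same idea can be applied to binary data by tallying byte values rather than characters. This is only a sketch, assuming the data is a plain byte string (for example, a file read in raw mode); byte_histogram is a hypothetical helper name, and the four-byte sample stands in for real file contents.

```perl
use strict;
use warnings;

# Hypothetical helper: count occurrences of every byte value (0x00 .. 0xff)
# in one pass. Returns a reference to a 256-element array of counts.
sub byte_histogram {
    my ($bytes) = @_;
    my @count = (0) x 256;
    $count[$_]++ for unpack 'C*', $bytes;   # 'C*' yields each byte as 0 .. 255
    return \@count;
}

my $bytes = "\x0f\xf0\xff\xff";             # stand-in for raw file contents
my $hist  = byte_histogram($bytes);

# Report only the byte values that actually occur.
for my $value ( grep { $hist->[$_] } 0 .. 255 ) {
    printf "0x%02x == %d (%6.3f%%)\n",
        $value, $hist->[$value], 100 * $hist->[$value] / length($bytes);
}
```

        For very large files the list produced by unpack can use a lot of memory, so reading and tallying in fixed-size chunks may be preferable.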