
It sounds like this might have been an XY Problem, with the real question being: "How do I count the occurrences of each character in a string/file/etc?" (Caution: The following solution only works for single-byte characters (i.e., 1 byte == 1 character), not multi-byte Unicode characters.)

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $s = 'A man, a plan, a canal: Panama!';
     ;;
     my %freq;
     ++$freq{ substr $s, $_, 1 } for 0 .. length($s) - 1;
     ;;
     printf qq{'$_' (0x%02x) == $freq{$_} (%6.3f%%) \n},
       ord, $freq{$_} / length($s) * 100
       for sort { ord($a) <=> ord($b) } keys %freq;
    "
    ' ' (0x20) == 6 (19.355%)
    '!' (0x21) == 1 ( 3.226%)
    ',' (0x2c) == 2 ( 6.452%)
    ':' (0x3a) == 1 ( 3.226%)
    'A' (0x41) == 1 ( 3.226%)
    'P' (0x50) == 1 ( 3.226%)
    'a' (0x61) == 9 (29.032%)
    'c' (0x63) == 1 ( 3.226%)
    'l' (0x6c) == 2 ( 6.452%)
    'm' (0x6d) == 2 ( 6.452%)
    'n' (0x6e) == 4 (12.903%)
    'p' (0x70) == 1 ( 3.226%)
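The one-liner above works on a string literal. As a minimal sketch of the same hash-based tally in script form, assuming the data fits in memory, something like the following might do. The helper names `byte_stats` and `slurp_raw` are hypothetical and not from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: tally each byte of a string and return
# { char => { count => N, percent => P } }.
sub byte_stats {
    my ($data) = @_;
    my $total = length $data;

    my %freq;
    $freq{$_}++ for split //, $data;

    my %stats;
    for my $ch (keys %freq) {
        $stats{$ch} = {
            count   => $freq{$ch},
            percent => 100 * $freq{$ch} / $total,
        };
    }
    return \%stats;
}

# Hypothetical helper: read a whole file as raw bytes (no decoding layers).
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                       # slurp mode
    my $data = <$fh>;
    close $fh;
    return defined $data ? $data : '';
}

# Same sample string as the one-liner; sorted by byte value.
my $stats = byte_stats('A man, a plan, a canal: Panama!');
for my $ch (sort { ord($a) <=> ord($b) } keys %$stats) {
    printf "'%s' (0x%02x) == %d (%6.3f%%)\n",
        $ch, ord $ch, $stats->{$ch}{count}, $stats->{$ch}{percent};
}
```

To apply it to a file, pass the result of `slurp_raw($path)` to `byte_stats`. Opening with `:raw` keeps the "1 byte == 1 character" assumption explicit.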

Re^4: Computing the percentage of certain characters in a file
by james28909 (Deacon) on Aug 07, 2014 at 04:19 UTC
    I want you to know... I wrote out 1853 lines to achieve what you were able to do in 7 lines, lol. And not only that, but on a larger 256 MB file, your code is 3 secs faster (39 secs) than mine (42 secs). I am not sure how HxD is able to do it, but it gets these same statistics on a 256 MB file in 3-4 secs flat.

    But I am going to study this code, because it would have saved me hours if I had known exactly how to do it this way in the first place. And thanks for sharing :)
        my $file = read_file($ARGV[0], { binmode => ':raw' });
        my $x00 = $file =~ tr/\x00//;
        my $size = (stat($ARGV[0]))[7];
        my $dec = $x00/$size;
        my $pc = ($dec*100);
        my $percentage00 = sprintf("%.2f", $pc);
        ..... same code, but tr/\x01// thru tr/\xfe// .....
        my $xff = $file =~ tr/\xff//;
        my $dec = $xff/$size;
        my $pc = ($dec*100);
        my $percentageff = sprintf("%.2f", $pc);
        close($file);
        my $sum = $percentage01+$percentage02+$percentage03+$percentage04+$percentage05+$percentage06+
        $percentage07+$percentage08+$percentage09+$percentage0a+$percentage0b+$percentage0c+
        $percentage0d+$percentage0e+$percentage0f+$percentage10+$percentage11+$percentage12+
        $percentage13+$percentage14+$percentage15+$percentage16+$percentage17+$percentage18+
        $percentage19+$percentage1a+$percentage1b+$percentage1c+$percentage1d+$percentage1e+
        $percentage1f+$percentage20+$percentage21+$percentage22+$percentage23+$percentage24+
        $percentage25+$percentage26+$percentage27+$percentage28+$percentage29+$percentage2a+
        $percentage2b+$percentage2c+$percentage2d+$percentage2e+$percentage2f+$percentage30+
        $percentage31+$percentage32+$percentage33+$percentage34+$percentage35+$percentage36+
        $percentage37+$percentage38+$percentage39+$percentage3a+$percentage3b+$percentage3c+
        $percentage3d+$percentage3e+$percentage3f+$percentage40+$percentage41+$percentage42+
        $percentage43+$percentage44+$percentage45+$percentage46+$percentage47+$percentage48+
        $percentage49+$percentage4a+$percentage4b+$percentage4c+$percentage4d+$percentage4e+
        $percentage4f+$percentage50+$percentage51+$percentage52+$percentage53+$percentage54+
        $percentage55+$percentage56+$percentage57+$percentage58+$percentage59+$percentage5a+
        $percentage5b+$percentage5c+$percentage5d+$percentage5e+$percentage5f+$percentage60+
        $percentage61+$percentage62+$percentage63+$percentage64+$percentage65+$percentage66+
        $percentage67+$percentage68+$percentage69+$percentage6a+$percentage6b+$percentage6c+
        $percentage6d+$percentage6e+$percentage6f+$percentage70+$percentage71+$percentage72+
        $percentage73+$percentage74+$percentage75+$percentage76+$percentage77+$percentage78+
        $percentage79+$percentage7a+$percentage7b+$percentage7c+$percentage7d+$percentage7e+
        $percentage7f+$percentage80+$percentage81+$percentage82+$percentage83+$percentage84+
        $percentage85+$percentage86+$percentage87+$percentage88+$percentage89+$percentage8a+
        $percentage8b+$percentage8c+$percentage8d+$percentage8e+$percentage8f+$percentage90+
        $percentage91+$percentage92+$percentage93+$percentage94+$percentage95+$percentage96+
        $percentage97+$percentage98+$percentage99+$percentage9a+$percentage9b+$percentage9c+
        $percentage9d+$percentage9e+$percentage9f+$percentagea0+$percentagea1+$percentagea2+
        $percentagea3+$percentagea4+$percentagea5+$percentagea6+$percentagea7+$percentagea8+
        $percentagea9+$percentageaa+$percentageab+$percentageac+$percentagead+$percentageae+
        $percentageaf+$percentageb0+$percentageb1+$percentageb2+$percentageb3+$percentageb4+
        $percentageb5+$percentageb6+$percentageb7+$percentageb8+$percentageb9+$percentageba+
        $percentagebb+$percentagebc+$percentagebd+$percentagebe+$percentagebf+$percentagec0+
        $percentagec1+$percentagec2+$percentagec3+$percentagec4+$percentagec5+$percentagec6+
        $percentagec7+$percentagec8+$percentagec9+$percentageca+$percentagecb+$percentagecc+
        $percentagecd+$percentagece+$percentagecf+$percentaged0+$percentaged1+$percentaged2+
        $percentaged3+$percentaged4+$percentaged5+$percentaged6+$percentaged7+$percentaged8+
        $percentaged9+$percentageda+$percentagedb+$percentagedc+$percentagedd+$percentagede+
        $percentagedf+$percentagee0+$percentagee1+$percentagee2+$percentagee3+$percentagee4+
        $percentagee5+$percentagee6+$percentagee7+$percentagee8+$percentagee9+$percentageea+
        $percentageeb+$percentageec+$percentageed+$percentageee+$percentageef+$percentagef0+
        $percentagef1+$percentagef2+$percentagef3+$percentagef4+$percentagef5+$percentagef6+
        $percentagef7+$percentagef8+$percentagef9+$percentagefa+$percentagefb+$percentagefc+
        $percentagefd+$percentagefe;
        print "0x00 percentage: $percentage00\n";
        print "0xFF percentage: $percentageff\n";
        my $average = $sum/254; #divided by 254 because i just wanted to average everything inbetween x00 and xff
        $average = sprintf("%.3f", $average);
        print "0x01 - 0xFE percentage: $average\n\n";
    I tried to use eval with tr///, but it was returning unexpected results, probably because of something I was doing wrong. I did think about how I could do a loop or something, but the only way I found was with s///g, and it took a while longer to accomplish what I was after. Though the script I made works great, yours is a lot shorter and is faster.
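On looping over tr///: the transliteration table is built at compile time, so tr/// does not interpolate variables, and a string eval is the documented way around that. A minimal sketch of such a loop, assuming the data is already in memory (the variable names `$data`, `@count`, and `$code` are illustrative, not from the post above):

```perl
use strict;
use warnings;

my $data = 'A man, a plan, a canal: Panama!';
my $size = length $data;

my @count;
for my $byte (0x00 .. 0xff) {
    # Build the source text  $data =~ tr/\xNN//  and compile it with eval.
    # With an empty replacement list and no /d, tr only counts matches.
    my $code = sprintf '$data =~ tr/\\x%02x//', $byte;
    $count[$byte] = eval $code;
    die "eval failed for '$code': $@" if $@;
}

# Report only the byte values that actually occur.
for my $byte (grep { $count[$_] } 0x00 .. 0xff) {
    printf "0x%02x == %d (%.2f%%)\n",
        $byte, $count[$byte], 100 * $count[$byte] / $size;
}
```

This replaces the 256 hand-written statements with one loop, but it still makes 256 separate passes over the data, so a single pass that increments a counter per byte may well be the better trade-off.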

      Here's another approach that may be a little faster than my previous one (but it will be nowhere near 3 - 4 seconds for 256 MB!).

          c:\@Work\Perl\monks>perl -wMstrict -le
          "my $s = 'A man, a plan, a canal: Panama!';
           ;;
           my @char_counts;
           $#char_counts = 255;
           ++$char_counts[ ord(substr $s, $_, 1) ] for 0 .. length($s) - 1;
           die 'oops...' if $#char_counts != 255;
           ;;
           printf qq{'%s' (0x%02x) == $char_counts[$_] (%6.3f%%) \n},
             chr, $_, ($char_counts[$_] / length $s) * 100
             for grep defined($char_counts[$_]), 0 .. $#char_counts;
          "
          ' ' (0x20) == 6 (19.355%)
          '!' (0x21) == 1 ( 3.226%)
          ',' (0x2c) == 2 ( 6.452%)
          ':' (0x3a) == 1 ( 3.226%)
          'A' (0x41) == 1 ( 3.226%)
          'P' (0x50) == 1 ( 3.226%)
          'a' (0x61) == 9 (29.032%)
          'c' (0x63) == 1 ( 3.226%)
          'l' (0x6c) == 2 ( 6.452%)
          'm' (0x6d) == 2 ( 6.452%)
          'n' (0x6e) == 4 (12.903%)
          'p' (0x70) == 1 ( 3.226%)
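For a large file, the same array-indexed tally can be fed from fixed-size reads instead of one big string. The following is only a sketch under assumptions: the helper name `count_bytes_in_file` and the 1 MiB default chunk size are made up for illustration, and it has not been benchmarked against the one-liner above.

```perl
use strict;
use warnings;

# Hypothetical helper: tally byte values 0..255 from a file, one chunk at a time.
# Returns (\@counts, $total_bytes).
sub count_bytes_in_file {
    my ($path, $chunk_size) = @_;
    $chunk_size ||= 1 << 20;    # 1 MiB per read; an arbitrary choice

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";

    my @counts = (0) x 256;
    my $total  = 0;
    while (1) {
        my $got = read $fh, my $buf, $chunk_size;
        die "Read error on $path: $!" unless defined $got;
        last if $got == 0;      # end of file
        $total += $got;
        # unpack 'C*' turns the chunk into a list of unsigned byte values.
        $counts[$_]++ for unpack 'C*', $buf;
    }
    close $fh;
    return (\@counts, $total);
}

# Usage: perl this_script.pl some_file
if (@ARGV) {
    my ($counts, $total) = count_bytes_in_file($ARGV[0]);
    for my $byte (grep { $counts->[$_] } 0 .. 255) {
        printf "0x%02x == %d (%6.3f%%)\n",
            $byte, $counts->[$byte], 100 * $counts->[$byte] / $total;
    }
}
```

Reading in chunks keeps memory use bounded by the chunk size rather than the file size. Pre-filling the array with zeros also means no `defined` check is needed when printing.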

      ... how HxD is able to do it ...

      ... is by writing the code in C or some such compiled language; at least, I'd be willing to bet dollars to doughnuts that's the case. You, too, can do this with Inline::C! (Update: See also Inline::C::Cookbook.) In fact, the array-based approach in the code example above should, I think, convert very neatly to C. The learning curve for Inline::C is not too bad (assuming you know C!) and well worth the effort if you have a need for speed! (I need to brush up on Inline::C myself, so if I have some time later, I may play around with this.)

        Actually sir, on a file that is 239 MB (251,396,096 bytes), this script took 2.76316 seconds to execute, so this is actually faster than HxD <.<

        EDIT: nvm i was calling the wrong file lol