file/language entropy calculator

by wufnik (Friar)
on Dec 09, 2004 at 12:24 UTC ( [id://413493]=CUFP )

determine whether it really is monkeys at typewriters you are dealing with, or not. supply a file on the command line; the script calculates the character entropy of that file. nothing fancy. benchmark this against ent, the c program, or something fancier.
# determine char entropy from file. by wufnik.
my $fil = shift;
open (FIL, $fil) || die ("no file for entropy calc, $!");
my $filstr = do { local $/; <FIL> };
close FIL;
my @chars = split //, $filstr;
my (%charhash, $charstot);
map { $charhash{$_}++; $charstot++ } @chars;
my @values = map { $_ / $charstot } values %charhash;
my $ent = entropy(\@values);
printf ("file %s\ncontents entropy = %20.15f\n",$fil,$ent);

sub entropy{
    my ($listr, $baselog) = @_;
    $baselog = 0.693147180559945 unless $baselog; # log(2)
    return undef unless ref $listr;
    my $sum;
    my @nums = @$listr;
    map { $sum += $_ * (log($_)/$baselog) } @nums;
    return -$sum;
}
(edited to adopt sensible precision in printf as pointed out by graff)
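
For readers who want the quantity spelled out: the script appears to compute the usual Shannon entropy over single characters, in bits, since it divides by log(2). The symbols below are notation assumed here, not taken from the post: n_c is the count of character c (the values of %charhash) and N is the total number of characters ($charstot). A small worked case for the three-character string "aab" is included.

```latex
% Shannon entropy of the character distribution, in bits per character
H = -\sum_{c} p_c \log_2 p_c, \qquad p_c = \frac{n_c}{N}

% Worked example for "aab": p_a = 2/3, p_b = 1/3
H = -\left( \tfrac{2}{3}\log_2\tfrac{2}{3} + \tfrac{1}{3}\log_2\tfrac{1}{3} \right) \approx 0.918
```

A file made of one repeated character gives H = 0, and a file using k distinct characters equally often gives H = log2(k).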

Replies are listed 'Best First'.
Re: file/language entropy calculator
by graff (Chancellor) on Dec 11, 2004 at 17:33 UTC
    <nitpick> Before trying to benchmark it, I think I'd try tightening it up a little to eliminate some waste...
    my $fil = shift;
    open (FIL, $fil) || die ("can't read $fil: $!");
    my $filstr = do { local $/; <FIL> };
    close FIL;
    my %charhash;
    $charhash{$_}++ for ( split //, $filstr );
    my $ent = entropy(\%charhash, length( $filstr ));
    printf ("file %s\ncontents entropy = %30.20f\n",$fil,$ent);

    sub entropy {
        my ($hashref, $total, $baselog) = @_;
        $baselog = 0.693147180559945 unless $baselog; # log(2)
        return undef unless ( ref $hashref and $total > 0 );
        my $sum;
        $sum += $_ * (log($_)/$baselog) for ( map { $_/$total } values %$hashref );
        return -$sum;
    }
    It's nice to avoid "use of map in void context", unnecessary copies of data, and "$counter++" where a simpler method (length()) will do. </nitpick>

    (updated to eliminate the unnecessary "@values" array; then updated a couple more times to fix details associated with removing @values.)
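
Before benchmarking either version, it may be worth checking the arithmetic on inputs whose entropy is known in advance. The following is only a sketch: char_entropy is a hypothetical helper written for this check, not code from the thread, and it works on a string rather than a file so the test cases can be written inline. It uses the same count-then-sum approach as the scripts above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): entropy in bits per character
# of a string, counting characters in a hash as the scripts above do.
sub char_entropy {
    my ($str) = @_;
    my $total = length $str;
    return 0 unless $total;              # assumption: empty input counts as 0 bits

    my %count;
    $count{$_}++ for split //, $str;

    my $h = 0;
    for my $n (values %count) {
        my $p = $n / $total;             # relative frequency of this character
        $h -= $p * log($p) / log(2);     # log() is the natural log; divide for base 2
    }
    return $h;
}

# One symbol, two equally likely symbols, four equally likely symbols.
printf "%-6s %.6f\n", $_, char_entropy($_) for qw(aaaa abab abcd);
```

The three test strings should come out at 0, 1 and 2 bits per character respectively; anything else points to a problem in the counting or in the log base.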
