Re: Count byte/character occurrence (1/4)
by BrowserUk (Patriarch) on Apr 01, 2016 at 09:37 UTC
|
C:\test>dir 1123355.bin
14/04/2015 16:50 16,777,216 1123355.bin
C:\test>1159245 1123355.bin
Took 12.897612 secs
(That was the third run so the cache was primed.)
This version took: C:\test>1159245 1123355.bin
Took 3.832763 secs
: 3762666
☺ : 46120
☻ : 43642
♥ : 44106
♦ : 43878
...
The code; #! perl -slw
use strict;
use Time::HiRes qw[ time ];
my $start = time;
open I, '<:raw', $ARGV[ 0 ];
my @seen;
while( read( I, my $buf, 16384 ) ) {
++$seen[$_] for unpack 'C*', $buf;
}
printf "Took %f secs\n", time() - $start;
printf "%c : %u\n", $_, $seen[$_] for 0 .. 255;
| [reply] [d/l] [select] |
|
| [reply] [d/l] |
|
| [reply] |
|
|
WOW! thats alot faster. I didnt even think about unpack. that code runs in ~2.6 seconds on my machine. Thanks!
| [reply] |
|
Away for pc right now, but would something with a for loop and substr be faster? I dont necessarily have to unpack any bytes, do I?
| [reply] |
|
would something with a for loop and substr be faster? I dont necessarily have to unpack any bytes, do I?
That requires a call into C (substr) for every byte; where using unpack requires a single call for the entire string.
The cardinal rule for optimising Perl code, is to get perl's built-ins to do as much of the work as you can.
Using this loop:
++$seen[ ord chop $buf ] while length $buf;
in place of the unpack loop is almost but not quite as fast.
It trades 2 built-in calls per byte, against the cost of building the unpack return list on the stack, and loses by a hair.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
Re: Count byte/character occurrence (quickly)
by Corion (Patriarch) on Apr 01, 2016 at 06:30 UTC
|
Don't read the file byte by byte, but read it in larger chunks and have an inner loop counting the characters:
open my $file, '<', shift;
binmode($file);
my %seen = ();
$/ = \( 1024 * 1024 ); # Set default buffer size
while ( <$file> ) {
for my $buf (split //, $_) {
if ( !$seen{$buf} ) {
$seen{$buf} = 1;
}
else {
$seen{$buf}++;
}
}
}
print "$_ - $seen{$_}\n" for ( keys %seen );
| [reply] [d/l] |
|
Sadly, when I try to set it to a high(er) buffer size it slows down for some reason. I have an Intel i5 2500k and samsung 840 pro with a biostar tz77xe4 motherboard. It should be reading faster than heck. Is mine (or yours) code some how saving the results of the read instead of throwing them away after reading and adding to seen? I tried to undef $buf as well and that even slowed it down
EDIT: also, i am using perl 5.16.3 for certain reasons.
| [reply] |
Re: Count byte/character occurrence (quickly)
by Tux (Canon) on Apr 01, 2016 at 07:57 UTC
|
$seen{$_}++ for split // => $buf;
Enjoy, Have FUN! H.Merijn
| [reply] [d/l] |
|
This is very good to know, thanks. It doesnt however speed up the script, which is what I am ultimately after. I have been looking at multi-threading, but I am lost on that subject and unsure if I would see that much of a speed up.
| [reply] |
Re: Count byte/character occurrence (quickly)
by marioroy (Parson) on Apr 01, 2016 at 14:17 UTC
|
The following is a parallel demonstration, based on the code by BrowserUk. There is a bug in MCE::Shared 1.001 and the reason for the length check below. MCE 1.704 and MCE 1.002 will be released in ETA ~ 1 week with the fix.
MCE::Flow and MCE::Shared
use strict;
use warnings;
use MCE::Flow;
use MCE::Shared;
use Time::HiRes qw[ time ];
my $start = time;
my $fh = MCE::Shared->handle( "<:raw", $ARGV[ 0 ] );
my @seen;
sub tally {
my ($aref) = @_;
for ( 0 .. 255 ) {
$seen[$_] += $aref->[$_] if $aref->[$_];
}
return;
}
mce_flow { max_workers => 8 }, sub {
my @_seen;
while( read( $fh, my $buf, 16384 * 4 ) ) {
# the length check may be omitted with MCE::Shared 1.002+
last unless length($buf);
++$_seen[$_] for unpack 'C*', $buf;
}
MCE->do('tally', \@_seen);
};
close $fh;
printf "Took %f secs\n", time() - $start;
# for ( 0 .. 255 ) {
# printf "%c : %u\n", $_, $seen[$_] if $seen[$_];
# }
The serial code takes 8.390 seconds. In comparison, the parallel code completes in 2.253 seconds for a 126 MB file on a machine with 4 real cores and 4 hyper-threads.
Update:
The upcoming MCE::Shared 1.002 release will support the following construction by allowing the main or worker process to handle the error. I've been wanting for the shared open call to feel like the native open call.
use MCE::Shared 1.002;
mce_open my $IN, "<:gzip", "wat.paths.gz" or die "open error: $!";
mce_open my $OUT, ">", \*STDOUT or die "open error: $!";
| [reply] [d/l] [select] |
|
The following are parallel demonstrations using MCE::Hobo and threads.
MCE::Hobo and MCE::Shared
A Hobo is a migratory worker inside the machine that carries the
asynchronous gene. Hobos are equipped with threads-like capability
for running code asynchronously. Unlike threads, each hobo is a unique
process to the underlying OS. The IPC is managed by MCE::Shared,
which runs on all the major platforms including Cygwin.
use strict;
use warnings;
use MCE::Hobo;
use MCE::Shared;
use Time::HiRes qw[ time ];
my $start = time;
my $fh = MCE::Shared->handle( "<:raw", $ARGV[ 0 ] );
my $seen = MCE::Shared->array;
sub task {
my @_seen;
while( read( $fh, my $buf, 16384 * 4 ) ) {
# the length check may be omitted with MCE::Shared 1.002+
last unless length($buf);
++$_seen[$_] for unpack 'C*', $buf;
}
for ( 0 .. 255 ) {
$seen->incrby($_, $_seen[$_]) if $_seen[$_];
}
}
MCE::Hobo->create('task') for 1 .. 8;
# do other stuff if desired
$_->join for MCE::Hobo->list;
close $fh;
printf "Took %f secs\n", time() - $start;
# export and destroy the shared array into a local non-shared array
$seen = $seen->destroy;
# for ( 0 .. 255 ) {
# printf "%c : %u\n", $_, $seen->[$_] if $seen->[$_];
# }
threads and MCE::Shared
The code for MCE::Hobo and threads are very similar.
use strict;
use warnings;
use threads;
use MCE::Shared;
use Time::HiRes qw[ time ];
my $start = time;
my $fh = MCE::Shared->handle( "<:raw", $ARGV[ 0 ] );
my $seen = MCE::Shared->array;
sub task {
my @_seen;
while( read( $fh, my $buf, 16384 * 4 ) ) {
# the length check may be omitted with MCE::Shared 1.002+
last unless length($buf);
++$_seen[$_] for unpack 'C*', $buf;
}
for ( 0 .. 255 ) {
$seen->incrby($_, $_seen[$_]) if $_seen[$_];
}
}
threads->create('task') for 1 .. 8;
# do other stuff if desired
$_->join for threads->list;
close $fh;
printf "Took %f secs\n", time() - $start;
# export and destroy the shared array into a local non-shared array
$seen = $seen->destroy;
# for ( 0 .. 255 ) {
# printf "%c : %u\n", $_, $seen->[$_] if $seen->[$_];
# }
| [reply] [d/l] [select] |
|
| [reply] |
|
|
|
Re: Count byte/character occurrence (quickly)
by james28909 (Deacon) on Apr 05, 2016 at 19:24 UTC
|
Thanks to all, I ended up doing it in C. I were having problems with MCE, if I dropped two files in one instance of the GUI it would not populated the count, and doing it in C has been the fastest so far. Thanks for everyones input.
| [reply] |