Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Oh wise monks, I am embarrassed, but my problem is this:
use constant EX_TEMPLATE => 'II';

my $length = 8;
my %ValArrayByID = ();

# ... open file handle $fh, etc...

my $packed_data;
while (read($fh, $packed_data, $length)) {
    my ($ID, $Val) = unpack EX_TEMPLATE, $packed_data;
    not defined $ValArrayByID{$ID} and $ValArrayByID{$ID} = [];
    push @{$ValArrayByID{$ID}}, $Val;
}

# ... close file and use my newly minted hash
Above is my example code. The code reads in pairs of unsigned integers from a file and builds a hash of arrays. The file on disk is about 100 MB, and I think reading 8 bytes at a time is killing the I/O. How can I make this speedy?

Replies are listed 'Best First'.
Re: Unpacking small chunks of data quickly
by Fletch (Bishop) on Nov 19, 2007 at 22:54 UTC

    read does a buffered read (analogous to and possibly implemented in terms of the C fread(3) routine), so even though you're asking for only 8 bytes at a time the underlying read(2) system call isn't going to be called each time.

    And probably unrelated speed-wise, but your explicitly checking for and allocating an array ref isn't really necessary. Search for "autovivification"; the short answer is that if you just use it in the right place, Perl will have an arrayref there for you.
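
    A tiny illustration of what that looks like (the hash key and values here are made up):

    use strict;
    use warnings;

    my %ValArrayByID;

    # No need to create the array ref first: dereferencing a missing
    # hash element in lvalue context autovivifies it.
    push @{ $ValArrayByID{273} }, 1234;
    push @{ $ValArrayByID{273} }, 5678;

    print scalar @{ $ValArrayByID{273} }, " values stored\n";   # prints "2 values stored"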

    Not that either of those solves your speed problems . . . :)

    Update: and on a second reading, perhaps it's the fact that you're building a structure out of your 100M file in RAM that's the bottleneck. Consider using a hash-on-disk instead of keeping things all in memory.
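
    One possible (untested) shape of a hash-on-disk, using DB_File and packing each ID's values into a single string; it assumes $fh is the already opened packed-data handle, and the .db file name is just a placeholder:

    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_RDWR O_CREAT);

    # Tie a hash to an on-disk DB file (file name is a placeholder).
    tie my %ValsByID, 'DB_File', 'vals_by_id.db', O_RDWR | O_CREAT, 0666, $DB_HASH
        or die "tie failed: $!";

    # Tied values must be flat strings, so append each packed value to the
    # string stored under its ID instead of pushing onto an array ref.
    while (read($fh, my $packed, 8)) {
        my ($ID, $Val) = unpack 'II', $packed;
        my $so_far = $ValsByID{$ID};
        $ValsByID{$ID} = (defined $so_far ? $so_far : '') . pack 'I', $Val;
    }

    # Later, recover the list for a given ID:
    # my @vals = unpack 'I*', $ValsByID{$some_id};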

      perhaps it's the fact that you're building a structure out of your 100M file in RAM that's the bottleneck

      Quite possible. Data takes much more space as Perl variables than it does in a file, especially since hash keys are strings.

      use strict;
      use warnings;

      use Devel::Size qw( total_size );

      my $file = join '',
          map pack('NN', @$_), (
              [ 273, 1234 ],
              [ 273, 5678 ],
              [ 274, 1234 ],
              [ 275, 5678 ],
              [ 276, 1234 ],
              [ 277, 5678 ],
              [ 278, 1234 ],
              [ 278, 5678 ],
          );

      my %ValArrayByID;
      while ($file =~ /(.{8})/g) {
          my ($ID, $Val) = unpack('NN', $1);
          push @{$ValArrayByID{$ID}}, $Val;
      }

      print("File size: ", length($file), " bytes\n");
      print("Memory usage: ", total_size(\%ValArrayByID), " bytes\n");
      File size: 64 bytes
      Memory usage: 922 bytes
Re: Unpacking small chunks of data quickly
by johngg (Canon) on Nov 19, 2007 at 23:24 UTC
Re: Unpacking small chunks of data quickly
by ikegami (Patriarch) on Nov 19, 2007 at 22:56 UTC

    read is buffered, so reading 8 bytes at a time isn't so bad. You could try reading more at a time, replacing calls to read with calls to substr, and see if that's faster. I don't think it'll help, but only a benchmark will tell.
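
    A rough, untested sketch of that variant (the block size is arbitrary; $fh and %ValArrayByID are from the original post, and error/short-read handling is omitted, as discussed below):

    my $rec_len  = 8;
    my $blk_recs = 4096;      # arbitrary number of records per read
    my $buffer;

    while (my $got = read($fh, $buffer, $rec_len * $blk_recs)) {
        # Walk the buffer one 8-byte record at a time with substr.
        for (my $off = 0; $off + $rec_len <= $got; $off += $rec_len) {
            my ($ID, $Val) = unpack 'II', substr($buffer, $off, $rec_len);
            push @{ $ValArrayByID{$ID} }, $Val;
        }
    }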

    not defined $ValArrayByID{$ID} and $ValArrayByID{$ID} = []; is redundant (and rather unreadable as written) since push already does this.

    You don't handle read returning anything but 0 or $length. It returns undef on error, and it can return a number less than $length if fewer than $length bytes are available at the time.
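
    One way to cover those cases, reusing the names from the original post (the error messages are just examples):

    while (1) {
        my $got = read($fh, $packed_data, $length);
        die "read failed: $!"          unless defined $got;   # error
        last                           if $got == 0;          # clean end of file
        die "short record: $got bytes" if $got != $length;    # truncated record
        my ($ID, $Val) = unpack EX_TEMPLATE, $packed_data;
        push @{ $ValArrayByID{$ID} }, $Val;
    }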

Re: Unpacking small chunks of data quickly
by Anonymous Monk on Nov 19, 2007 at 23:48 UTC
    Thanks all for your help. I was silly to manually vivify the array ref. I tried a few variants, but I cannot squeeze out the performance that I want. Maybe the problem can be solved on the writing end: is there a good (best!) way to store a hash of variable-length arrays of unsigned integers, instead of key/value pairs with redundant keys? -poster

      And what kind of performance are you looking for? (The current method takes 2.5s for 8 MB, so roughly 30s for 100 MB, on a loaded web server.)

      What are you trying to do with the data?
Re: Unpacking small chunks of data quickly
by salva (Canon) on Nov 20, 2007 at 09:26 UTC
    A memory-conservative approach is to sort the data so that pairs with the same ID become consecutive:

    # untested
    use Sort::Key::Radix qw(ikeysort);

    my $packed_data = do { local $/; <$fh> };
    my $last = length($packed_data) / 8 - 1;

    my @sorted_ix = ikeysort {
        unpack I => substr($packed_data, $_ * 8, 4)
    } 0 .. $last;

    # now iterating over @sorted_ix, you get the data sorted by ID:
    for my $ix (@sorted_ix) {
        my ($id, $val) = unpack II => substr($packed_data, $ix * 8, 8);
        print "id: $id, val: $val\n";
    }
Re: Unpacking small chunks of data quickly
by sundialsvc4 (Abbot) on Nov 20, 2007 at 23:56 UTC

    Think of “memory” as being “a disk file,” not semiconductor chips on a circuit board, because in a modern computer system that's what it really is: virtual memory.

    So, if you have "100 megabytes of anything at all," you do not want to build an in-memory hash of it for any reason whatever.

    It appears to me that what you want to have, as the output of your program, is a structure which lists, for each "ID", all of the "VALs" for that ID. Therefore, let me suggest an alternate approach. (Or rather, second what has already been suggested.)

    Sort the file, using an on-disk sort like the sort command. When you do that, first by ID and then by VAL, you know that all of the records having any given ID-value will be adjacent to one another.
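
    For example (untested, and the file names are placeholders), the packed pairs could be dumped to text with a Perl one-liner and then handed to sort, numerically by ID and then by VAL:

    perl -e 'binmode STDIN;
             while (read STDIN, my $r, 8) { printf "%u %u\n", unpack "II", $r }' \
        < pairs.bin | sort -n -k1,1 -k2,2 > sorted_pairs.txt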

    So now, you read the sorted file sequentially, and you remember what the “previous” ID was, to see if it is the same as or different from “this” one. If different, then the end of one group has been reached and the start of a new one has begun. If the same, you have another VAL to be added to the current group. Finally, when the end of the file has been reached, you are, by definition, at the end of the final group.
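
    A bare-bones sketch of that loop, assuming the data has been sorted into text lines of the form "ID VAL" as in the placeholder pipeline above (the output format is just for illustration):

    use strict;
    use warnings;

    open my $sorted, '<', 'sorted_pairs.txt' or die "open: $!";

    my ($prev_id, @group);
    while (my $line = <$sorted>) {
        my ($id, $val) = split ' ', $line;
        if (defined $prev_id && $id != $prev_id) {
            print "$prev_id: @group\n";   # end of one group, start of the next
            @group = ();
        }
        push @group, $val;
        $prev_id = $id;
    }
    print "$prev_id: @group\n" if defined $prev_id;   # the final group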

    Yes, this is literally what those folks were doing with those punched cards and magnetic tapes, all those years ago even before computers were invented.

    A hundred megs? Oh, I'd be very surprised if it took even five seconds, after the sort is through. And the sort won't take long either.

      If you are working within a Windows environment, go ahead and install MinGW or Cygwin so you can gain access to sort and some of the other Unix tools.

      sort is almost ideal: its algorithms have been refined over decades, and most implementations handle special cases such as merging multiple files that are already sorted.

      Although I veer a bit off the post's original course, others have already delved into the area of optimization, so I'd like to drop some references for fellow geeks to uses of Perl in the scientific community, as well as to the Hierarchical Data Format (HDF):

      For large data sets HDF offers many of the advantages sought by this needy monk. HDF is designed for access to ordered and hierarchical sets of data on a large scale, optimized for performance and compatible with large-scale parallelization and distributed data systems.

      HDF is used to store most of NASA's imaging data from satellites, but it finds many other uses, such as optimizing high-speed HTML templating systems (as found in ClearSilver and the associated Data::ClearSilver::HDF).

      There is also a CPAN module, PDL::IO::HDF5, that reads and writes HDF5 files.

      If performance is really a concern, then using an appropriate storage mechanism for the on-disk data is the place to focus. Perl makes it easy to measure this performance using its benchmarking and profiling features. Perl itself can perform surprisingly well even in high-throughput applications if the code is optimized based upon data gathered through profiling; you only have to look at BioPerl to discover plenty of examples.
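
      As a small illustration of the benchmarking part, here is a sketch using the core Benchmark module; the sample buffer and the two unpack strategies compared are arbitrary stand-ins, not the poster's code:

      use strict;
      use warnings;
      use Benchmark qw(cmpthese);

      # A sample in-memory buffer of 10_000 packed (ID, VAL) pairs.
      my $buffer  = pack 'I*', map { ($_ % 100, $_) } 1 .. 10_000;
      my $rec_len = length pack 'II', 0, 0;

      cmpthese(-2, {
          substr_loop => sub {                  # one record at a time
              for (my $off = 0; $off < length $buffer; $off += $rec_len) {
                  my ($id, $val) = unpack 'II', substr($buffer, $off, $rec_len);
              }
          },
          group_unpack => sub {                 # whole buffer in one unpack
              my @flat = unpack '(II)*', $buffer;
          },
      });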

      And for a real diversion, the book Perl for Exploring DNA that came out in July looks fascinating. It probably has a whole slew of ideas for regexes and advanced pattern matching.

      spectre#9 -- "Strictly speaking, there are no enlightened people, there is only enlightened activity." -- Shunryu Suzuki