gerleu has asked for the wisdom of the Perl Monks concerning the following question:
Hello Dear PerlMonks,
I have signed integers which I must save in a file in the most compact format because, later, I must read those values back to retrieve those signed integers...
I know that I can use pack and unpack using C4 as format, but in this case I increase the storage space by 25% ! My signed integer values fit perfectly in only 24 bits.
Do you have any simple or difficult or perlmonked solution because I was unable to find it ? Do you think that converting my signed integers to unsigned (thus all positive values) integers (because I know what will be the greatest negative integer) can help ?
Thanks in advance for your help (or prayers otherwise) !
Germain
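The unsigned idea from the question can be sketched as follows. This is only an illustration under stated assumptions: the bias OFFSET of 2**23 and the helper names pack_biased24 and unpack_biased24 are made up here and do not come from the thread. Adding the bias maps every signed 24-bit value onto 0 .. 2**24-1, which fits in the low three bytes of an unsigned 32-bit integer.

```perl
use strict;
use warnings;

# Assumed bias: maps -2**23 .. 2**23-1 onto 0 .. 2**24-1.
use constant OFFSET => 2**23;

# Hypothetical helper: bias the value, pack as unsigned 32-bit big-endian ('N'),
# then drop the high byte, which is always zero for biased 24-bit values.
sub pack_biased24 {
    my ($n) = @_;
    return substr( pack( 'N', $n + OFFSET ), 1 );
}

# Hypothetical helper: restore the zero high byte, decode, remove the bias.
sub unpack_biased24 {
    my ($s) = @_;
    return unpack( 'N', "\0$s" ) - OFFSET;
}

for my $n ( -8388608, -1, 0, 1, 8388607 ) {
    my $packed = pack_biased24($n);
    printf "%d -> %d bytes -> %d\n", $n, length $packed, unpack_biased24($packed);
}
```

The bias costs one addition and one subtraction per value, and it only works if reader and writer agree on the same constant.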
Re: 3-byte representation (Simplified!)
by BrowserUk (Patriarch) on Oct 12, 2011 at 16:31 UTC
Update: Simplified the packing by removing a redundant step.
Update2: eliminated the map from the packing.
Non-trivial:
#! perl -slw
use strict;
use Data::Dump qw[ pp ];
use Math::Random::MT qw[ rand ];
my @sint24s = map{ -2**23 + int( rand 2**24 ) } 1 .. 20;
## removed duplicated my $packed -- Thanks to AnomalousMonk
my $packed = join '', unpack '(a3x)*', pack 'l*', @sint24s;
my @unpacked = map {
unpack 'l', $_ . chr( vec( $_, 23, 1 ) ? 255 : 0 );
} unpack '(a3)*', $packed;
print "$sint24s[ $_ ] ;; $unpacked[ $_ ]" for 0 .. $#sint24s;
__END__
C:\test>sint24
4243386 ;; 4243386
4809369 ;; 4809369
-888567 ;; -888567
-7576685 ;; -7576685
1987080 ;; 1987080
-2170022 ;; -2170022
-1135866 ;; -1135866
1924446 ;; 1924446
6348263 ;; 6348263
1911716 ;; 1911716
-1791354 ;; -1791354
-8343943 ;; -8343943
-6224088 ;; -6224088
3919567 ;; 3919567
-1176382 ;; -1176382
6288012 ;; 6288012
-5569609 ;; -5569609
-5363232 ;; -5363232
-1344267 ;; -1344267
3649155 ;; 3649155
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
my $packed = join '', unpack '(a3x)*', pack 'l*', @sint24s;
Rather than the above,
my $packed = pack '(l<X)*', @array;
is more concise, probably faster (but this latter not checked).
A shame there seems to be no way to get rid of the final map on the unpacking side.
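The "probably faster" claim can be checked with the core Benchmark module. The sketch below is not from the thread: the sample size of 10,000 values and the variant names are arbitrary assumptions, and 'l<' is used in both variants so that the comparison does not depend on the machine's native byte order. No timings are shown because they vary by machine and Perl build.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Arbitrary sample data for the comparison (assumption: 10,000 random values).
my @values = map { int( rand 2**24 ) - 2**23 } 1 .. 10_000;

# Both variants pack little-endian so their output is comparable everywhere.
my %variant = (
    unpack_join => sub { join '', unpack '(a3x)*', pack 'l<*', @values },
    backup_X    => sub { pack '(l<X)*', @values },
);

# The timing only means something if both variants yield identical bytes.
my $same = $variant{unpack_join}->() eq $variant{backup_X}->();
print "identical output: ", ( $same ? "yes" : "no" ), "\n";

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, \%variant );
```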
>perl -wMstrict -le
"use Test::More 'no_plan';
use List::Util qw(shuffle);
;;
sub rand_s24 { return map { int(rand 2**24) - 2**23 } 1 .. $_[0]; }
;;
use constant MIN_S24 => -(2**23);
use constant MAX_S24 => -MIN_S24() - 1;
use constant VALUES => (0, 1, -1, MIN_S24, MAX_S24);
use constant N_VALUES => scalar(@{[ VALUES ]});
use constant MAX_LEN => 25;
;;
for my $pass (1 .. 200) {
for my $len (1 .. MAX_LEN) {
my @to_pack = (rand_s24($len - N_VALUES), VALUES)[-$len .. -1];
@to_pack = shuffle @to_pack;
$len == @to_pack or die qq{bad len: array to pack};
;;
my $packed = pack '(l<X)*', @to_pack;
length($packed) == 3 * $len or die qq{bad len: packed string};
;;
my @unpacked =
map { unpack('l<', qq{\x00$_})/256 } unpack '(a3)*', $packed;
$len == @unpacked or die qq{bad len: unpacked array};
;;
is_deeply \@unpacked, \@to_pack,
sprintf('pass %d: len %d', $pass, $len);
}
}
"
ok 1 - pass 1: len 1
ok 2 - pass 1: len 2
... (4996 lines elided) ...
ok 4999 - pass 200: len 24
ok 5000 - pass 200: len 25
1..5000
my $packed = pack '(l<X)*', @array; is more concise, probably faster
That's clever, but does have the limitation that you cannot produce a big-endian stream.
Mind, I don't know if there is any hardware that accepts 24-bit BE values.
A shame there seems to be no way to get rid of the final map on the unpacking side.
Agreed. This isn't the first time that I've wished that pack would allow the insertion of values from the template.
I've also wished for a more generic version of vec that allowed arbitrary numbers of bits, rather than just powers of 2.
Re: 3-byte representation
by ikegami (Patriarch) on Oct 12, 2011 at 17:50 UTC
pack: $s = substr(pack('l>', $n), 1);
unpack: $n = unpack('l>', "$s\0")/256;
If you want little-endian byte order:
pack: $s = substr(pack('l<', $n), 0, 3);
unpack: $n = unpack('l<', "\0$s")/256;
PS — Going from 3 to 4 increases the space needed by 33%, not 25%.
Update: Slightly simpler solutions.
Update: The /256 got dropped when I posted my solution. Fixed.
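A round trip over the boundary values shows why the division matters. In this sketch the wrapper names (pack24be and so on) are hypothetical; the pack and unpack expressions are the ones given above. Padding with a zero byte on the low-order side puts the 24-bit sign bit into bit 31, so the 32-bit result is exactly 256 times the original value and the division is exact.

```perl
use strict;
use warnings;

# Hypothetical wrapper names around the expressions from the post above.
sub pack24be   { substr( pack( 'l>', $_[0] ), 1 ) }        # drop the high byte
sub unpack24be { unpack( 'l>', "$_[0]\0" ) / 256 }         # pad low byte, scale down
sub pack24le   { substr( pack( 'l<', $_[0] ), 0, 3 ) }     # drop the high byte
sub unpack24le { unpack( 'l<', "\0$_[0]" ) / 256 }         # pad low byte, scale down

for my $n ( -8388608, -8388607, -2, -1, 0, 1, 2, 8388606, 8388607 ) {
    my $be = unpack24be( pack24be($n) );
    my $le = unpack24le( pack24le($n) );
    printf "%9d  be: %9d  le: %9d\n", $n, $be, $le;
}
```

The '<' and '>' modifiers require Perl 5.10 or later.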
sub pack24{ substr( pack('l>', $_[0]), 1) };;
sub unpack24{ unpack('l>', "$_[0]\0") };;
print "$_: ", unpack24( pack24( $_ ) )
for ( -8388608, -8388607, -2, -1, 0, 1, 2, 8388606, 8388607 );;
-8388608: -2147483648
-8388607: -2147483392
-2: -512
-1: -256
0: 0
1: 256
2: 512
8388606: 2147483136
8388607: 2147483392
Little-endian:
sub pack24{ substr( pack('l<', $_[0]), 1) };;
sub unpack24{ unpack('l<', "$_[0]\0") };;
print "$_: ", unpack24( pack24( $_ ) )
for ( -8388608, -8388607, -2, -1, 0, 1, 2, 8388606, 8388607 );;
-8388608: 16744448
-8388607: 16744448
-2: 16777215
-1: 16777215
0: 0
1: 0
2: 0
8388606: 32767
8388607: 32767
#! perl -slw
use strict;
sub pack24be{ substr( pack('l>', $_[0]), 1) }
sub unpack24be{ unpack('l>', "$_[0]\0") >> 8 }
sub pack24le{ substr( pack('l<', $_[0] ), 0, 3); }
sub unpack24le{ unpack( 'l<', "\0$_[0]" ) >> 8 }
print( "be: $_: ", unpack24be( pack24be( $_ ) ) ),
print( "le: $_: ", unpack24le( pack24le( $_ ) ) )
for ( -8388608, -8388607, -2, -1, 0, 1, 2, 8388606, 8388607 )
__END__
C:\test>junk101 (version 2 )
be: -8388608: -2147483648
le: -8388608: -2147483648
be: -8388607: -2147483392
le: -8388607: -2147483392
be: -2: -512
le: -2: -512
be: -1: -256
le: -1: -256
be: 0: 0
le: 0: 0
be: 1: 256
le: 1: 256
be: 2: 512
le: 2: 512
be: 8388606: 2147483136
le: 8388606: 2147483136
be: 8388607: 2147483392
le: 8388607: 2147483392
C:\test>junk101 (version 3)
be: -8388608: 72057594029539328
le: -8388608: 72057594029539328
be: -8388607: 72057594029539329
le: -8388607: 72057594029539329
be: -2: 72057594037927934
le: -2: 72057594037927934
be: -1: 72057594037927935
le: -1: 72057594037927935
be: 0: 0
le: 0: 0
be: 1: 1
le: 1: 1
be: 2: 2
le: 2: 2
be: 8388606: 8388606
le: 8388606: 8388606
be: 8388607: 8388607
le: 8388607: 8388607
Update: The >> 8 got dropped when I posted my solution. Fixed.
Silently becomes
Update: The /256 got dropped when I posted my solution. Fixed.
Hm. "Got dropped" huh!
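The difference between the two fixes is visible in the "version 3" output above: without "use integer", Perl's >> works on unsigned integers, so a negative operand is shifted as a very large positive number. A minimal sketch of the three options, assuming a perl with 64-bit integers (the variable names are illustrative only):

```perl
use strict;
use warnings;

# -256: the three bytes FF FF FF (that is, -1) placed above a zero low byte.
my $raw = unpack( 'l>', "\xFF\xFF\xFF\0" );

my $by_division  = $raw / 256;                      # exact, since $raw is a multiple of 256
my $by_shift     = $raw >> 8;                       # unsigned shift: huge positive result
my $by_int_shift = do { use integer; $raw >> 8 };   # signed shift under "use integer"

print "division:          $by_division\n";
print "plain >> 8:        $by_shift\n";
print "use integer >> 8:  $by_int_shift\n";
```

Division by 256 avoids the question entirely, which is presumably why the posted solution settled on it.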
Thank you for all your gentle answers, but if I run this:
#!/usr/bin/perl
print "Content-type: text/html\n\n";
srand();
$fil="ch";
open(OUT, '>>'.$fil);
for($i=1;$i<10001;$i++){
$j=int(rand(20000))+440000;
$k=substr(pack('l>',$j),1);
print OUT $k;
}
close(OUT);
print "DONE !";
it produces a file of variable length, but always greater than 30000 characters (for example 30045)!
So I suppose something is wrong in the pack line...
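A likely explanation, assuming the script runs on Windows (or the handle otherwise has a :crlf layer): the pack line is fine, but the file is written in text mode, so every byte 0x0A in the packed data is written as 0x0D 0x0A. For this value range that affects roughly one value in 256, which fits an excess of a few dozen bytes. Opening with '>>' also appends to whatever an earlier run left behind. The sketch below writes to a temporary file via File::Temp (an assumption made so the example does not overwrite anything) and calls binmode before printing.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Temporary output file (assumption; the original script appends to 'ch').
my ( $out, $file ) = tempfile( UNLINK => 1 );

# Binary mode: no newline translation, so every value takes exactly 3 bytes.
binmode($out);

for my $i ( 1 .. 10_000 ) {
    my $j = int( rand(20_000) ) + 440_000;
    print {$out} substr( pack( 'l>', $j ), 1 );
}
close($out) or die "Cannot close $file: $!";

print "size: ", -s $file, " bytes\n";
```

The same applies when reading the file back: the input handle needs binmode as well.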
It was /256. Just misremembered cause I changed comps in the middle.
Re: 3-byte representation
by cavac (Prior) on Oct 12, 2011 at 16:38 UTC
Edit: Oops, my code only works for unsigned integers. I'll leave it here anyway; it might at least come in handy for someone else at a later time...
OK, here's a simple example of how I did it as a test:
#!/usr/bin/perl
use strict;
use warnings;
my $val = 70000;
print "Saving Value: $val\n";
# ENCODE
my @bytes;
for(my $i = 0; $i < 3; $i++) {
$bytes[$i] = $val % 256;
if($val) {
$val = int($val / 256);
}
}
my $savebytes = pack("CCC", @bytes);   # "C" = unsigned char; "c" would wrap bytes above 127
# DECODE
my @newbytes = unpack("CCC", $savebytes);
my $newval = 0;
while(scalar @newbytes) {
my $byte = pop @newbytes;
$newval = $newval * 256;
$newval += $byte;
}
print "Retrieved Value is: $newval\n";
Here's the output:
Saving Value: 70000
Retrieved Value is: 70000
I'm pretty sure there is a better way, but I'm not an expert on all this math stuff...
Edit: If you have a lot of numbers to encode/decode, you might want to consider putting that stuff into a C module or use Inline::C for speed.
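The byte-by-byte approach can be extended to signed values by working on the low 24 bits of the two's-complement representation. This is a sketch, not part of the original post: encode_s24 and decode_s24 are hypothetical names, and the byte order (least significant byte first) follows the loop above.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the low 24 bits and emit them low byte first.
sub encode_s24 {
    my ($val) = @_;
    my $u = $val & 0xFFFFFF;    # two's-complement low 24 bits, also for negatives
    return pack( 'CCC', $u & 0xFF, ( $u >> 8 ) & 0xFF, ( $u >> 16 ) & 0xFF );
}

# Hypothetical helper: reassemble the 24 bits, then sign-extend by hand.
sub decode_s24 {
    my ($bytes) = @_;
    my ( $b0, $b1, $b2 ) = unpack( 'CCC', $bytes );
    my $u = $b0 | ( $b1 << 8 ) | ( $b2 << 16 );
    return $u >= 2**23 ? $u - 2**24 : $u;    # bit 23 set means negative
}

for my $val ( -8388608, -70000, -1, 0, 70000, 8388607 ) {
    print "Saving Value: $val, Retrieved Value is: ", decode_s24( encode_s24($val) ), "\n";
}
```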
Don't use '#ff0000':
use Acme::AutoColor; my $redcolor = RED();
All colors subject to change without notice.
Re: 3-byte representation
by TomDLux (Vicar) on Oct 18, 2011 at 15:10 UTC
How many numbers do you have in a file? How many files? How many files can you store on a $65 2-TB drive?
The basic premise of Unix is to store data as text, if at all possible. This makes it simple to process it using utilities you hadn't considered when the file was created.
Your numbers fit in 24 bits, so +/- 8,388,608 ... in fact possibly a smaller range, since you suggest adding a constant to shift the numbers to all-positive. If the numbers are evenly distributed, storing as text requires a separator plus 1-7 digits, plus a possible minus sign. That works out to an average of 5 bytes per number, if you're using ASCII. If there is any asymmetry to make small values more likely than large ones, it might be better than 5 bytes. For a small loss you are now able to feed your files through grep, dc, tr, sed, awk, perl.
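As a rough check of the estimate above, the average text length can be computed directly. This sketch assumes values uniformly distributed over the whole signed 24-bit range and one separator byte per value; digit_chars is a hypothetical helper. Under those assumptions the average comes out above 8 bytes per value, so a figure near 5 bytes would only hold if small magnitudes are much more common than large ones.

```perl
use strict;
use warnings;

# Hypothetical helper: total number of decimal digits needed to print every
# integer magnitude in the inclusive range $from .. $to.
sub digit_chars {
    my ( $from, $to ) = @_;
    my $total = 0;
    for my $d ( 1 .. 8 ) {
        my $lo = $d == 1 ? 0 : 10**( $d - 1 );    # smallest magnitude with $d digits
        my $hi = 10**$d - 1;                      # largest magnitude with $d digits
        $lo = $from if $lo < $from;
        $hi = $to   if $hi > $to;
        $total += $d * ( $hi - $lo + 1 ) if $hi >= $lo;
    }
    return $total;
}

my $count = 2**24;                                # number of distinct 24-bit values
my $chars = digit_chars( 0, 2**23 - 1 )           # digits of 0 .. 8388607
          + digit_chars( 1, 2**23 )               # digits of -1 .. -8388608
          + 2**23                                 # one minus sign per negative value
          + $count;                               # one separator per value
my $avg = $chars / $count;

printf "average bytes per value as text: %.2f\n", $avg;
```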
As Occam said: Entia non sunt multiplicanda praeter necessitatem.
Hello TomDLux, and thank you for your answer!
I have several thousand pairs of numbers (in fact latitudes and longitudes, originally with 4-decimal precision, but converted to integers when the file is created).
Physical storage space is not an issue; only memory is, if the fastest parsing solution is memory intensive, because many different files will be processed by many different users at the same time...
Considering a process is allocated megabytes of memory when it runs, worrying about a few thousand bytes of wasted space is not really productive.
Wasting memory or wasting CPU resources is never good, of course, but correctness is the first priority. Once you have a solution that works correctly, tighten up memory use, IF it presents a problem in the number of instances you can run; tighten up the algorithms involved, IF there's a problem with run time.
No problem? Go on to the next task.