Re: use bytes and length problem
by BrowserUk (Patriarch) on Mar 02, 2003 at 00:10 UTC
|
my $size = scalar unpack'C*', $txt;
A quick check on some content grabs from UTF websites shows a difference between the counts from that and
my $size = scalar unpack'U*', $txt;
using 5.6.1 and 5.8 (AS).
Also, the file size the OS reports for the same content written in binmode is the same as the first count above.
Examine what is said, not who speaks.
1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
3) Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke.
| [reply] [d/l] [select] |
|
|
I read up on pack/unpack, and that looked to be a good way to go, however, those are both returning "60" instead of the "208786" value I was hoping for... I'm going to continue to play with the unpack, see what I come up with... any ideas?
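A possible explanation for the "60", offered as a sketch rather than a diagnosis: in scalar context unpack returns only the first value it produces, and 60 is the code for '<', a likely first character of an HTML page. Counting the values needs list context. The helper name count_octets and the sample string are made up for illustration, and the example assumes $txt holds raw, undecoded bytes.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count the values that
# unpack 'C*' yields for a string of raw bytes.
sub count_octets {
    my ($bytes) = @_;
    # Assigning to an empty list forces list context; the outer scalar
    # assignment then receives the number of elements produced.
    my $count = () = unpack 'C*', $bytes;
    return $count;
}

# Made-up sample data standing in for a fetched page.
my $txt = '<html><body>hello</body></html>';

# Scalar context: unpack hands back only the first value, here ord('<') == 60.
my $first = unpack 'C*', $txt;

print "first value: $first\n";
print "octet count: ", count_octets($txt), "\n";
```

For a plain byte string the count is the same number that length($txt) reports, so the unpack detour mainly helps to show where the 60 comes from.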
Jeff
| [reply] |
|
|
I tested the method on this page and several of those linked from it and it gives me the correct size every time (as compared to the same data dumped to a file in binmode).
I am pretty certain that the problem is that you are using unpack incorrectly. If you would care to post an example of how you are using it, and a (small) sample of the data that you are using it on, then I am certain that we could sort the problem out.
Your reply to pfaut below indicates that you are either trying to use the '/' template character, which will never give you the answers that you require, or you are mixing up the information from different parts of the pack and unpack documentation, which would be no surprise, as it is probably some of the least understandable of the perldocs.
| [reply] |
|
|
---
print map { my ($m)=1<<hex($_)&11?' ':'';
$m.=substr('AHJPacehklnorstu',hex($_),1) }
split //,'2fde0abe76c36c914586c';
| [reply] [d/l] |
|
|
I played around with unpack, and it appears that the following line from the pack() perldoc page (5.6 and 5.8) is true:
The length-item is not returned explicitly from unpack.
Any other ideas? I know I can write this out to a file, but it seems crazy to me to have to do that just to get a proper byte count.
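One way to get a byte count without touching the disk, sketched under the assumption that Perl 5.8 or later is available (the Encode module ships with it) and that $txt is a decoded character string: encode it to UTF-8 and take the length of the result. The helper name utf8_octet_length is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: number of octets $text occupies once encoded as UTF-8.
sub utf8_octet_length {
    my ($text) = @_;
    return length(encode_utf8($text));
}

# Two wide characters, code points 400 and 306.
my $text = "\x{190}\x{132}";

printf "characters: %d, UTF-8 octets: %d\n",
    length($text), utf8_octet_length($text);
# prints "characters: 2, UTF-8 octets: 4"
```

If $txt already holds undecoded bytes (for example a raw HTTP response body), plain length($txt) is already the byte count, and encoding it again would overcount every byte above 127.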
Jeff
| [reply] |
Re: use bytes and length problem
by crenz (Priest) on Mar 01, 2003 at 23:28 UTC
|
Which version of perl are you using? There are some known problems with use bytes; in perl 5.6.1. Have you tried 5.8.0?
| [reply] |
Re: use bytes and length problem
by pg (Canon) on Mar 02, 2003 at 21:54 UTC
|
I would suspect that the problem resides in the way your $txt is created. I wrote up this demo to show different ways to form your string; "use bytes" works every time.
Hope this helps:
use strict;
sub display {
my $string = shift;
use utf8; # as you can see from the result, whether to use utf8 or bytes is irrelevant in this demo, as "U*" forces unicode anyway
print "\nchar semantics: ";
print "$string ";
printf "Length = %d, ", length($string);
printf "Content = %vd\n", $string;
use bytes;
print "byte semantics: ";
print "$string ";
printf "Length = %d, ", length($string);
printf "Content = %vd\n", $string;
}
my $encoded_string;
my @decoded_list;
{
use bytes;
print "=========================\n";
print "Case 1: create string from pack, with use bytes\n";
$encoded_string = pack("U*", 400, 306);
display $encoded_string;
@decoded_list = unpack("U*", $encoded_string);
print join(".", @decoded_list), "\n";
}
{
use utf8; #not necessary in this case
print "=========================\n";
print "Case 2: create string from pack, with use utf8\n";
$encoded_string = pack("U*", 400, 306);
display $encoded_string;
@decoded_list = unpack("U*", $encoded_string);
print join(".", @decoded_list), "\n";
}
{
print "=========================\n";
print "Case 3: create string from \\x{}\n";
$encoded_string = "\x{190}\x{132}";#hex value of 400 and 306
display $encoded_string;
@decoded_list = unpack("U*", $encoded_string);
print join(".", @decoded_list), "\n";
}
| [reply] [d/l] |
|
|
For those of you who are too lazy to run pg's code, here's the output ;-)
=========================
Case 1: create string from pack, with use bytes
char semantics: IJ Length = 4, Content = 198.144.196.178
byte semantics: IJ Length = 4, Content = 198.144.196.178
400.306
=========================
Case 2: create string from pack, with use utf8
char semantics: IJ Length = 4, Content = 198.144.196.178
byte semantics: IJ Length = 4, Content = 198.144.196.178
400.306
=========================
Case 3: create string from \x{}
char semantics: IJ Length = 2, Content = 400.306
byte semantics: IJ Length = 4, Content = 198.144.196.178
400.306
Update I'm on perl 5.6.0 on solaris, so it's probably my own problem ;-). Full spec:
-- Hofmator | [reply] [d/l] [select] |
|
|
Now this is getting interesting :-), when I ran my code, I got this: (I am using AS 5.8.0, and the testing code for case 4 is at the end of this post).
=========================
Case 1: create string from pack, with use bytes
char semantics: ƐIJ Length = 2, Content = 400.306
byte semantics: ƐIJ Length = 4, Content = 198.144.196.178
198.144.196.178
=========================
Case 2: create string from pack, with use utf8
char semantics: ƐIJ Length = 2, Content = 400.306
byte semantics: ƐIJ Length = 4, Content = 198.144.196.178
400.306
=========================
Case 3: create string from \x{}
char semantics: ƐIJ Length = 2, Content = 400.306
byte semantics: ƐIJ Length = 4, Content = 198.144.196.178
400.306
=========================
Case 4: read string from unicode file
char semantics: 裴佳谷
Length = 4, Content = 35060.20339.35895.10
byte semantics: 裴佳谷
Length = 10, Content = 232.163.180.228.189.179.232.176.183.10
Also, I want to add a case to cover the situation where you read your string from file:
{
print "=========================\n";
print "Case 4: read string from utf8 file\n";
open(FILE, "<:utf8", "test.txt") or die "Cannot open test.txt: $!";
$encoded_string = <FILE>;
display $encoded_string;
}
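The file case can also be checked end to end. The following is only a sketch, assuming Perl 5.8 or later with PerlIO layers and the core File::Temp module; the helper file_lengths and its hash keys are invented for illustration. It writes the three characters plus newline from Case 4 as UTF-8, then compares the size on disk with the lengths seen through a raw layer and a decoding layer.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text as UTF-8 to a temporary file, then report
# the on-disk size plus the lengths seen through two different read layers.
sub file_lengths {
    my ($text) = @_;
    my ($out, $path) = tempfile(UNLINK => 1);
    binmode($out, ':raw:encoding(UTF-8)') or die "binmode failed: $!";
    print {$out} $text;
    close($out) or die "close failed: $!";

    my %len    = (disk => -s $path);
    my %layers = (bytes => ':raw', chars => ':raw:encoding(UTF-8)');
    for my $name (sort keys %layers) {
        open(my $in, "<$layers{$name}", $path)
            or die "Cannot open $path: $!";
        my $data = do { local $/; <$in> };   # slurp the whole file
        close($in);
        $len{$name} = length($data);
    }
    return %len;
}

# Code points 35060, 20339, 35895 and a newline, as in the Case 4 output.
my %len = file_lengths("\x{88F4}\x{4F73}\x{8C37}\n");
printf "%s => %d\n", $_, $len{$_} for sort keys %len;
```

The raw read and the size on disk should both come to 10, while the decoded read gives 4, which agrees with the Case 4 figures above.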
| [reply] [d/l] |
Re: use bytes and length problem
by John M. Dlugosz (Monsignor) on Mar 03, 2003 at 07:27 UTC
|
That's one of the first things that got me when I first started exploring bytes vs. utf8 when it came out.
The use bytes does not affect the way length works.
Rather, the $txt value is already marked as to whether it is byte or char oriented.
It really bugged me that there was no way to tell which way a string was oriented (prior to 5.8, or without adding the Scalar::Utils module, IIRC the name), or, more importantly in cases like this, to set the flag.
I don't know offhand if Scalar::Utils can write the desired flag setting. If not, the way we've done it till now is with the "taint-like trick" of matching the whole string with a trivial pattern in parens. The resulting $1 will have the byte/char persuasion that the regex was compiled under (use utf8 or no utf8). I think the bytes pragma had nothing to do with it. That may have changed in 5.8.
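For what it is worth, on Perl 5.8.1 and later the flag can be inspected directly with utf8::is_utf8, and bytes::length reports the size of the internal representation without switching the surrounding code to byte semantics. A small sketch follows; the helper name describe is made up, and the bytes documentation itself recommends these functions for debugging rather than production use.

```perl
use strict;
use warnings;
use bytes ();   # load the module without enabling byte semantics

# Hypothetical helper: report how Perl currently stores a string.
sub describe {
    my ($s) = @_;
    return sprintf 'flag=%s chars=%d internal_bytes=%d',
        (utf8::is_utf8($s) ? 'on' : 'off'), length($s), bytes::length($s);
}

print describe("abc"), "\n";             # flag=off chars=3 internal_bytes=3
print describe("\x{190}\x{132}"), "\n";  # flag=on chars=2 internal_bytes=4
```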
—John
| [reply] |
|
|
my $size= utf8::upgrade($txt);
has done the job for the problem. I'm actually doing:
my $size= utf8::upgrade($txt);
utf8::downgrade($txt);
I seem to be OK without the downgrade, but I am keeping it for the moment in case leaving the string upgraded causes me trouble later.
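A sketch of why this works, with two caveats. utf8::upgrade is documented to return the number of octets needed to represent the string as UTF-8, so the value can serve as a byte count. Caveat one: if $txt holds undecoded bytes, upgrade treats each byte as a Latin-1 character, so every byte above 127 counts twice. Caveat two: utf8::downgrade dies when the string contains characters above 255 unless a true second argument is passed. The helper name upgraded_octets is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: UTF-8 octet count of a character string, computed on
# a copy so the caller's string keeps its internal representation.
sub upgraded_octets {
    my ($text) = @_;              # the argument is copied into $text
    return utf8::upgrade($text);  # returns the UTF-8 octet count
}

my $wide = "\x{190}\x{132}";
print "octets: ", upgraded_octets($wide), "\n";   # octets: 4

# With a true second argument, utf8::downgrade returns false instead of
# dying when the string holds characters above 255.
my $ok = utf8::downgrade($wide, 1);
print "downgrade ", ($ok ? "succeeded" : "not possible"), "\n";
```

Working on a copy makes the follow-up downgrade unnecessary, which sidesteps the failure case altogether.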
This comes from the *use utf8* perldoc. Does this make sense to all? Any closing thoughts?
Again, thanks to everyone for their help.
Jeff | [reply] |
|
|
The upgrade and downgrade functions are not in Perl 5.6's documentation, so they must be new in 5.8. Nice improvement!
In case you didn't find it yet: use utf8 affects the compilation of regular expressions.
—John
P.S. you forgot to log in again. Try setting your theme to something other than the default. Then it will be obvious if you're not logged in.
| [reply] |