muad33b has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a parser that returns the text or pdfs to a swish-search-indexer. The delimiter of the character stream is a "content-length" header that needs to be in bytes or the next header is missed. When using:

use bytes;
my $size = length ($txt);
no bytes;
print <<EOF;
content-length: $size
EOF

the "bytes" pragma doesn't appear to work properly when the text output from "pdftotext" contains multi-byte characters. It still seems to count characters instead of bytes, and it returns less than the size of the same data written to a file, thereby throwing off my indexer.

Does anyone know a way around this besides writing the text out to a file, and getting the stat size of it?

Thanks, Jeff
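
One way to get a byte count without going through a file is to encode the text explicitly and take the length of the resulting octets. This is a minimal sketch, not a diagnosis of the poster's setup: it assumes $txt holds decoded characters and that the indexer expects UTF-8, and both the helper name utf8_byte_length and the sample string are made up for illustration. The Encode module ships with Perl 5.8.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: number of octets $text occupies once encoded as UTF-8.
sub utf8_byte_length {
    my ($text) = @_;
    return length encode('UTF-8', $text);
}

# Hypothetical sample: two characters above U+00FF, two octets each in UTF-8.
my $txt  = "\x{190}\x{132}";
my $size = utf8_byte_length($txt);

print "content-length: $size\n";
```

Because the length is taken from the encoded octets rather than from $txt itself, the result does not depend on how Perl happens to store the string internally.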

Replies are listed 'Best First'.
Re: use bytes and length problem
by BrowserUk (Patriarch) on Mar 02, 2003 at 00:10 UTC

    You might try,

    my $size = scalar unpack'C*', $txt;

    A quick check on some content grabs from UTF websites shows a difference between the counts from that and

    my $size = scalar unpack'U*', $txt;

    using 5.6.1 and 5.8 (AS).

    Also, the file size shown by the OS for the same content written in binmode is the same as the first count above.


    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.

      I read up on pack/unpack, and that looked to be a good way to go, however, those are both returning "60" instead of the "208786" value I was hoping for... I'm going to continue to play with the unpack, see what I come up with... any ideas?

      Jeff

        I tested the method on this page and several of those linked from it and it gives me the correct size every time (as compared to the same data dumped to a file in binmode).

        I am pretty certain that the problem is that you are using unpack incorrectly. If you would care to post an example of how you are using it, and a (small) sample of the data that you are using it on, then I am certain that we could sort the problem out.

        Your reply to pfaut below indicates that you are either trying to use the '/' template character, which will never give you the answers that you require, or you are mixing up the information from different parts of the pack and unpack documentation, which would be no surprise, as it is probably some of the least understandable of the perldocs.



        I played around with unpack, and it appears that the following line from the pack() perldoc page (5.6 and 5.8) is true:

        The length-item is not returned explicitly from unpack.

        Any other ideas? I know I can write this out to a file, but it seems crazy to me to have to do that just to get a proper byte count.

        Jeff
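
One possible explanation for the 60 reported above, offered as an assumption because the actual data was never posted: in scalar context unpack returns only the first value it extracts, and 60 is the character code of '<'. Forcing list context gives a count of the unpacked values instead. The sample string below is hypothetical and deliberately ASCII-only, since whether 'C*' walks bytes or characters of a UTF-8-flagged string differs between Perl versions.

```perl
use strict;
use warnings;

my $txt = "<html>";    # hypothetical ASCII-only sample

# Scalar context: unpack hands back just the first value it extracts.
my $first = unpack 'C*', $txt;

# A list assignment evaluated in scalar context yields the number of values.
my $count = () = unpack 'C*', $txt;

print "first=$first count=$count\n";
```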

Re: use bytes and length problem
by crenz (Priest) on Mar 01, 2003 at 23:28 UTC

    Which version of perl are you using? There are some known problems with use bytes; in perl 5.6.1. Have you tried 5.8.0?

      I'm using 5.8.0 on Redhat 8.0, and binmode should be in place since it's linux right? I'm looking everywhere... no luck so far.

Re: use bytes and length problem
by pg (Canon) on Mar 02, 2003 at 21:54 UTC
    I suspect that the problem lies in the way your $txt is created. I wrote this demo to show different ways of forming your string, and "use bytes" works every time.

    Hope this helps:

    use strict;

    sub display {
        my $string = shift;
        use utf8; # as you can see from the result, whether to use utf8 or bytes
                  # is irrelevant in this demo, as "U*" forces unicode anyway
        print "\nchar semantics: ";
        print "$string ";
        printf "Length = %d, ", length($string);
        printf "Content = %vd\n", $string;
        use bytes;
        print "byte semantics: ";
        print "$string ";
        printf "Length = %d, ", length($string);
        printf "Content = %vd\n", $string;
    }

    my $encoded_string;
    my @decoded_list;

    {
        use bytes;
        print "=========================\n";
        print "Case 1: create string from pack, with use bytes\n";
        $encoded_string = pack("U*", 400, 306);
        display $encoded_string;
        @decoded_list = unpack("U*", $encoded_string);
        print join(".", @decoded_list), "\n";
    }

    {
        use utf8; # not necessary in this case
        print "=========================\n";
        print "Case 2: create string from pack, with use utf8\n";
        $encoded_string = pack("U*", 400, 306);
        display $encoded_string;
        @decoded_list = unpack("U*", $encoded_string);
        print join(".", @decoded_list), "\n";
    }

    {
        print "=========================\n";
        print "Case 3: create string from \\x{}\n";
        $encoded_string = "\x{190}\x{132}"; # hex values of 400 and 306
        display $encoded_string;
        @decoded_list = unpack("U*", $encoded_string);
        print join(".", @decoded_list), "\n";
    }

      For those of you who are too lazy to run pg's code, here's the output ;-)
      =========================
      Case 1: create string from pack, with use bytes

      char semantics: IJ Length = 4, Content = 198.144.196.178
      byte semantics: IJ Length = 4, Content = 198.144.196.178
      400.306
      =========================
      Case 2: create string from pack, with use utf8

      char semantics: IJ Length = 4, Content = 198.144.196.178
      byte semantics: IJ Length = 4, Content = 198.144.196.178
      400.306
      =========================
      Case 3: create string from \x{}

      char semantics: IJ Length = 2, Content = 400.306
      byte semantics: IJ Length = 4, Content = 198.144.196.178
      400.306

      Update I'm on perl 5.6.0 on solaris, so it's probably my own problem ;-). Full spec:

      -- Hofmator

        Now this is getting interesting :-), when I ran my code, I got this: (I am using AS 5.8.0, and the testing code for case 4 is at the end of this post).

        =========================
        Case 1: create string from pack, with use bytes
        
        char semantics: ƐIJ Length = 2, Content = 400.306
        byte semantics: ƐIJ Length = 4, Content = 198.144.196.178
        198.144.196.178
        =========================
        Case 2: create string from pack, with use buyes
        
        char semantics: ƐIJ Length = 2, Content = 400.306
        byte semantics: ƐIJ Length = 4, Content = 198.144.196.178
        400.306
        =========================
        Case 3: create string from \x{}
        
        char semantics: ƐIJ Length = 2, Content = 400.306
        byte semantics: ƐIJ Length = 4, Content = 198.144.196.178
        400.306
        =========================
        Case 4: read string from unicode file
        
        char semantics: 裴佳谷
         Length = 4, Content = 35060.20339.35895.10
        byte semantics: 裴佳谷
         Length = 10, Content = 232.163.180.228.189.179.232.176.183.10
        
        Also, I want to add a case to cover the situation where you read your string from file:
        {
            print "=========================\n";
            print "Case 4: read string from utf8 file\n";
            open(FILE, "<:utf8", "test.txt");
            $encoded_string = <FILE>;
            display $encoded_string;
        }

Re: use bytes and length problem
by John M. Dlugosz (Monsignor) on Mar 03, 2003 at 07:27 UTC
    That's one of the first things that got me when I first started exploring bytes vs. utf8 when it came out.

    The use bytes does not affect the way length works.

    Rather, the $txt value is already marked as to whether it is byte or char oriented.

    It really bugged me that there was no way to tell which way a string was oriented (prior to 5.8, or without adding the Scalar::Utils module, IIRC the name) or, more importantly in cases like this, to set the flag.

    I don't know offhand if Scalar::Utils can write the desired flag setting. If not, the way we've done it till now is with the "taint-like trick" of matching the whole string with a trivial pattern in parens. The resulting $1 will have the byte/char persuasion that the regex was compiled under (use utf8 or no utf8). I think the bytes pragma had nothing to do with it. That may have changed in 5.8.

    —John
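
On Perl 5.8.1 and later, the per-string flag described above can be inspected directly with utf8::is_utf8, and bytes::length reports the size of the internal buffer. A small sketch with a made-up sample string follows. The internal byte length matches the UTF-8 output size only when the flag is on, so this is meant for inspecting a string, not as a recommended way to compute a header value.

```perl
use strict;
use warnings;
use bytes ();    # loads bytes::length without switching on byte semantics

# Hypothetical sample: characters above U+00FF force the UTF-8 flag on.
my $wide = "\x{190}\x{132}";

my $flagged     = utf8::is_utf8($wide) ? 1 : 0;    # is the string char-oriented?
my $char_length = length $wide;                    # counts characters
my $byte_length = bytes::length($wide);            # counts octets in the internal buffer

print "flag=$flagged chars=$char_length bytes=$byte_length\n";
```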

      Thanks to all for your help. It appears as though

      my $size= utf8::upgrade($txt);

      has done the job for the problem. I'm actually doing:

      my $size= utf8::upgrade($txt);
      utf8::downgrade($txt);

      I seem to be OK without the downgrade, but I'm keeping it for the moment in case leaving the string upgraded causes me trouble later.

      This comes from the *use utf8* perldoc. Does this make sense to all? Any closing thoughts?

      Again, thanks to everyone for their help.

      Jeff
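
A short sketch of what these two calls do, using a hypothetical sample string. Per the utf8 documentation, utf8::upgrade converts the string's internal representation in place and returns the number of octets needed to represent it as UTF-8, while utf8::downgrade dies if the string contains characters above 0xFF unless a true second argument (FAIL_OK) is passed.

```perl
use strict;
use warnings;

# Hypothetical sample: "caf", one Latin-1 character, a space, one wide character.
my $txt = "caf\xe9 \x{190}";

# Upgrade in place; the return value is the octet count of the UTF-8 form.
my $size = utf8::upgrade($txt);

# With FAIL_OK true, downgrade returns false instead of dying when the
# string cannot be represented in the native 8-bit encoding.
my $downgraded = utf8::downgrade($txt, 1) ? 1 : 0;

print "size=$size downgraded=$downgraded\n";
```

Calling utf8::downgrade($txt) without the second argument on text like this would die, so the unconditional downgrade in the post above is only safe when every character fits in a single byte.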

        The upgrade and downgrade functions are not in Perl 5.6's documentation, so it must be new to 5.8. Nice improvement!

        In case you didn't find it yet, use utf8 affects the compilation of regular expressions.

        —John

        P.S. you forgot to log in again. Try setting your theme to something other than the default. Then it will be obvious if you're not logged in.