in reply to Byte counts and Seek function

You're in for a world of pain if you try to mix byte counts with UTF-8, because a single UTF-8 character may be encoded as more than one byte. seek doesn't take variable-width encodings into account; it only counts bytes.
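
Here is a minimal sketch of that mismatch. The sample string is made up for illustration and is not taken from the posted code; it only shows that the character count and the byte count of the same text can differ once it is encoded as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: "na\x{EF}ve" is "naive" with U+00EF (i with diaeresis),
# a character that takes two bytes in UTF-8.
my $text  = "na\x{EF}ve";
my $bytes = encode('UTF-8', $text);

my $char_count = length $text;     # length of the character string
my $byte_count = length $bytes;    # length of the encoded byte string

print "characters: $char_count, bytes: $byte_count\n";
# prints "characters: 5, bytes: 6"
```

An offset computed from the character count would land one byte short for anything that follows this word in the file.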

(I don't know what your utf8 function does, so I can't comment on what your call to encode does.)

Seems to me that it would be easier to call tell (not pos, as I first wrote) when you read in a sentence and keep that position around, rather than try to reconstruct it from the data you've read (and decoded, possibly normalized, et cetera).
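
Something along these lines is what I have in mind. It is only a sketch: the sample data, the in-memory filehandle and the variable names are stand-ins of mine, and it assumes one sentence per line read as raw bytes and decoded afterwards.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Fcntl qw(SEEK_SET);

# Hypothetical sample data: three lines of raw UTF-8 bytes.
my $data = "caf\xC3\xA9\n" . "na\xC3\xAFve\n" . "plain\n";

# An in-memory filehandle stands in for the real file here.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

# Record the byte offset at which each line starts, before reading it.
my @offsets;
while (1) {
    my $start = tell $fh;
    my $line  = <$fh>;
    last unless defined $line;
    push @offsets, $start;
}

# Jump back to the second line using the saved byte offset.
seek($fh, $offsets[1], SEEK_SET) or die "Cannot seek: $!";
my $raw = <$fh>;
chomp $raw;

# Decode only after the bytes have been located.
my $second = decode('UTF-8', $raw);

print "offsets: @offsets\n";
print "second line has ", length($second), " characters\n";
```

The offsets come straight from tell, so they are byte positions that seek can use as they are, however many multi-byte characters the earlier lines contain.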

Re^2: Byte counts and Seek function
by choroba (Cardinal) on Aug 27, 2013 at 22:48 UTC
    Are you sure you would use pos? I always thought seek should be used with tell.
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      Yes, you're right. I was thinking of fgetpos in C for some reason (and even there I'd use ftell, so I don't know what I was thinking at all).

Re^2: Byte counts and Seek function
by AnomalousMonk (Archbishop) on Aug 27, 2013 at 22:40 UTC

    utf8 (emphases added):

    utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source
    code
    ...
    The "use utf8" pragma tells the Perl parser to allow UTF-8 in the
    program text in the current lexical scope ...

      That's the utf8 pragma. I know what it does in the posted code: nothing, because there are no non-ASCII characters appearing literally in the source code.

      What does the utf8 function in the OP's code do?

        Oops. Visually scanned for it, but didn't see the utf8 function call the first time through. Should have used a highlighting finder!   (Damned human eyes...)