Schmunzie has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I seek your wisdom.

I have a script that sorts words by length and converts them to uppercase. The Perl documentation tells me that length works in logical characters, not bytes, but my results file tells me, for example, that żłóBżE is a ten-letter word all in upper case. I am less hopeful of automating the conversion of the non-Latin characters to upper case, other than by handling them explicitly, but surely I can expect the length function to work. I am clearly missing a step. Can you tell me what I'm doing wrong?

Thanks,

Schmunzie

Replies are listed 'Best First'.
Re: Polish text
by LanX (Saint) on Jan 03, 2021 at 18:12 UTC
    You must decode your input or, to be more precise, Perl needs to know that you are handling character strings and not byte strings.

    For that to happen, you need to convert the input from its encoding to Perl's internal text format.

    From the Encode documentation:

    $string = decode(ENCODING, OCTETS[, CHECK])

    This can also happen earlier, in an input layer, when you open the file.

    See perluniintro#Unicode-I/O

    No better help, since I don't know your input's encoding.

    Cheers Rolf
    (addicted to the Perl Programming Language :)

      Thank you, Rolf, that fixed both my problems.
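The two routes LanX describes (an explicit decode, or an input layer on open) can be sketched as below. This is only a sketch under an assumption the thread never confirms: that the input is UTF-8. The temporary file and the variable names are made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# The UTF-8 octets for "żłóBżE", written as byte escapes so that this
# file itself is plain ASCII. ASSUMPTION: the real input is UTF-8.
my $octets = "\xC5\xBC\xC5\x82\xC3\xB3B\xC5\xBCE";

# Undecoded, Perl only sees bytes, so length counts bytes.
my $byte_length = length $octets;

# Route 1: decode explicitly with Encode.
my $chars       = decode('UTF-8', $octets);
my $char_length = length $chars;

# Route 2: let an input layer do the decoding while reading.
# A temporary file stands in for the real input file.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode $tmp_fh;                      # write the raw octets untouched
print {$tmp_fh} $octets, "\n";
close $tmp_fh or die "close: $!";

open my $in, '<:encoding(UTF-8)', $tmp_name or die "open: $!";
chomp(my $line = <$in>);
close $in;
my $layer_length = length $line;

# Encode again on the way out, here via an output layer.
binmode STDOUT, ':encoding(UTF-8)';
print "bytes: $byte_length, decoded: $char_length, via layer: $layer_length\n";
print uc($chars), "\n";
```

Run as written, this reports 10 bytes against 6 characters for both routes, and uc upper-cases the Polish letters once the string has been decoded.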
Re: Polish text
by 1nickt (Canon) on Jan 03, 2021 at 20:02 UTC

    Hi Schmunzie, try this:

    perl -MEncode -E 'my $text="żłóBżE"; $text = decode_utf8($text); say length($text); say encode_utf8(uc($text))'

    (Note: As pointed out earlier, you did not specify the encoding your text is in. This assumes UTF-8, for which Encode provides sugar functions.)

    The way forward always starts with a minimal test.
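Scaling that one-liner up to the task in the original question (sort by length, then upper-case) might look like the sketch below. It again assumes UTF-8, and the word list is invented for the example; the original script and its input were never posted. The raw input is simulated by encoding literals, so the decode step has something real to do.

```perl
use strict;
use warnings;
use utf8;                                   # literals in this file are UTF-8
use Encode qw(decode_utf8 encode_utf8);

# Simulate raw input as it would arrive from a file or STDIN: UTF-8 octets.
# The words are hypothetical sample data.
my @raw_words = map { encode_utf8($_) } qw(żółw kot źrebię żłóBżE);

# Decode once at the boundary, then work in characters.
my @words = map { decode_utf8($_) } @raw_words;

# Sort by character length (ties broken by code point order), then upper-case.
my @sorted = map { uc $_ }
             sort { length($a) <=> length($b) or $a cmp $b } @words;

# Encode once on the way out.
print encode_utf8(sprintf("%2d %s\n", length($_), $_)) for @sorted;
```

The point of the layout is that decoding and encoding each happen exactly once, at the edges of the program, and everything in between (length, sort, uc) sees characters rather than bytes.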
Re: Polish text
by kcott (Archbishop) on Jan 04, 2021 at 01:21 UTC

    G'day Schmunzie,

    Using this common alias of mine:

    $ alias perlu
    alias perlu='perl -Mstrict -Mwarnings -Mautodie=:all -Mutf8 -C -E'

    I get a length of six characters reported, and converting to uppercase has no apparent problems:

    $ perlu 'say length("żłóBżE"); say uc("żłóBżE")'
    6
    ŻŁÓBŻE
    

    Of course, I have no idea if your problem word was embedded in your script as literal Polish characters, \x{...} characters, or \N{...} characters. Perhaps you read it from the command line; perhaps it came from an input file.

    — Ken
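For a script rather than a one-liner, the switches in Ken's alias have rough in-file counterparts. The sketch below is one way to write them: `use utf8` corresponds to -Mutf8, and `use open qw(:std :encoding(UTF-8))` is only an approximation of -C, which also depends on the locale. It builds the word in the three ways Ken lists (literal characters, \x{...} and \N{...}) and shows that they all agree.

```perl
use strict;
use warnings;
use utf8;                              # like -Mutf8: literals in this file are UTF-8
use charnames qw(:full);               # needed for \N{...} on Perls older than 5.16
use open qw(:std :encoding(UTF-8));    # approximates what -C sets up for STDIN/STDOUT/STDERR

# The same word written three ways.
my $literal = "żłóBżE";
my $hex     = "\x{17C}\x{142}\x{F3}B\x{17C}E";
my $named   = "\N{LATIN SMALL LETTER Z WITH DOT ABOVE}"
            . "\N{LATIN SMALL LETTER L WITH STROKE}"
            . "\N{LATIN SMALL LETTER O WITH ACUTE}B"
            . "\N{LATIN SMALL LETTER Z WITH DOT ABOVE}E";

# Each form is already a character string, so no decode step is needed.
for my $word ($literal, $hex, $named) {
    printf "%d %s\n", length($word), uc $word;
}
```

Each of the three lines printed is `6 ŻŁÓBŻE`. Text that comes from the command line or from a file is a different case: it arrives as bytes and still has to be decoded, as in the earlier replies.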