newblet has asked for the wisdom of the Perl Monks concerning the following question:

I have this html form that takes in an input of up to 40 characters. But if someone puts in a latin character, it always takes up 2 characters. For example, if I put "הההההההההההההההההההההההההההההההההההההההה" into the field and submit the form, calling length($input) returns 40 (should be 20). I've googled forever and I thought use bytes might be my answer, but I don't think it is. Any help would be appreciated on how to properly get length 20 instead of 40. BTW I'm using perl 5.8.

Re: How to count string length with latin characters?
by Joost (Canon) on Nov 02, 2006 at 00:04 UTC
    The form is probably submitted using UTF-8 character encoding, which uses 2 bytes for latin accented characters.

    You have basically 2 ways of dealing with this: only allow submissions in latin-1 (which is the default perl character encoding) using the "accept-charset" property on the <form> tag, or make sure perl knows about your encoding:

    use Encode qw(decode);
    my $real_string = decode("utf8", $input_string); # assumes $input_string is in UTF-8 encoding
    print "string is " . length($real_string) . " characters\n";
    # length() now interprets $real_string in characters instead of bytes.
    There is a LOT of subtle stuff going on here. You should probably read Encode first. Maybe.
      (...) only allow submissions in latin-1 (which is the default perl character encoding) using the "accept-charset" property on the <form> tag (...)
      That's a reasonable solution, so long as you keep in mind that characters beyond latin-1 will be submitted as encoded entities (e.g. &#321;). Taking that into account in length calculations is a *lot* harder than consistently using utf8 across the board.
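      To illustrate why entities complicate length checks, here is a minimal sketch. The helper decode_numeric_entities is hypothetical (it is not from any module) and only handles numeric references; named entities such as &eacute; would need something like HTML::Entities from CPAN.

```perl
use strict;
use warnings;

# Hypothetical helper: turn numeric character references such as &#321; or
# &#x141; back into the characters they stand for. Named entities are
# deliberately left untouched.
sub decode_numeric_entities {
    my ($text) = @_;
    $text =~ s/&#x([0-9a-fA-F]+);/chr(hex($1))/ge;   # hexadecimal form
    $text =~ s/&#([0-9]+);/chr($1)/ge;               # decimal form
    return $text;
}

# What a latin-1 form might send for the three characters a, L-with-stroke, b.
my $submitted = 'a&#321;b';
my $decoded   = decode_numeric_entities($submitted);

print "as submitted:   ", length($submitted), "\n";   # 8
print "after decoding: ", length($decoded),   "\n";   # 3
```

      A length limit checked against the submitted text would count the entity as six characters, so the check has to run after this kind of decoding.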
Re: How to count string length with latin characters?
by GrandFather (Saint) on Nov 02, 2006 at 00:13 UTC

    I'm confused. There are 40 characters in that string!

    Note too that the characters you have posted render as utf-8 characters (at least for me).

    The following may help however:

    use strict;
    use warnings;
    use Encode;

    my $str = "הההההההההההההההההההההההההההההההההההההההה";
    print 'Raw: ' . length $str;
    print "\nDecoded: " . length decode ('utf8', $str);

    Prints:

    Raw: 80
    Decoded: 40

    If the encoding you are using really is latin1 then using decode ('iso-8859-1', $str) may turn the trick for you.
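    As a rough comparison of the two decodings, the sketch below assumes the form sent 40 copies of "é" as UTF-8 bytes. The sample data is my own assumption, written with byte escapes so the result does not depend on how the script file is saved.

```perl
use strict;
use warnings;
use Encode qw(decode);

# 40 copies of "é" as raw UTF-8 bytes (0xC3 0xA9 each).
my $bytes = "\xC3\xA9" x 40;

my $as_utf8   = decode('UTF-8',      $bytes);   # two bytes become one character
my $as_latin1 = decode('iso-8859-1', $bytes);   # every byte becomes one character

print "raw bytes:         ", length($bytes),     "\n";   # 80
print "decoded as UTF-8:  ", length($as_utf8),   "\n";   # 40
print "decoded as latin1: ", length($as_latin1), "\n";   # 80
```

    Decoding as iso-8859-1 leaves the count at 80 here, so it only helps when the browser really did submit latin1, in which case the raw length is already the character count.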


    DWIM is Perl's answer to Gödel
      ah stupid me... sorry i meant 40 but results in 80, not 20 to 40 :) thanks for the input guys i'm gonna look into your help a little more and see how this works out, thanks again.
Re: How to count string length with latin characters?
by holcapek (Sexton) on Nov 02, 2006 at 08:28 UTC
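    If you expect your input to be utf8, decode it into a character string first (use utf8; only does this for string literals in the source file itself, not for data read from a form). length will then return the number of characters (not bytes) that the string is composed of.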

    If your input is to be in non-utf8 yet not in plain ASCII, you should use encoding 'wanted_encoding'; and length will work in "character" semantics as well. See encoding for more.

    If you want to know the number of bytes that the string is composed of, or your input is in plain ASCII, you can force perl to work like this with use bytes; (see bytes for more).
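    A small sketch of the difference between the character count and the byte counts, again assuming 40 "é" characters as sample data. Note that bytes::length reports the size of perl's internal buffer, so encoding explicitly is probably the safer way to ask "how many bytes will this be as UTF-8".

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use bytes ();    # load the module without switching on byte semantics

# 40 decoded "é" characters (sample data, built from UTF-8 byte escapes).
my $chars = decode('UTF-8', "\xC3\xA9" x 40);

my $char_count    = length($chars);                    # character semantics
my $internal_size = bytes::length($chars);             # bytes in perl's internal buffer
my $utf8_size     = length(encode('UTF-8', $chars));   # bytes once encoded as UTF-8

print "characters:     $char_count\n";      # 40
print "internal bytes: $internal_size\n";   # 80
print "UTF-8 bytes:    $utf8_size\n";       # 80
```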