Re: How to count string length with latin characters?

The form is probably submitted as UTF-8, which uses 2 bytes for each accented latin character, so length() on the raw input counts bytes rather than characters.

You basically have 2 ways of dealing with this: only allow submissions in latin-1 (which is the default perl character encoding) using the "accept-charset" attribute on the <form> tag, or make sure perl knows about your encoding:

use Encode qw(decode);
my $real_string = decode("utf8", $input_string); # assumes $input_string is in UTF-8 encoding
print "string is " . length($real_string) . " characters\n"; # length() now interprets $real_string in characters instead of bytes
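
To make the difference concrete, here is a minimal, self-contained sketch. The sample value of $input_string is an assumption for illustration (the word "café" written out as raw UTF-8 bytes, roughly what a form handler might hand you); it is not taken from the original question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed sample input: "café" as raw UTF-8 bytes (the "é" is \xC3\xA9).
my $input_string = "caf\xC3\xA9";

# Turn the byte string into a perl character string.
my $real_string = decode("UTF-8", $input_string);

# Before decoding, length() counts bytes; after decoding, characters.
printf "bytes: %d\n",      length($input_string);
printf "characters: %d\n", length($real_string);
```

The byte count is one higher than the character count because the single accented character occupies two bytes in UTF-8.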

There is a LOT of subtle stuff going on here. You should probably read Encode first. Maybe.

"What should it profit a man, if he should win a flame war, yet lose his cool?"


Re^2: How to count string length with latin characters? by rhesa (Vicar) on Nov 02, 2006 at 01:07 UTC
"(...) only allow submissions in latin-1 (which is the default perl character encoding) using the "accept-charset" property on the <form> tag (...)"

That's a reasonable solution, so long as you keep in mind that characters beyond latin-1 will be submitted as encoded entities (e.g. `&#321;` for Ł). Taking that into account in length calculations is a lot harder than consistently using utf8 across the board.
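
As a rough illustration of the extra work involved, the sketch below counts characters in a latin-1 submission where the browser has substituted decimal numeric references. The helper name length_with_references and the sample string are hypothetical, and the code assumes the decimal `&#NUMBER;` form only. It also cannot tell a browser-generated reference from one the user typed literally, which is part of why this approach is fragile.

```perl
use strict;
use warnings;

# Hypothetical helper: expand decimal numeric references such as &#321;
# into single characters, then count the characters.
sub length_with_references {
    my ($text) = @_;
    (my $expanded = $text) =~ s/&#(\d+);/chr($1)/ge;
    return length $expanded;
}

# Assumed sample: "Łódź" as a latin-1 form might deliver it. The "ó" fits
# in latin-1 (\xF3); "Ł" and "ź" arrive as numeric references.
my $submitted = "&#321;\xF3d&#378;";

printf "raw length: %d\n",      length($submitted);
printf "expanded length: %d\n", length_with_references($submitted);
```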
Re^3: How to count string length with latin characters? by Joost (Canon) on Nov 02, 2006 at 01:34 UTC
True enough, Perl's unicode support is pretty good, but if your input is "encodable" in latin-1, it will still save you a lot of headaches to use latin-1 instead of unicode if you work with remote applications like databases. DBD::mysql's utf8 support is still extremely unreliable, for instance.

edit: Oh, and also note that there is no standard for encoding characters outside the accepted charset: most popular browsers use the `&#NUMBER;` encoding, but not all of them do. See the replies to Unicode characters in <code> blocks for an example.

"What should it profit a man, if he should win a flame war, yet lose his cool?"