How to count string length with latin characters?

newblet has asked for the wisdom of the Perl Monks concerning the following question:

Re: How to count string length with latin characters? by Joost (Canon) on Nov 02, 2006 at 00:04 UTC
The form is probably submitted using UTF-8 character encoding, which uses 2 bytes for each accented Latin character. You basically have two ways of dealing with this: only allow submissions in latin-1 (which is the default perl character encoding) using the "accept-charset" property on the `<form>` tag, or make sure perl knows about your encoding: `use Encode qw(decode); my $real_string = decode("utf8", $input_string); print "string is " . length($real_string) . " characters\n";` This assumes `$input_string` is in UTF-8 encoding; after decoding, `length()` interprets `$real_string` in characters instead of bytes. There is a LOT of subtle stuff going on here. You should probably read Encode first. Maybe. "What should it profit a man, if he should win a flame war, yet lose his cool?"
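A minimal sketch of the decode-then-count approach described above. The helper name `char_length` and the sample input are hypothetical, and it assumes the submitted value really is UTF-8; `Encode::FB_CROAK` is passed so that malformed input dies instead of being silently patched up.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: count characters in a UTF-8 encoded byte string.
sub char_length {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed UTF-8 rather than
    # substituting replacement characters.
    my $chars = decode('UTF-8', $bytes, Encode::FB_CROAK);
    return length $chars;
}

# "\xC3\xA4" is the UTF-8 byte sequence for "ä" (U+00E4).
my $input = "\xC3\xA4" x 40;

print "bytes: ", length($input), "\n";
print "characters: ", char_length($input), "\n";
```

For this sample the byte count is twice the character count, which matches the doubling that a raw `length` call shows on undecoded form data.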
Re^2: How to count string length with latin characters? by rhesa (Vicar) on Nov 02, 2006 at 01:07 UTC
"(...) only allow submissions in latin-1 (which is the default perl character encoding) using the "accept-charset" property on the `<form>` tag (...)" That's a reasonable solution, so long as you keep in mind that characters beyond latin-1 will be submitted as encoded entities (e.g. `&#321;` for Ł). Taking that into account in length calculations is a lot harder than consistently using utf8 across the board.
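To illustrate why entities complicate the count, here is a naive sketch that treats each decimal reference as one character. The helper name `length_with_entities` is hypothetical, and it assumes the browser uses the `&#NUMBER;` form for characters outside latin-1.

```perl
use strict;
use warnings;

# Hypothetical helper: length of a latin-1 form value in which characters
# outside latin-1 arrived as decimal references such as &#321;.
sub length_with_entities {
    my ($value) = @_;
    # Replace each decimal reference with the single character it names.
    $value =~ s/&#(\d+);/chr($1)/ge;
    return length $value;
}

# "Łódź" as it might arrive: Ł and ź as references, ó as the latin-1 byte 0xF3.
my $submitted = "&#321;\xF3d&#378;";

print "raw length: ", length($submitted), "\n";
print "adjusted length: ", length_with_entities($submitted), "\n";
```

The sketch cannot tell a browser-generated `&#321;` from one the user typed literally, which is part of what makes this route harder than using utf8 throughout.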
Re^3: How to count string length with latin characters? by Joost (Canon) on Nov 02, 2006 at 01:34 UTC
True enough, Perl's unicode support is pretty good, but if your input is "encodable" in latin-1, it will still save you a lot of headaches to use latin-1 instead of unicode if you work with remote applications like databases: DBD::mysql's utf8 support is still extremely unreliable, for instance. edit: Oh, and also note that there is no standard for encoding characters outside the accepted charset: most popular browsers use the `&#NUMBER;` encoding, but not all of them do. See the replies to "Unicode characters in <code> blocks" for an example.
Re: How to count string length with latin characters? by GrandFather (Saint) on Nov 02, 2006 at 00:13 UTC
I'm confused. There are 40 characters in that string! Note too that the characters you have posted render as utf-8 characters (at least for me). The following may help however: `use strict; use warnings; use Encode; my $str = "ääääääääääääääääääääääääääääääääääääääää"; print 'Raw: ' . length $str; print "\nDecoded: " . length decode ('utf8', $str);` Prints `Raw: 80` and then `Decoded: 40`. If the encoding you are using really is latin1 then using `decode ('iso-8859-1', $str)` may turn the trick for you. DWIM is Perl's answer to Gödel
Re^2: How to count string length with latin characters? by newblet (Novice) on Nov 02, 2006 at 00:17 UTC
ah stupid me... sorry, I meant 40 but it results in 80, not 20 to 40 :) Thanks for the input guys, I'm gonna look into your help a little more and see how this works out. Thanks again.
Re: How to count string length with latin characters? by holcapek (Sexton) on Nov 02, 2006 at 08:28 UTC
If you expect your input to be utf8, then `use utf8;` and `length` will return the number of characters (not bytes) that the string is composed of. If your input is in a non-utf8 encoding yet not plain ASCII, you should `use encoding 'wanted_encoding';` and `length` will work with "character" semantics as well. See encoding for more. If you want to know the number of bytes that the string is composed of, or your input is in plain ASCII, you can force perl to work like this with `use bytes;`. See bytes for more.
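A small sketch of character versus byte counts for a string literal. Note that `use utf8` only declares the encoding of the source file, so it applies to literals like the one here; data read from a form still needs decoding as in the earlier replies. The helper name `utf8_byte_length` is hypothetical, and the byte count is taken by encoding explicitly rather than with `use bytes`, whose documentation discourages it for anything other than debugging.

```perl
use strict;
use warnings;
use utf8;                      # this source file is assumed to be saved as UTF-8
use Encode qw(encode);

# Hypothetical helper: number of bytes the string occupies once encoded as UTF-8.
sub utf8_byte_length {
    my ($chars) = @_;
    return length(encode('UTF-8', $chars));
}

my $literal = "äöü";           # a literal in the source, so "use utf8" applies

print "characters: ", length($literal), "\n";
print "bytes: ", utf8_byte_length($literal), "\n";
```

With the file saved as UTF-8, the literal counts as 3 characters and 6 bytes.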