PerlMonks  

validating unicode chars in their smallest form

by damian45 (Novice)
on Jun 13, 2010 at 01:18 UTC ( [id://844384]=perlquestion )

damian45 has asked for the wisdom of the Perl Monks concerning the following question:

hi monks

I'm validating some mixed English and Japanese UTF-8 input. It sometimes contains a-z, A-Z and 0-9 entered not only from the common ASCII-compatible Unicode range, but also from the Unicode range xFF10 - xFF5E:

http://en.wikibooks.org/wiki/Unicode/Character_reference/F000-FFFF

for example

A (unicode x0041)

vs

Ａ (unicode xFF21 http://www.decodeunicode.org/u+FF21)

I understand that, to be safe, I need to interpret the Unicode characters I accept only in their smallest Unicode representation,
e.g. interpret xFF21 as x0041 (as above).

So the question is: can I use some Perl function or module to do this, or do I have to convert them manually with a mapping? From all the experimenting I've done so far, it seems I'll have to do it manually. That surprises me if I'm supposed to interpret them in their smallest representation.

cheers for any feedback, sorry for my english

damian

Re: validating unicode chars in their smallest form
by ikegami (Patriarch) on Jun 13, 2010 at 01:43 UTC

    "I understand that, to be safe, I need to interpret the Unicode characters I accept only in their smallest Unicode representation"

    It almost sounds as if you're talking about composed form (LATIN CAPITAL LETTER E WITH ACUTE) vs decomposed form (LATIN CAPITAL LETTER E + COMBINING ACUTE ACCENT). Unicode::Normalize can convert between the two. That has nothing to do with FULLWIDTH LATIN CAPITAL LETTER A vs LATIN CAPITAL LETTER A (although the "K" functions might do that).
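A minimal sketch of what the "K" functions do with the fullwidth range, assuming the input has already been decoded into a Perl character string. `NFKC` is a real export of Unicode::Normalize; the helper name `fold_width` is made up for this illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKC);

# Hypothetical helper: apply compatibility normalization (NFKC), which
# replaces fullwidth letters and digits with their ASCII counterparts.
sub fold_width {
    my ($text) = @_;
    return NFKC($text);
}

# Fullwidth "ABC123", written as code point escapes.
my $input  = "\x{FF21}\x{FF22}\x{FF23}\x{FF11}\x{FF12}\x{FF13}";
my $folded = fold_width($input);

print $folded eq 'ABC123' ? "folded to ASCII\n" : "not folded\n";
# prints "folded to ASCII"
```

Note that NFKC is not limited to the fullwidth range: it also rewrites other compatibility characters, for example halfwidth katakana become ordinary katakana, so it may change more of a Japanese string than intended.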

    But if you're trying to avoid spoofs, the advice is to refuse strings ("words"?) that have characters from multiple scripts.

Re: validating unicode chars in their smallest form
by ikegami (Patriarch) on Jun 13, 2010 at 18:23 UTC

    "Hmm, that seems like it might address a slightly different issue, but I will read more about that."

    Compare www.paypаl.com vs www.paypal.com. On this computer, it's impossible to tell that the two strings are different by looking at them. On another computer, it's very hard.
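The earlier advice about refusing mixed-script strings could be sketched roughly as follows. This is only an illustration: the list of scripts checked (Latin, Cyrillic, Greek) and the helper names `scripts_in` and `is_mixed_script` are assumptions, and Japanese text legitimately mixes Han, Hiragana and Katakana, so a real check would need a more careful policy.

```perl
use strict;
use warnings;

# Illustrative set of easily confused scripts; not an exhaustive list.
my %SCRIPT_RE = (
    Latin    => qr/\p{Latin}/,
    Cyrillic => qr/\p{Cyrillic}/,
    Greek    => qr/\p{Greek}/,
);

# Return the names of the listed scripts that occur in the word.
sub scripts_in {
    my ($word) = @_;
    return sort grep { $word =~ $SCRIPT_RE{$_} } keys %SCRIPT_RE;
}

# True if the word draws on more than one of the listed scripts.
sub is_mixed_script {
    my ($word) = @_;
    my @found = scripts_in($word);
    return @found > 1 ? 1 : 0;
}

my $spoof = "payp\x{430}l";   # U+0430 CYRILLIC SMALL LETTER A in place of "a"
print join(', ', scripts_in($spoof)), "\n";    # prints "Cyrillic, Latin"
print join(', ', scripts_in('paypal')), "\n";  # prints "Latin"
```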

    "This is concern #1 of mine: Using UTF-8 Encoding to Bypass Validation Logic"

    This has nothing to do with Unicode. It's strictly a UTF-8 problem. Since you should decode text before working with it, there's no problem.

    Take U+00C9. It could be encoded as

    C3 89
    or as
    E0 83 89

    If you work with the characters in their encoded form, they appear to be two different characters. Once you decode them, you're only dealing with character U+00C9. There's no problem if your decoder works properly.

    (A quick test shows that Encode only accepts the shortest form. The longer forms are treated as invalid. That's good.)
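The quick test itself is not shown in the thread; one way such a check might look, assuming the strict 'UTF-8' encoding name and the `FB_CROAK` check mode of Encode (the helper name `decode_strict` is made up here):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode strictly, returning undef for malformed input.
sub decode_strict {
    my ($bytes) = @_;   # work on a copy; decode() may modify its argument when CHECK is set
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
}

my $shortest = decode_strict("\xC3\x89");       # shortest form of U+00C9
my $overlong = decode_strict("\xE0\x83\x89");   # overlong three-byte form

printf "C3 89    -> %s\n",
    defined $shortest ? sprintf('U+%04X', ord $shortest) : 'rejected';
printf "E0 83 89 -> %s\n",
    defined $overlong ? sprintf('U+%04X', ord $overlong) : 'rejected';
```

Validation done on the decoded string then only ever sees U+00C9, never the raw bytes.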

Re: validating unicode chars in their smallest form
by Xilman (Hermit) on Jun 13, 2010 at 10:23 UTC
    "cheers for reply

    Hmm, that seems like it might address a slightly different issue, but I will read more about that.

    To try to summarize more, this is concern #1 of mine: Using UTF-8 Encoding to Bypass Validation Logic

    Then #2 would be that it's not practical to have two different Unicode characters for the same letter in my script; that starts to present other problems.

    thanks for time"

    It looks like the original question has been edited away. Please don't do that. It prevents others from learning from what your post originally contained and from the answers to it.

    Paul

      "It looks like the original question has been edited away. Please don't do that. It prevents others from learning from what your post originally contained and from the answers to it. Paul"

      Oh no! That was not my intent. I replied from a small-screen device and didn't realize, sorry. I have updated the message.

      I think I've decided to replace any ASCII-compatible Unicode characters with the larger Latin equivalents I mentioned. I'm not really sure what's correct; it seems like going to the smaller form is correct. However, I get data from users in both formats, depending on their personal habits and how their PC is set up.

      thanks
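For reference, a manual mapping of the kind discussed in this thread might be sketched with `tr///`, in either direction. The helper names `to_fullwidth` and `to_ascii` are made up, and the sketch assumes decoded character strings and the full printable range U+0021-U+007E, which lines up one-to-one with U+FF01-U+FF5E. The space character is not covered.

```perl
use strict;
use warnings;

# Hypothetical helper: printable ASCII (U+0021-U+007E) to fullwidth forms.
sub to_fullwidth {
    my ($text) = @_;
    (my $out = $text) =~ tr/\x{21}-\x{7E}/\x{FF01}-\x{FF5E}/;
    return $out;
}

# Hypothetical helper: fullwidth forms (U+FF01-U+FF5E) back to ASCII.
sub to_ascii {
    my ($text) = @_;
    (my $out = $text) =~ tr/\x{FF01}-\x{FF5E}/\x{21}-\x{7E}/;
    return $out;
}

my $wide = to_fullwidth('A1');
print join(' ', map { sprintf 'U+%04X', ord($_) } split //, $wide), "\n";
# prints "U+FF21 U+FF11"
print to_ascii($wide), "\n";
# prints "A1"
```

Whichever direction is chosen, applying it to every input before validation means the script only has to deal with one form of each letter.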
