paragkalra has asked for the wisdom of the Perl Monks concerning the following question:

I am using "Selenium" with "Perl". Selenium provides a method, 'is_text_present', with which we can detect text on a web page.

My web page has some 'Hindi' text on it.

So in my code I wanted to use that method, e.g.
'$sel->is_text_present_ok("Some Hindi Text");'

But the Perl script is converting the Hindi text to a series of question mark characters, as shown below:

$sel->is_text_present_ok("??????");

Replies are listed 'Best First'.
Re: Perl Modules for handling Non English text
by ikegami (Patriarch) on Mar 30, 2009 at 19:39 UTC

    [ The OP has been replaced completely ]

    Your post is a "bit" light on details.

    But my Perl code doesn't recognises it.

    We haven't seen your code, making it rather hard to diagnose and fix the problem it has.

    Are there any PERL modules, using which I can handle the non-English text?

    The name of the language is Perl, for starters.

    What exactly do you want this module to do to the text? "Handle" could mean anything.

Re: Perl Modules for handling Non English text
by afoken (Chancellor) on Mar 30, 2009 at 19:26 UTC

    Please describe your problem a little bit more in detail. Show us the relevant code, the input, the actual output (including warning and error messages) and the expected output.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: Perl Modules for handling Non English text
by Marshall (Canon) on Mar 30, 2009 at 21:56 UTC
    Try http://perldoc.perl.org/perlunicode.html.

    Standard ASCII handles characters in the western alphabet including umlaut characters (like in German). But there are only 256 possibilities in 8 bits! That's not enough for all languages and hence "wide characters", or 16 bit ones.

    In general, you will find that it is possible to make the user interface conform to national standards. At the low level, however, Western languages, and English in particular, are the norm.

    Perl is like any other computer language. You have to tell it how to interpret the byte stream - is each byte a character or is two bytes a character?
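
    As a minimal sketch of what "telling Perl how to interpret the byte stream" can look like, the core Encode module converts raw bytes into characters and back. The sample word and the assumption that the input is UTF-8 are illustrative only; neither comes from the original question.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes for a five-character Devanagari word (illustrative sample),
# written as byte escapes so this source file stays pure ASCII.
my $bytes = "\xE0\xA4\xB9\xE0\xA4\xBF\xE0\xA4\x82\xE0\xA4\xA6\xE0\xA5\x80";

# Tell Perl these bytes are UTF-8: the result is a string of characters.
my $chars = decode('UTF-8', $bytes);

# Going the other way turns the characters back into UTF-8 bytes.
my $again = encode('UTF-8', $chars);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
printf "first code point: U+%04X\n", ord($chars);
```

    Until the bytes are decoded, length() and ord() work on individual bytes rather than on characters.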

      That's not enough for all languages and hence "wide characters", or 16 bit ones.

      Perl's wide chars are 32-bit or 64-bit depending on the build, not 16.

      fmdev10$ perl -le'print ord "\x{FFFFFFFF}"'
      4294967295
      persephone$ perl -le'print ord "\x{FFFFFFFFFFFFFFFF}"'
      18446744073709551615

      Unicode currently requires 21 bits (code points run up to U+10FFFF, spread over 17 planes).

      is each byte a character or is two bytes a character?

      Or something else entirely, as in the following popular encodings: UTF-8 (1-4 bytes per char currently, 1-6 possible), UTF-16le/UTF-16be (2 or 4 bytes per char).
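
      To make the variable widths concrete, here is a small sketch using the core Encode module. The four sample code points and the helper name byte_length are hypothetical choices for illustration, not something from the posts above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: number of bytes one code point occupies in an encoding.
sub byte_length {
    my ($encoding, $code_point) = @_;
    return length encode($encoding, chr $code_point);
}

# Samples: ASCII letter, u-umlaut, a Devanagari letter, and the first
# code point outside the Basic Multilingual Plane.
for my $cp (0x41, 0xFC, 0x0939, 0x10000) {
    printf "U+%04X  UTF-8: %d byte(s)  UTF-16LE: %d byte(s)\n",
        $cp, byte_length('UTF-8', $cp), byte_length('UTF-16LE', $cp);
}
```

      The Devanagari letter takes three bytes in UTF-8 but two in UTF-16LE, so neither "one byte per character" nor "two bytes per character" describes both encodings.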

        My answer was based upon ANSI C:
        GCC on my Intel machine:
        #include <stdio.h>
        #include <stddef.h>

        int main(void)
        {
            printf("hello world\n");
            /* sizeof yields a size_t, so cast it to match the %d conversion */
            printf("size of a wide char is %d bytes", (int) sizeof(wchar_t));
            return 0;
        }
        /* prints:
           hello world
           size of a wide char is 2 bytes */
        $perl -le 'print ord "\x{FFFFFFFF}"'
        4294967295
        is just a 32 bit unsigned hex number.

        I don't know how many bits Hindi requires.

        Update: http://ascii-table.com/unicode.php shows the Unicode standards. This is complex, but basically 16 bits does it.

      Standard ASCII handles characters in the western alphabet including umlaut characters (like in German). But there are only 256 possibilities in 8 bits!
      ASCII is only 7 bits and does not include any accented characters at all. You are, perhaps, confusing it with ISO-8859-n.
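
      A short sketch of the difference, using the core Encode module. The two sample characters are illustrative choices. Whether this substitution is what produced the question marks in the original post is only an assumption, since the code was never shown.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $u_umlaut = chr 0xFC;      # LATIN SMALL LETTER U WITH DIAERESIS
my $ha       = chr 0x0939;    # DEVANAGARI LETTER HA

# ISO-8859-1 has a slot for u-umlaut: it becomes the single byte 0xFC.
my $latin1 = encode('ISO-8859-1', $u_umlaut);
printf "ISO-8859-1: 0x%02X\n", ord $latin1;

# ASCII stops at 0x7F, so neither character fits. With Encode's default
# CHECK setting, an unmappable character is replaced by a substitution
# character instead of raising an error.
my $ascii_umlaut = encode('ascii', $u_umlaut);
my $ascii_ha     = encode('ascii', $ha);
print "ASCII: $ascii_umlaut $ascii_ha\n";
```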
        I think he's referring to the "A" Windows system calls, where the "A" stands for ANSI (not ASCII) despite having very little to do with the ANSI character encodings (or the ASCII character encoding).