edis has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I have a problem searching/matching Unicode strings. This code works:
    use utf8;
    use strict;
    my $s1 = "long_japanese_text_in_utf8";
    my $s2 = "short_japanese_text_in_utf8";
    print index($s1, $s2)."\n";
    print "found\n" if $s1 =~ /$s2/;
Both index and regex work here. The problem appears when the data comes from a Shift-JIS (sjis) encoded file:
    use strict;
    use utf8;
    my $s1 = "short_japanese_text_in_utf8";
    open F, "<:encoding(sjis)", "file.txt";
    while (<F>) {
        print "found\n" if /$s1/;
        print index($_, $s1)."\n";
    }
Neither regex nor index works here. What is the problem? What am I doing wrong? Thanks, Edvinas

Re: Unicode problem
by idsfa (Vicar) on Jun 16, 2005 at 05:48 UTC

    Make sure that you have the encoding you are looking for:

    use Encode; print join("\n", Encode->encodings(":all")), "\n";

    You might also want to check the value of:

    Encode::is_utf8($_);

    Finally, it cannot hurt anything but performance to do:

    $uni = encode_utf8($_);
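One caveat on that last step: encode_utf8 turns a character string into UTF-8 bytes, so it matters which form each side of the comparison is in. The following is only a minimal sketch, assuming the lookup string is a `use utf8` literal and using Encode's decode as a stand-in for a line read through the :encoding(shiftjis) layer; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use utf8;    # the literals below are character strings
use Encode qw(decode encode encode_utf8);

my $needle = "言葉";

# Stand-in for a line read through :encoding(shiftjis):
# Shift-JIS bytes decoded into characters.
my $line = decode('shiftjis', encode('shiftjis', "日本語の言葉"));

# Characters searched for characters.
my $as_chars = index($line, $needle);

# UTF-8 bytes searched for characters.
my $as_bytes = index(encode_utf8($line), $needle);

print "decoded line:    $as_chars\n";    # 4
print "re-encoded line: $as_bytes\n";    # -1
```

In this sketch the search succeeds on the decoded line and fails once the line has been re-encoded, which suggests keeping both strings as decoded characters while comparing.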

    The intelligent reader will judge for himself. Without examining the facts fully and fairly, there is no way of knowing whether vox populi is really vox dei, or merely vox asinorum. -- Cyrus H. Gordon
Re: Unicode problem
by mugwumpjism (Hermit) on Jun 16, 2005 at 21:50 UTC

    I found Unicode-MapUTF8 useful for explicitly converting between arbitrary character sets and UTF-8.

    $h=$ENV{HOME};my@q=split/\n\n/,`cat $h/.quotes`;$s="$h/." ."signature";$t=`cat $s`;print$t,"\n",$q[rand($#q)],"\n";
Re: Unicode problem
by graff (Chancellor) on Jun 17, 2005 at 03:25 UTC
    I have a dumb question (but who knows)...

    Are you really positive that "file.txt" actually contains at least one occurrence of "short_japanese_text_in_utf8", and that it really is encoded in sjis? (And are you being careful about checking for distinct white-space characters that might mess up the comparisons?)
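One rough way to test the encoding assumption is to decode the raw bytes strictly and see whether that fails. This is a sketch only: `decodes_cleanly` is a hypothetical helper, the sample bytes are generated in the script rather than read from "file.txt", and a clean decode does not prove the guess is right, it only fails to rule it out.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);

# Hypothetical helper: true if $bytes decodes without error as $encoding.
sub decodes_cleanly {
    my ($bytes, $encoding) = @_;
    my $copy = $bytes;    # decode() may modify its input when a CHECK value is passed
    return eval { decode($encoding, $copy, FB_CROAK); 1 } ? 1 : 0;
}

# Sample data: U+65E5 U+672C U+8A9E encoded as Shift-JIS bytes.
my $sjis_bytes = encode('shiftjis', "\x{65E5}\x{672C}\x{8A9E}");

for my $encoding ('shiftjis', 'UTF-8') {
    printf "%-8s %s\n", $encoding,
        decodes_cleanly($sjis_bytes, $encoding) ? 'decodes cleanly' : 'does NOT decode';
}
```

For a real file, the bytes would be read with a plain binary open (no :encoding layer) and passed to the same helper.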

    I'm not familiar with Japanese, but I wonder if there might be a problem because of the way Unicode handles this language. Because Japanese and Chinese (and Korean) use a number of "common" ideographic characters, Unicode has created a "unified CJK" character set, which represents, roughly, the union of ideographs used in the three languages. It's conceivable that some "ambiguities" might exist in the sjis-Unicode mappings, such that one Unicode code point could reasonably be used in place of two distinct sjis code points, or vice-versa.

    This could mean that two strings would look "the same" when viewed by a casual human reader, despite the use of one or another distinct code point.
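Dumping the code points of both strings is a quick way to spot that kind of look-alike mismatch. The sketch below uses a hypothetical `codepoints` helper and, as an example, WAVE DASH (U+301C) and FULLWIDTH TILDE (U+FF5E), which render almost identically. As far as I know, different Shift-JIS mapping tables disagree on this character, so the last two lines decode the same bytes with Encode's 'shiftjis' and 'cp932' tables for comparison; treat that part as something to check on your own installation.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: list the code points of a decoded string as U+XXXX.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $str;
}

my $wave_dash       = "\x{301C}";    # WAVE DASH
my $fullwidth_tilde = "\x{FF5E}";    # FULLWIDTH TILDE

print codepoints($wave_dash), "\n";
print codepoints($fullwidth_tilde), "\n";
print "same string? ", ($wave_dash eq $fullwidth_tilde ? "yes" : "no"), "\n";

# The same two bytes decoded with two different Shift-JIS tables;
# compare the two lines to see whether they map to the same code point.
my $bytes = "\x81\x60";
print "shiftjis: ", codepoints(decode('shiftjis', $bytes)), "\n";
print "cp932:    ", codepoints(decode('cp932', $bytes)), "\n";
```

Running `codepoints` on the lookup string and on a decoded line from the file shows exactly where two visually identical strings differ.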

Re: Unicode problem
by dakkar (Hermit) on Jun 19, 2005 at 09:55 UTC

    Works for me, Perl 5.8.2 on Linux. I wrote:

    #!/usr/bin/perl
    use utf8;
    
    open my $fh,'<:encoding(sjis)','j.sjis.txt';
    my $string='言葉';
    print index(<$fh>,$string);
    

    saving it as 'j.pl' as utf-8, and

    日本語の言葉
    

    saving it as 'j.sjis.txt' as shift-jis.

    Running it:

    $ perl j.pl j.sjis.txt
    4
    

    (yes, I used <pre> instead of <code>... but <code> would not let me use character entities, and without the proper characters this answer would be useless)
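The same check can be made self-contained, so that it does not depend on how an editor saved the data file. This is a sketch under stated assumptions: `find_in_file` is a hypothetical helper, the test file is a temporary file written by the script itself, and 'shiftjis' is used as the encoding name.

```perl
use strict;
use warnings;
use utf8;    # the Japanese literals below are character strings
use File::Temp qw(tempfile);

# Hypothetical helper: return index() of $needle for every line of $path,
# reading the file through the given encoding layer.
sub find_in_file {
    my ($path, $encoding, $needle) = @_;
    open my $fh, "<:encoding($encoding)", $path
        or die "Cannot open $path: $!";
    my @hits;
    while (my $line = <$fh>) {
        chomp $line;
        push @hits, index($line, $needle);
    }
    close $fh;
    return @hits;
}

# Write a one-line Shift-JIS test file.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':encoding(shiftjis)';
print {$out} "日本語の言葉\n";
close $out or die "Cannot close $path: $!";

my @hits = find_in_file($path, 'shiftjis', '言葉');
print "offsets: @hits\n";    # offsets: 4
```

The offset is counted in characters, not bytes, because the layer decodes each line before index sees it.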

    -- 
            dakkar - Mobilis in mobile
    

    Most of my code is tested...

    Perl is strongly typed, it just has very few types (Dan)

      Sorry, it was my fault: the short lookup string was incorrect (I can't read Japanese, but I need to work with it). Everything works fine now.

      Thanks all for the help.

      Edvinas