comment on

You need to look at how your MySQL database table is defined, and how the text field contents are actually stored in it. Use some client other than a cgi script to investigate this, if the nature of the storage is not clear to you. For example:

#!/usr/bin/perl

use strict;
use DBI;
use Encode;

# set up connection parameters $dbname, $host, $user, $password, then:

my $dbh = DBI->connect( "DBI::mysql:database=$dbname;host=$host", $use
+r, $password );

my $sth = $dbh->prepare( "select item_name from table1" );
$sth->execute;
my ( $is_wide_utf8, $is_not_utf8, $is_ascii );
while ( my ( $item ) = $sth->fetchrow_array )
{
    my $nbytes = length( $item );
    my $nchars = eval "length( decode( 'utf8', $item, Encode::FB_CROAK
+ ))";
    if ( $@ ) {  # decode croaked -- text string is not utf8
        $is_not_utf8++;
        # figure out something else to try (maybe shiftjis?)
    }
    elsif ( $nchars == $nbytes ) {
        $is_ascii++;
    }
    else {
        $is_wide_utf8++;
    }
}
print "$is_ascii entries found to be ascii-only\n" if ( $is_ascii );
print "$is_wide_utf8 entries found to be utf8\n" if ( $is_wide_utf8);
print "$is_not_uft8 entries found to be non-utf8\n"  if ( $is_not_utf8
+ );
[download]

If the results show that your database contains only ascii characters, it probably means that the numeric character entities (like "ஸ") are being stored in the table (you should print some of the entries to confirm this, and you should probably look into fixing things so that you store actual utf8 characters instead).

Next, in your cgi script, you need to figure out what sort of data is coming from your clients -- numeric entities? actual utf8 characters?

As for doing table searches for unicode strings based on input from a cgi client, the best setup is:

The database table stores utf8 strings in the relevant text fields
The cgi script converts the client input to a utf8 string
The query is prepared using a placeholder in the search condition
The query is executed with the utf8 string as the parameter value for the placeholder

In other words, the cgi script might do something like this:

my $srchString = $cgi-param( ... );
$srchString =~ s/\&#(\d+);/chr($1)/ge;  # convert numeric entities to 
+characters

$srchString .= '%' unless $srchString =~ /%$/;  # add a wildcard for t
+he search

my $sth = $dbh->prepare( "select item_name,blah from table1 where item
+_name like ?"
$sth->execute( $srchString );
my $rows = $sth->fetchall_arrayref;

for my $row ( @$rows ) {
    my ( $item, $blah ) = @$row;
    $item = decode( 'utf8', $item );  # make sure perl knows this is u
+tf8 data
    # ... do other stuff...
}
[download]

In reply to Re: how to change unicode string to byte array ? by graff
in thread how to change unicode string to byte array ? by nythu

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.