PerlMonks  

question on encoding

by Anonymous Monk
on Jan 24, 2007 at 15:42 UTC ( [id://596267]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I am posting serialized data to a url. sometimes the data contains french words, and they get serialized as well. for example, énfasis (not correct french, i know) is escaped as %C3%A9nfasis and becomes Ã©nfasis when inserted into the database.

the webapp is using iso-8859-1 as the encoding.

from my limited knowledge of this realm, i am guessing that the url is encoded as utf-8 and i need to convert %C3%A9nfasis to iso-8859-1 before inserting into the db. am i right?

i tried using encode("iso-8859-1","%C3%A9nfasis") from Encode, which doesn't do anything. I am probably poking around in the dark and would appreciate some help from you.

thanks.

James.

Replies are listed 'Best First'.
Re: question on encoding
by Ieronim (Friar) on Jan 24, 2007 at 16:02 UTC
    You need something like this:
    #!/usr/bin/perl
    use warnings;
    use strict;
    use Encode;

    my $uri = '%C3%A9nfasis';
    $uri =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    $uri = decode_utf8($uri);
    $uri = encode("iso-8859-1", $uri);
    print $uri;
    It prints énfasis.

         s;;Just-me-not-h-Ni-m-P-Ni-lm-I-ar-O-Ni;;tr?IerONim-?HAcker ?d;print
Re: question on encoding
by ferreira (Chaplain) on Jan 24, 2007 at 16:39 UTC

    As Ieronim pointed out in Re: question on encoding, inputs like "%C3%A9nfasis" are URI-escaped, so you first need to translate them to bytes. You can do it by hand or use URI::Escape:

    use URI::Escape;
    use Encode;    # provides decode_utf8 and encode, used below

    my $uri = '%C3%A9nfasis';
    my $octets = uri_unescape($uri);

    Then you interpret those bytes as a UTF-8 string:

    my $s = decode_utf8($octets);

    to finally coerce it to ISO-8859-1 via:

    my $text = encode("iso-8859-1", $s);
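
    The three steps above can be collected into one small routine. This is only a sketch: the name escaped_utf8_to_latin1 is made up for illustration, and it uses the hand-written substitution from Ieronim's reply instead of uri_unescape so that it depends only on the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the thread): takes URI-escaped UTF-8 text
# and returns the same text as ISO-8859-1 octets.
sub escaped_utf8_to_latin1 {
    my ($escaped) = @_;

    # Step 1: undo the URI escaping, "%C3%A9" becomes the two bytes C3 A9.
    (my $octets = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;

    # Step 2: interpret the bytes as UTF-8, C3 A9 becomes the character U+00E9.
    my $chars = decode('UTF-8', $octets);

    # Step 3: serialize the characters as ISO-8859-1, U+00E9 becomes the byte E9.
    return encode('iso-8859-1', $chars);
}

my $latin1 = escaped_utf8_to_latin1('%C3%A9nfasis');
print unpack('H*', $latin1), "\n";    # hex dump of the resulting bytes
```

    Dumping the bytes as hex avoids depending on how the terminal happens to be configured when checking the result.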

    With regard to

    from my limited knowledge of this realm, i am guessing that the url is encoded as utf-8 and i need to convert %C3%A9nfasis to iso-8859-1 before inserting into the db. am i right?
    that's right, provided your database expects the text to be in ISO-8859-1. Most databases have a default encoding, and some columns may have a declared encoding (which overrides the database default). There may also be options in the SQL statements that control the encoding of the text being fed to your tables. Try it out and, if you run into trouble, bring the issue here and tell us more about the database and settings you're using.
      thanks for the reply and explanation. the insertion works now.

      a minor problem: i want to distinguish between french words and english words, and only do the decode/encode operation when a word is french. but the regex /%[0-9A-Fa-f]{2}/ doesn't catch them.

      if ( /%[0-9A-Fa-f]{2}/ ) {
          # 1.
          # my $escaped = uri_unescape( $_ ); same effect as the RE but slower
          s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
          # 2.
          my $s = decode_utf8( $_ );
          # 3.
          $s = encode("iso-8859-1", $s);
          push @new_words, $s;
      }
      else {
          push @new_words, $_;
      }
      in the case of 'énfasis' i peeked at both the url submission and the data that arrived at the perl program. they are different: it is %C3%A9nfasis during submission, but it becomes énfasis after i grab the value through CGI.pm's param method.

      for now, i am taking out the if .. else part and doing decode_utf8/encode on every word i receive, which doesn't feel like a good solution.
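
      A small sketch of why the /%[0-9A-Fa-f]{2}/ test never fires on such values: CGI.pm's param method returns the value with the URI escaping already undone, so no % sequences are left to match. The snippet simulates that un-escaping by hand; $submitted and the two helper subs are made-up names for illustration.

```perl
use strict;
use warnings;

# Simulate the un-escaping that has already happened by the time the
# value is handed to the program; $submitted is a made-up sample value.
my $submitted = '%C3%A9nfasis';
(my $received = $submitted) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;

# Hypothetical predicates: does the string still contain %XX escapes,
# and does it contain any byte in the range 0x80-0xFF?
sub has_escapes    { $_[0] =~ /%[0-9A-Fa-f]{2}/ ? 1 : 0 }
sub has_high_bytes { $_[0] =~ /[\x80-\xff]/     ? 1 : 0 }

printf "submitted: escapes=%d high_bytes=%d\n",
    has_escapes($submitted), has_high_bytes($submitted);
printf "received:  escapes=%d high_bytes=%d\n",
    has_escapes($received), has_high_bytes($received);
```

      The received value has no escapes left but does contain high bytes, so a test on the bytes themselves is the one that can tell the two kinds of word apart.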

        i want to distinguish between french words and english words, and only do the decode/encode operation when a word is french.

        You want something like this, then:

        s/%([0-9a-f]{2})/chr(hex($1))/egi;   # note the closing paren of the capture group
        if ( /[\x80-\xff]/ ) {
            push @new_words, encode( "iso-8859-1", decode_utf8( $_ ));
        }
        else {
            push @new_words, $_;
        }
        The point there is that you only need to do the encoding conversion if the string happens to contain any bytes with the 8th bit set (i.e. bytes in the numeric range 128-255).
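
        The same idea as a standalone routine, offered as a sketch only: maybe_to_latin1 is a hypothetical name, and the input is assumed to be already-unescaped UTF-8 octets. Pure ASCII strings are returned untouched because their bytes are identical in UTF-8 and ISO-8859-1.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert UTF-8 octets to ISO-8859-1 octets only
# when at least one byte has the 8th bit set (value 128-255).
sub maybe_to_latin1 {
    my ($octets) = @_;

    # Pure ASCII: same bytes in both encodings, nothing to do.
    return $octets unless $octets =~ /[\x80-\xff]/;

    return encode('iso-8859-1', decode('UTF-8', $octets));
}

for my $word ("emphasis", "\xC3\xA9nfasis") {
    my $out = maybe_to_latin1($word);
    printf "%d bytes in, %d bytes out\n", length($word), length($out);
}
```

        The second word gets one byte shorter, since the two-byte UTF-8 sequence C3 A9 becomes the single ISO-8859-1 byte E9.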

        Update: be aware that for this sort of approach, if the input data happen to contain any characters that are not in the iso-8859-1 table (e.g. certain "smart quote" characters, or Greek or Russian or ...), you'll get "?" instead of the intended characters as a result of the "encode(iso-8859-1)" call. That's just a limitation you have to live with if you have to stick with that old "legacy" iso-8859 encoding.
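
        If silently getting "?" is not acceptable, Encode can be asked to die on unmappable characters instead. A minimal sketch, assuming the input is UTF-8 octets; strict_latin1 is a made-up name, while FB_CROAK and LEAVE_SRC are the check flags documented by Encode.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: returns ISO-8859-1 octets, or undef when some
# character in the input has no ISO-8859-1 equivalent.
sub strict_latin1 {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);

    # FB_CROAK makes encode() die on an unmappable character instead of
    # substituting "?"; LEAVE_SRC keeps encode() from modifying $chars.
    my $latin1 = eval {
        encode('iso-8859-1', $chars, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $latin1;    # undef if encode() died
}

my $ok  = strict_latin1("\xC3\xA9nfasis");        # U+00E9 is in ISO-8859-1
my $bad = strict_latin1("\xE2\x80\x9Cquoted");    # U+201C (curly quote) is not

print defined $ok  ? "converted\n" : "rejected\n";
print defined $bad ? "converted\n" : "rejected\n";
```

        What to do with a rejected value (skip it, log it, store it elsewhere) depends on the application and is left open here.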

        From the HTML 4.01 Specification:
        The content type "application/x-www-form-urlencoded" is inefficient for sending large quantities of binary data or text containing non-ASCII characters. The content type "multipart/form-data" should be used for submitting forms that contain files, non-ASCII data, and binary data.

Node Type: perlquestion [id://596267]
Approved by Ieronim