bhcesl has asked for the wisdom of the Perl Monks concerning the following question:

I have modified a program called edict to work with online language dictionaries. While the process for this may be completely obvious, I will enumerate it:

1. get user input
2. send the input to the web site for the search
3. parse the output from the website
4. print the translation to the screen.

Now for the problem. Sending a word with no accents works beautifully. However, words with accented characters never get a hit.

I have tried:

use Encode;
....
$_ = encode("iso-8859-1", $_);

before sending input to the website because I have read that the internet is mostly written using iso-8859, so I thought the web site might need iso input. That didn't work.

I am working with a Spanish site, so I entered the word 'araña' and did a search manually. The site found the word in its database and gave me all the possible translations. I then used that word as input to edict and no results were returned. (This is getting kind of long, eh?)

So, the question is: where in this process are the characters getting muddled, and what can I do about it?
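One way to narrow down where the characters change is to dump the raw bytes of the word at each stage (after reading the input, after encoding, just before sending). The sketch below is only an illustration: `hex_dump` is a hypothetical helper name, and the two sample strings are hard-coded stand-ins for what a Latin-1 or a UTF-8 terminal would hand to the program.

```perl
use strict;
use warnings;

# Hypothetical helper: render each byte of a string as two hex digits.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# Assumed sample inputs: the same word as it would arrive from
# a Latin-1 terminal and from a UTF-8 terminal.
my $as_latin1 = "ara\xF1a";        # 'araña' as ISO-8859-1 bytes
my $as_utf8   = "ara\xC3\xB1a";    # 'araña' as UTF-8 bytes

print hex_dump($as_latin1), "\n";  # 61 72 61 F1 61
print hex_dump($as_utf8),   "\n";  # 61 72 61 C3 B1 61
```

If the dump of the real input shows C3 B1 where the site expects F1 (or the other way round), the mismatch is in the input encoding rather than in the request itself.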

Re: text encoding
by tachyon (Chancellor) on Sep 09, 2004 at 06:52 UTC

    You need to use URL encoding. You can use one of the URI modules, or if you want to roll your own, these are the algorithms. Essentially, each character outside the safe set is encoded as %HH, where HH is the two-digit hex value of its character code (so 'ñ' in ISO-8859-1 becomes %F1). Spaces can be encoded as %20 or, in a query string, as +. To display 'odd' characters correctly in a browser you may need to encode them using HTML::Entities.

    sub url_decode {
        my ( $decode ) = @_;
        return '' unless defined $decode;
        $decode =~ tr/+/ /;
        $decode =~ s/%([a-fA-F0-9]{2})/ pack "C", hex $1 /eg;
        return $decode;
    }

    # RFC 1738
    # Only alphanumerics [0-9a-zA-Z], the special characters $-_.+!*'(),
    # and reserved characters used for their reserved purposes
    # may be used unencoded within a URL. we encode more because of issues
    # that some browsers have with the RFC
    sub url_encode {
        my ( $encode ) = @_;
        return '' unless defined $encode;
        $encode =~ s/([^a-zA-Z0-9_. -])/ uc sprintf "%%%02x",ord $1 /eg;
        $encode =~ tr/ /+/;
        return $encode;
    }
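As a rough illustration of how the character encoding and the URL encoding interact, the sketch below first turns the word into bytes with the core Encode module and then percent-encodes those bytes. `percent_encode_bytes` is a hypothetical helper written for this example, not part of any module, and which of the two results a given site accepts is an assumption that has to be checked against the site itself.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: percent-encode every byte that is not an
# unreserved URL character. Expects a byte string, not characters.
sub percent_encode_bytes {
    my ($bytes) = @_;
    $bytes =~ s/([^A-Za-z0-9_.~-])/sprintf('%%%02X', ord($1))/eg;
    return $bytes;
}

my $word = "ara\x{F1}a";   # 'araña' as a Perl character string

# Step 1: choose a byte encoding. Step 2: percent-encode the bytes.
my $latin1_query = percent_encode_bytes( encode('iso-8859-1', $word) );
my $utf8_query   = percent_encode_bytes( encode('UTF-8',      $word) );

print "$latin1_query\n";   # ara%F1a
print "$utf8_query\n";     # ara%C3%B1a
```

The same word yields two different query strings, so a site that stores its dictionary in one encoding will likely not match a request sent in the other.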

    cheers

    tachyon