encoding data containing non english characters so as to be decoded by non perl program

ranjan_jajodia has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
I have a Perl program which grabs data from a few sites (which may have non-English content) and prints it to the console. A separate Java program captures this output for further processing. Because of the non-English characters, I guess I have to encode the data before printing, since the Java program reads byte by byte and won't handle such characters correctly otherwise. I have used HTML::Entities for the encoding.
How do I decode it to get the original data back in my Java program? Am I using the wrong method for this? If so, I would really like to know the correct way. Thanks,
Ranjan

Re: encoding data containing non english characters so as to be decoded by non perl program by Jenda (Abbot) on Sep 22, 2004 at 11:51 UTC
I think you are asking in the wrong place. Try asking the poor souls at http://www.javajunkies.org which encodings they are able to decode, and how. If necessary, come back afterwards and ask how to encode the data into that encoding in Perl.
Jenda
Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live. -- Rick Osborne
Re^2: encoding data containing non english characters so as to be decoded by non perl program by ranjan_jajodia (Monk) on Sep 22, 2004 at 12:29 UTC
Hi Jenda, The problem is that I do not know how to find out what kind of characters are present on a web page (from which I grabbed the data). Is there a rule of thumb regarding the encoding of the data that I get using useragent->get()? If I can have the data in UTF-8 format, I can use Java ( read junk ;-) ) to decode it. Ranjan
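For the Java side of this, here is a minimal sketch of reading UTF-8 text from a byte stream. It assumes the Perl program really does write UTF-8 to its standard output (for example after calling binmode(STDOUT, ":encoding(UTF-8)")). The class name Utf8LineReader and the method readLines are hypothetical names made up for illustration, not part of any library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: decodes a byte stream as UTF-8, line by line.
class Utf8LineReader {

    // Wrap the raw bytes in a reader that decodes UTF-8, instead of
    // treating each byte as one character.
    static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: System.in is connected to the Perl program's output,
        // e.g. through a shell pipe.
        for (String line : readLines(System.in)) {
            System.out.println(line.length() + " chars: " + line);
        }
    }
}
```

The point of the sketch is that the decoding happens in InputStreamReader, so the rest of the Java program works with characters rather than bytes. If the Perl output is not UTF-8, the charset passed to InputStreamReader has to change to match.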
Re^3: encoding data containing non english characters so as to be decoded by non perl program by graff (Chancellor) on Sep 23, 2004 at 05:25 UTC
I'm no expert on this, but based on my own limited experience (and watching others getting broader experience with it), I'd say it depends on which (human) languages you're dealing with, how many web sites, and which web sites.

If you're talking about sites/pages that are "mostly English with a few funny characters", the pages will often just use HTML entities (e.g. &eacute; for é, etc).

Sometimes, the character encoding is specified somewhere in a MIME header or the HTML header, or maybe even in HTML comments. (This is typical if it's an open-standard character set, or a widely-used commercial one, like utf8, iso8859-whatever, Big5, ShiftJIS, GBK, etc.)

Other times (especially when the site is presenting stuff in two or more languages on the same page), the character-set info is tucked away in font tags.

Worst of all are the South Asian languages (Hindi, Bengali, Tamil, etc) where the font rendering is kinda tough, and various major web sites come up with very different solutions -- i.e. incompatible font encodings -- which means that when you visit one of these sites the first time, you have to download their font in order to read the text. Converting this stuff to any sort of standard character set is a supreme pain in the a**.

Basically, the answer is: there is no general solution -- but if your task is limited to a few sites/languages/character sets, you can get something to work within those bounds.
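For the case where the encoding is declared in a MIME header, here is a rough sketch in Java of pulling the charset out of a Content-Type value and decoding the page bytes with it. The class CharsetSniffer and its methods are hypothetical names for illustration. Falling back to ISO-8859-1 when nothing is declared is an assumption of this sketch, not a rule that fits every site, and the sketch does not handle font-based encodings at all.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: picks a charset from a Content-Type value.
class CharsetSniffer {

    // Matches "charset=NAME", with optional quotes, in any letter case.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"'>]+)", Pattern.CASE_INSENSITIVE);

    // Return the declared charset, or the fallback if none is declared
    // or the declared name is not known to this JVM.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    // Decode the raw page bytes using the declared charset.
    // Assumption: ISO-8859-1 is used when the header says nothing.
    static String decode(byte[] body, String contentType) {
        Charset cs = fromContentType(contentType, StandardCharsets.ISO_8859_1);
        return new String(body, cs);
    }

    public static void main(String[] args) {
        byte[] body = {(byte) 0xC3, (byte) 0xA9}; // UTF-8 bytes for e-acute
        String text = decode(body, "text/html; charset=UTF-8");
        System.out.println(text.length());
    }
}
```

The same pattern can be tried against the content attribute of an HTML meta tag, since that attribute usually carries the same "charset=..." text. Once the text is decoded, it can be re-encoded as UTF-8 before it is handed to the next program.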
Re^4: encoding data containing non english characters so as to be decoded by non perl program by ranjan_jajodia (Monk) on Sep 23, 2004 at 14:59 UTC

