Re^2: UTF8 Validity


by menolly (Hermit)

on Feb 22, 2008 at 00:47 UTC ( [id://669440]=note )


in reply to Re: UTF8 Validity
in thread UTF8 Validity

Thanks; that's the kind of pointer I need. Most of my non-ASCII/non-UTF8 data is either in contact data or easily connected to contact data, so I've been trying to guess the charset based on the geographic origin, with mixed results. I definitely have multiple encodings present -- so far, there's cp1251 (Cyrillic), latin1, some form of Japanese, and something I can't identify but have scrubbed out in the source DB.

Re^3: UTF8 Validity by graff (Chancellor) on Feb 22, 2008 at 02:18 UTC
Encode::Guess is likely to be helpful for figuring out the source encodings for many of the Asian (multi-byte-char) strings, though it might not help much for distinguishing among single-byte encodings. Worth a try.
Re^4: UTF8 Validity by Anonymous Monk on Feb 22, 2008 at 11:07 UTC
Encode::Guess is lame because the user has to tell it which encodings to consider. Use Encode::Detect instead; it uses the same detector as the Mozilla browsers.
Re^5: UTF8 Validity by menolly (Hermit) on Feb 22, 2008 at 18:23 UTC
I've been using Encode::Guess, but have had trouble building a suspects list for some data. However, Firefox hasn't been able to appropriately handle the problem data, either, so if Encode::Detect is the same method, I doubt it would've done any better on this data.
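The difficulty with a suspects list of single-byte encodings, mentioned earlier in the thread, can be shown with a minimal sketch. It is written in Python purely as an illustration and is not the Encode::Guess or Encode::Detect API; the function name `candidate_decodings`, the suspects list, and the sample string are all hypothetical. Strict UTF-8 decoding rejects the cp1251 bytes outright, but both cp1251 and latin-1 accept them, so decoding alone cannot say which single-byte encoding was intended.

```python
def candidate_decodings(raw, suspects):
    """Return {encoding: text} for every suspect that decodes raw without error."""
    results = {}
    for name in suspects:
        try:
            # bytes.decode() is strict by default: invalid input raises.
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            pass  # this suspect is ruled out
    return results


# Hypothetical suspects list; a real one would come from knowledge of the data.
SUSPECTS = ["utf-8", "cp1251", "latin-1"]

# Sample data: a Russian word stored as cp1251 bytes.
cyrillic = "Привет".encode("cp1251")
hits = candidate_decodings(cyrillic, SUSPECTS)

for name, text in hits.items():
    print(name, repr(text))
```

Here UTF-8 is eliminated because the bytes are not structurally valid UTF-8, while cp1251 and latin-1 both survive. Picking between the survivors needs outside information, such as the geographic origin of the record or a check of whether the decoded text looks like plausible words.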
