
If the data really is in Japanese, then Encode::Guess has a very good chance of figuring out exactly which encoding is being used. The possible encodings are sufficiently distinct from each other that the logic for telling one from another can be quite reliable.
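Here is a minimal sketch of how that guess might look. The helper name decode_japanese, the suspect list, and the sample octets are only my assumptions for illustration; adjust the suspects to whatever encodings you think are plausible for your data.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding() by default

# Hypothetical helper: returns (encoding name, decoded text) or dies.
sub decode_japanese {
    my ($octets) = @_;

    # Suspect list is an assumption; ascii and utf8 are always tried as well.
    my $enc = guess_encoding( $octets, qw/euc-jp cp932 7bit-jis/ );

    # On failure guess_encoding() returns an error string, not an object.
    ref $enc or die "Cannot guess encoding: $enc\n";

    return ( $enc->name, $enc->decode($octets) );
}

# Sample octets (assumed to be cp932): HIRAGANA A, wide comma, HIRAGANA I.
my ( $name, $text ) = decode_japanese("\x82\xA0\x81\x41\x82\xA2");
printf "%s, %d characters\n", $name, length $text;
```

Note that listing two near-identical suspects (say, both shiftjis and cp932) tends to make the guess ambiguous, so it pays to keep the list short.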

For that matter, I can easily look at the standard Unicode-to-non-Unicode mapping tables (available from http://www.unicode.org/Public/MAPPINGS/) and see that there is only one non-Unicode encoding where 0x8141 maps to U+3001 "IDEOGRAPHIC COMMA" -- and that happens to be cp932. (updated to make the unicode.org link more specific)
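If you would rather check that mapping from Perl than read the tables, something like this sketch should do it. It only confirms what cp932 does with those two octets; it says nothing about other encodings.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();    # for charnames::viacode()

# Decode the two octets as cp932 and report the resulting code point.
my $char      = decode( "cp932", "\x81\x41" );
my $codepoint = sprintf "U+%04X", ord $char;
my $charname  = charnames::viacode( ord $char );

print "$codepoint $charname\n";
```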

In any case, the one thing you DO NOT want to do is anything like this on a "raw" string:

split( /\x81\x41/, $txt );

That's because there is a reasonable chance that this 2-byte sequence could occur such that the "\x81" is actually the second byte of some other two-byte character, rather than being the first byte of a "wide comma". The result will be that you split in the middle of a wide character, and the data you get will be trashed. (I know this from personal experience -- Perl 5.8 was a godsend for me.)
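To make that failure concrete, here is a small sketch assuming cp932 data. The sample is katakana ME (octets 0x83 0x81) followed by a plain ASCII "A" (0x41), so there is no wide comma in it at all, yet the byte-level pattern still matches.

```perl
use strict;
use warnings;
use Encode qw(decode);

# cp932 octets for KATAKANA LETTER ME (0x83 0x81) followed by ASCII "A" (0x41).
my $raw = "\x83\x81\x41";

# Byte-level split: matches the trail byte 0x81 plus the "A",
# leaving a lone lead byte behind.
my @broken = split /\x81\x41/, $raw;

# Character-level split: decode first, then look for U+3001.
my @intact = split /\x{3001}/, decode( "cp932", $raw );

printf "raw split:     %s\n", join " ", map { unpack "H*", $_ } @broken;
printf "decoded split: %d field(s), %d characters\n",
    scalar @intact, length $intact[0];
```

The raw split leaves a stray 0x83 that is not a valid character on its own, while the decoded split returns the text untouched.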

Find out (or figure out) what the encoding really is, use Encode to convert it to a utf8 string, find out the Unicode code point for your comma character, and split on that. Assuming my deduction about cp932 is correct, then something like this will do the right thing:

use Encode;   # provides decode()
split /\x{3001}/, decode( "cp932", $txt );

(updated to fix a typo in the charset name)

No possibility of "false-alarm" (mis)matches that way. You can easily convert back to cp932 for output if you want, but any string manipulation within your Perl script is best done on utf8 data.
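A rough sketch of the whole round trip, again assuming the data is cp932 (the sample octets and variable names are just for illustration):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# cp932 octets: HIRAGANA A, wide comma (0x81 0x41), HIRAGANA I.
my $raw = "\x82\xA0\x81\x41\x82\xA2";

# Work on characters: decode once, split on the Unicode code point.
my @fields = split /\x{3001}/, decode( "cp932", $raw );

# Convert each field back to cp932 octets only at output time.
my @out = map { encode( "cp932", $_ ) } @fields;

# Show the octets as hex so the result is readable on any terminal.
print unpack( "H*", $_ ), "\n" for @out;
```

When writing the octets themselves to a file, an alternative is to open the handle with an ":encoding(cp932)" layer and print the decoded strings directly.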

In reply to Re: parsing non english by graff
in thread parsing non english by arcnon
