Hello all,

I have a script that gets data from Google AdSense. The data is in Unicode (UTF-16, I believe). When I try to pattern match on the data, I can only match one character at a time. Any pattern that looks for more than one character in sequence fails.

A typical line looks like:

5/18/05 184 7 3.8% 6.14 1.13

Matching \d works, but attempting to match \d{2}, \d+\/ or anything else that catches two characters in sequence fails. I take it this is because Unicode uses more than one byte per character.

I'm only extracting data from this Unicode text, and do not need to output Unicode. Why don't the regexps work? If they're not supposed to work, how can I convert the text to ISO-8859-1/Latin1? I tried converting with iconv (using -f UTF-16 -t UTF-8), but to no avail: the output was still UTF-16 regardless of the arguments.

Thanks in advance for your help.
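
For illustration, here is a minimal sketch of one common way to handle this in Perl: decode the UTF-16 bytes into Perl characters with the core Encode module before matching. It assumes the data really is UTF-16 with a byte-order mark, which the post only guesses at. The `$raw` string is built by hand to stand in for the AdSense data, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for the downloaded data: the sample line as UTF-16LE bytes,
# preceded by a byte-order mark. (Assumption: the real data looks like this.)
my $raw = encode('UTF-16LE', "\x{FEFF}5/18/05 184 7 3.8% 6.14 1.13\n");

# In the raw bytes every ASCII character is followed by a NUL byte,
# so two digits are never adjacent and \d{2} has nothing to match.
my $raw_matches = $raw =~ /\d{2}/ ? 1 : 0;

# Decode bytes into characters. 'UTF-16' reads the BOM to pick the
# byte order and drops it from the result.
my $text = decode('UTF-16', $raw);

# Multi-character patterns now behave as expected.
my $text_matches = $text =~ /\d{2}/ ? 1 : 0;

# Split on whitespace and pull the date apart.
my @fields = split ' ', $text;
my ($month, $day, $year) = $fields[0] =~ m{^(\d+)/(\d+)/(\d+)$};

print "raw matched: $raw_matches, decoded matched: $text_matches\n";
print "date: $month/$day/$year, fields: ", scalar(@fields), "\n";
```

If the data comes from a file rather than a string, opening it with an encoding layer, for example `open my $fh, '<:encoding(UTF-16)', $file`, should decode it as it is read. Once decoded, the text does not need to be converted to Latin-1 just to extract values from it.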

In reply to Unicode and Regexps: convert or am I missing something? by newrisedesigns
