Everyone seems to have leapt to the assumption that your "text file with some weird characters" in it is in the Unicode coded character set.

"Unicode coded character set" makes no sense.

If you simply meant "Unicode character set", then I don't see the problem. Unicode is the only character set understood by Perl builtins and its regex engine. (Well, maybe US-ASCII too, depending on how you look at it.) It doesn't make any sense to talk about other character sets.

But then you mention "Windows-1252 character encoding" as a possible alternative to "Unicode coded character set". That would make "Unicode coded character set" some kind of encoding, but a character set is not an encoding. Perhaps you meant "UTF-8 encoding".

If you meant "UTF-8 encoding", then you're wrong about everyone assuming the input was encoded using UTF-8. I, for one, made no assumption whatsoever about the encoding of the input.

(I did assume that $word contained text, but I stated that assumption.)

The former is a multi-byte encoding and the latter is a single-byte encoding. The difference is fundamental.

Not at all. If you want to deal with text, you have to decode the input. It doesn't matter one bit whether it's encoded using a single-byte fixed-width encoding (e.g. Windows-1252), a multi-byte fixed-width encoding (e.g. UCS-2le) or a variable-width encoding (e.g. UTF-8, UTF-16le).
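To illustrate, here is a minimal sketch (not taken from the thread) that decodes the same character from three differently shaped encodings using the core Encode module. The sample bytes and the variable names are my own assumptions, chosen only for demonstration.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Illustrative sample data: the character U+00E9 as raw bytes in three encodings.
my %bytes_for = (
    'cp1252'  => "\xE9",       # single-byte, fixed-width (Windows-1252)
    'UCS-2LE' => "\xE9\x00",   # two-byte, fixed-width
    'UTF-8'   => "\xC3\xA9",   # variable-width
);

my %decoded;
for my $encoding (sort keys %bytes_for) {
    my $bytes = $bytes_for{$encoding};

    # The decoding step is the same call whatever the shape of the encoding.
    my $text = decode($encoding, $bytes);
    $decoded{$encoding} = $text;

    printf "%-8s %d byte(s) -> %d character(s), \\w %s\n",
        $encoding, length($bytes), length($text),
        ($text =~ /^\w$/ ? 'matches' : 'does not match');
}
```

Once decoded, all three strings hold the same single character, so a regex applied to them has no reason to care how many bytes the input used.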

So, first, you need to know whether your text file is in some encoding form of Unicode (e.g., UTF-8) or in the Windows-1252 character encoding—or even possibly in some other legacy encoding.

That should read: "First, you need to know the encoding of the text file (e.g. UTF-8, Windows-1252, etc.)."

Most definitely. In order to have text, you need to decode the input, and you can't do that until you know what encoding was used to produce those bytes.
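As a rough sketch of why the encoding has to be known first, the snippet below decodes the same two bytes under two different assumptions. The byte string is an illustrative sample of my own, not data from the original question.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Illustrative sample: two bytes whose meaning depends on the encoding.
my $bytes = "\xC3\xA9";

my $as_utf8   = decode('UTF-8',  $bytes);   # assume the file was UTF-8
my $as_cp1252 = decode('cp1252', $bytes);   # assume the file was Windows-1252

# Show the code points each assumption produces.
for my $pair ( [ 'UTF-8', $as_utf8 ], [ 'cp1252', $as_cp1252 ] ) {
    my ($name, $text) = @$pair;
    printf "%-6s %d character(s): %s\n",
        $name, length($text),
        join ' ', map { sprintf 'U+%04X', ord } split //, $text;
}
```

Both calls succeed, but they produce different text, so the bytes alone cannot tell you which answer is the right one.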


In reply to Re^2: regexing for non-standard characters... by ikegami
in thread regexing for non-standard characters... by emmiesix
