Just let me note that if you don't define the encoding of the filehandle you're reading from (DATA here), the strings you read in will be byte strings, and matching a Unicode class such as /\p{Alnum}/ against them won't make much sense. In this case, perl will act as if the string were iso_8859_1-encoded. (You can call this a bug or a feature.) This might not work with text in a different encoding, such as iso_8859_2. It will accidentally work with Hungarian text encoded as iso_8859_2, because the only Hungarian letters not in 8859_1 are \x{151}, \x{150}, \x{171}, \x{170}, which sit at positions \xf5, \xd5, \xfb, \xdb, and those are letters (although different letters) in 8859_1. However, other languages use letters such as \x{15b}, which is encoded in 8859_2 as \xb6, and that's a non-alnum symbol in 8859_1. With other encodings, such as utf-8, you'll probably have even more serious failures.
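To make the \xf5 versus \xb6 point concrete, here is a minimal sketch. It assumes a reasonably modern perl, where \p{...} classes apply Unicode rules even to byte strings; the helper name is_alnum_byte is made up for this illustration and the two bytes are hard-coded instead of being read from a file.

```perl
use strict;
use warnings;

# Hypothetical helper: does this single undecoded byte match \p{Alnum}?
# On a byte string, perl treats the byte as the Latin-1 code point.
sub is_alnum_byte {
    my ($byte) = @_;
    return $byte =~ /\p{Alnum}/ ? 1 : 0;
}

# Bytes as they would arrive from an undecoded filehandle holding
# iso_8859_2 text.
my @samples = (
    [ "\xf5", '\x{151} in 8859_2, a different letter in 8859_1' ],
    [ "\xb6", '\x{15b} in 8859_2, a non-alnum symbol in 8859_1' ],
);

for my $sample (@samples) {
    my ($byte, $note) = @$sample;
    printf "0x%02X (%s): %s\n",
        ord $byte, $note, is_alnum_byte($byte) ? 'alnum' : 'not alnum';
}
```

The first byte passes only by accident, and the second one is silently dropped from any word you try to match.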

If you want to match letters in non-ascii texts, you have two options. One is to work with decoded character strings: either set the encoding of the filehandle (with binmode, 3-arg open, the open pragma, the encoding pragma, the -C command line option, the PERLIO env-var, or some other way), or decode the string with the Encode module after reading. The other is to stay with byte strings, set the correct locale with the environment variables (the locale carries information about the character set, such as which chars are alphabetic), say use locale; to make the matching locale-aware, and match with /\w/ or /[[:alnum:]]/.
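A small sketch of the first option, under some assumptions: the input is iso_8859_2, and the two-byte sample in $bytes is invented for the demonstration (an in-memory filehandle stands in for a real file). It shows both decoding after reading and putting an :encoding layer on the handle with 3-arg open. The locale route is left out because it depends on which locales are installed on your system.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample input: \x{15b} and \x{151} encoded as iso_8859_2.
my $bytes = "\xb6\xf5\n";

# Way 1: read raw bytes, then decode them with the Encode module.
my $decoded = decode('iso-8859-2', $bytes);

# Way 2: let an encoding layer decode while reading.
# The in-memory filehandle (\$bytes) is only a stand-in for a real file.
open my $fh, '<:encoding(iso-8859-2)', \$bytes
    or die "Cannot open in-memory handle: $!";
my $line = <$fh>;
close $fh;

# Both strings are now character strings, so \p{Alnum} sees real letters.
for my $str ($decoded, $line) {
    my @letters = $str =~ /(\p{Alnum})/g;
    printf "%d alnum chars: %s\n",
        scalar @letters, join ' ', map { sprintf '\x{%x}', ord } @letters;
}
```

Either way, the \xb6 byte that failed to match as a raw byte now counts as a letter, because perl knows it stands for \x{15b}.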

Update: for peacekorea: please don't let this discussion confuse or frighten you, it's not really important for the original goal. I'd just like to spread information about internationalization for the American monk who naïvely thinks other languages all use 8859_1 with just a handful of accented letters.


In reply to Re^2: a question about making a word frequency matrix by ambrus
in thread a question about making a word frequency matrix by peacekorea
