moritz,
I might add that in human language \w+ is usually both too broad (because it contains the underscore _ and digits, which usually don't appear in words in human language) and too narrow (don't), and in general word recognition is language dependent.

I already said that in different words. Do you think it needs to be clearer?

The seeker may want to define their own character class, perhaps removing _ and 0-9 from \w but adding the apostrophe and hyphen, so that it matches words like "don't" and "president-elect" and does not match strings like "th_500X". It is worth pointing out, though, that such a class will still match "Z-''-Z".
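To make that concrete, here is a minimal sketch of such a class. It is written in Python rather than Perl purely for illustration, and it assumes ASCII letters only; the names WORD and is_word are made up for this example and are not from the original post.

```python
import re

# Hypothetical character class: ASCII letters plus apostrophe and hyphen.
# The hyphen is placed last inside the brackets so it is taken literally.
WORD = re.compile(r"[A-Za-z'-]+")

def is_word(s):
    """Return True if the whole string consists only of characters in the class."""
    return WORD.fullmatch(s) is not None

samples = ["don't", "president-elect", "th_500X", "Z-''-Z"]
results = {s: is_word(s) for s in samples}

# For comparison: plain \w+ rejects "don't" but accepts "th_500X".
plain_dont = re.fullmatch(r"\w+", "don't") is not None
plain_junk = re.fullmatch(r"\w+", "th_500X") is not None

for s, ok in results.items():
    print(f"{s!r}: {ok}")
```

The class accepts "don't" and "president-elect" and rejects "th_500X", but "Z-''-Z" still gets through, which is exactly the kind of case the seeker would have to decide about.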

Regarding the Unicode comment: I mentioned encoding and dealing with foreign languages only in passing, because the same pitfalls still apply. If you only want "real" words found in some dictionary, properly handling Unicode will not by itself fix the problem of "aaaaaa" and its grave-accented counterpart.
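A rough sketch of that point, again in Python for illustration only. The pattern below is just an approximation of "a run of Unicode letters", and KNOWN_WORDS is a tiny made-up stand-in for whatever dictionary the seeker actually has in mind.

```python
import re

# Approximation of "one or more Unicode letters": word characters
# minus digits and the underscore. This is an assumption, not a full
# definition of a letter.
LETTERS = re.compile(r"[^\W\d_]+")

# Hypothetical dictionary, for illustration only.
KNOWN_WORDS = {"word", "caf\u00e9"}

def looks_like_word(s):
    """True if the whole string is made of letter-like characters."""
    return LETTERS.fullmatch(s) is not None

def is_real_word(s):
    """True only if the string is letter-like and in the dictionary."""
    return looks_like_word(s) and s.lower() in KNOWN_WORDS

plain = "aaaaaa"
accented = "\u00e0" * 6  # six copies of a-with-grave (precomposed form)

for candidate in (plain, accented, "caf\u00e9"):
    print(candidate, looks_like_word(candidate), is_real_word(candidate))
```

Both "aaaaaa" and its accented counterpart pass the Unicode-aware pattern, yet neither is in the dictionary. Getting the encoding right is necessary, but it does not answer the question of what counts as a word.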

In other words, what the seeker wants is very subjective, and only they can answer the questions necessary to provide an adequate solution. Forgetting about encoding will complicate the problem, but knowing about it won't necessarily make the problem go away. The entire picture is required.

Thank you for your comment and the Unicode link - a good tool indeed.

Cheers - L~R


In reply to Re^2: What Is A Word? by Limbic~Region
in thread What Is A Word? by Limbic~Region
