bogdan77 has asked for the wisdom of the Perl Monks concerning the following question:
As you can see, I try to define the delimiters for the split as anything except the characters included in this set --> "-A-Z®À-ÖØ-öù-žǍ-ȚḀ-ẛẠ-ỹ™", but I still get unwanted punctuation characters attached to words... There's no overlapping between these ranges (i.e., they are in the proper order, with "-" first at 0045 and "™" (the trademark symbol) last at 8482, if I remember correctly :^) So... what am I doing wrong? Please enlighten me... (I have Perl 5.8.8 on Mac OS X, if that matters)

    #!/usr/bin/perl
    use warnings;
    $textBlock="Belgium is a monarchy. Ionuț, close the door. The buildings in Japan. This is a sky-blue* \"material\". Here's the list: The Eire Canal† is old… Is Apple™ only a computer company? www.fjydhjfjxerhuir.com is the site — go there. old-man-of-the-woods (forget-me-not) [digital] bogus_address99@geemail.com She's there. replace «date» They live in a kraal¹. façade Mac OS X® is slated to ship in May. Überzone șăâțî http://language.perl.com/faq";
    @wordList=split/[^-A-Z®À-ÖØ-öù-žǍ-ȚḀ-ẛẠ-ỹ™]+/i, $textBlock;
    foreach (@wordList) {
        print "$_\n";
    }
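One likely cause, offered as a guess rather than a confirmed diagnosis: without `use utf8;`, Perl reads the non-ASCII literals in the script as raw bytes, so the character class is built from byte ranges instead of the intended characters, and stray bytes of multi-byte punctuation can end up attached to words. The sketch below assumes the script file is saved as UTF-8. The helper name `split_words` is hypothetical, and the class uses the Unicode properties `\p{L}` and `\p{M}` in place of the hand-written ranges, which is a substitution and not the original approach.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # assumption: this file is saved as UTF-8
binmode STDOUT, ':encoding(UTF-8)';    # encode wide characters on output

# split_words is a hypothetical helper, not part of the original script.
sub split_words {
    my ($text) = @_;
    # Any run of characters that are not letters, combining marks,
    # hyphens, ® or ™ is treated as a delimiter.
    # grep drops the empty leading field that split produces when the
    # text starts with a delimiter.
    return grep { length } split /[^\p{L}\p{M}\-®™]+/, $text;
}

my @words = split_words(
    "Ionuț, close the door. Is Apple™ only a computer company? façade Überzone șăâțî"
);
print "$_\n" for @words;
```

With `use utf8;` in place, the original explicit ranges should also be interpreted as characters, so swapping them back in for `\p{L}\p{M}` is an option if only those particular ranges are wanted. Text read from a file or STDIN would need to be decoded as well, for example with an `:encoding(UTF-8)` layer on the input handle.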
Re: split text into words -- Unicode problem (I guess)
by andye (Curate) on Mar 29, 2007 at 14:13 UTC

Re: split text into words -- Unicode problem (I guess)
by dk (Chaplain) on Mar 29, 2007 at 14:26 UTC
by bogdan77 (Initiate) on Mar 29, 2007 at 14:49 UTC
by bogdan77 (Initiate) on Mar 29, 2007 at 14:55 UTC
by Anonymous Monk on Mar 31, 2007 at 10:02 UTC
by dk (Chaplain) on Mar 29, 2007 at 15:10 UTC
by bogdan77 (Initiate) on Mar 29, 2007 at 15:25 UTC