
Perl & Unicode: state of the art?

by BrowserUk (Patriarch)
on Oct 07, 2013 at 15:37 UTC [id://1057270]

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Is it possible to write a script that, when fed a file containing properly formed Unicode text, will count the number of words and sentences it contains?


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re: Perl & Unicode: state of the art?
by farang (Chaplain) on Oct 07, 2013 at 22:43 UTC

    Is it possible to write a script that, when fed a file containing properly formed Unicode text, will count the number of words and sentences it contains?
    No! Languages of the world are way too complex. Unicode deals with text at the character and grapheme level, which is hard enough. It is silent on what constitutes a word or sentence. It is certainly possible in many cases to define "words" and "sentences" in a way appropriate to some particular expected text format in some known language, but even then there are usually exceptions. Take choroba's code which satisfies a given spec. Is Sports.ru one word or two? Is какое-то two words, as the code determines, or just one as Russian linguists would probably contend? Do all other languages handle hyphenated text similarly? Almost certainly not, as a general rule. The more text considered, the more edge cases and ambiguities arise, even within a single language.

    I am slowly but steadily working to handle Thai and Lao text in Perl. For these languages, sentences are generally delimited by whitespace, and individual words are not delimited at all in the text, but instead are delimited by syntactic rules. Code can and has been written to count individual Thai words, but it is considerably different and more complicated than counting the number of character strings between spaces.
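
    To make that concrete, here is a minimal sketch (an illustration, not farang's code) of what plain Perl regexes can cheaply count in Thai text: whitespace-delimited chunks, which approximate phrase or sentence breaks, and raw Thai characters. Counting actual words would need a dictionary-based segmenter, which is deliberately not attempted here.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use open IO => ':encoding(UTF-8)', ':std';

    # For Thai, whitespace marks phrase/sentence-like breaks, while word
    # boundaries are not marked at all, so only the former is counted here.
    my ($chunks, $thai_chars) = (0, 0);
    while (my $line = <>) {
        $chunks     += () = $line =~ /\S+/g;       # whitespace-delimited runs
        $thai_chars += () = $line =~ /\p{Thai}/g;  # characters in the Thai script
    }
    print "chunks (phrase-level units): $chunks\n";
    print "Thai characters: $thai_chars\n";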

      Thai and Lao text ... these languages, sentences are generally delimited by whitespace, and individual words are not delimited at all in the text, but instead are delimited by syntactic rules.

      So, fair to say that the first requirement for processing Unicode 'text' is to determine the language.

      So then the question becomes: given a file of Unicode text, can the language be determined?


        > can the language be determined?

        You know the answer: only with statistical certainty, and that depends on the length of the text and the distance between the languages.

        Hand and finger (en) <=> Hand und Finger (de)

        Whether the same script leads to the same delimiters can only be answered by someone who knows all 6000 languages of the world.

        But Arabic words should already be a problem, maybe less so if transcribed. Chinese even more so.
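
        As a sketch of that crude, statistical first pass (an illustration, not Rolf's code): tallying characters by Unicode script property at least narrows down which script, and hence which delimiter conventions, might be in play. It cannot separate languages that share a script, as the Hand/Hand example shows.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use open IO => ':encoding(UTF-8)', ':std';

        # Tally characters by a few Unicode script properties. Script is only
        # a weak cue: it cannot tell English from German, but it can suggest
        # whether whitespace-based word splitting is even plausible.
        my %tally;
        while (my $line = <>) {
            for my $script (qw(Latin Cyrillic Arabic Han Thai Lao)) {
                $tally{$script} += () = $line =~ /\p{$script}/g;
            }
        }
        for my $script (sort { $tally{$b} <=> $tally{$a} } grep { $tally{$_} } keys %tally) {
            print "$script: $tally{$script}\n";
        }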

        see also Word_divider and Word#Word_boundaries

        Cheers Rolf

        ( addicted to the Perl Programming Language)

        Again, in the general case, no. There exist texts which are in multiple languages, which may have different syntactic rules. Sometimes the two languages are in separate volumes, or at least separate halves of a volume, but sometimes you'll get the two languages on opposite pages, or in two columns on each page, or even line-by-line translations. And very occasionally you'll even see line-by-line translations in more than two languages. I have a book at home that is trilingual Greek/Latin/English, for example.
Re: Perl & Unicode: state of the art?
by choroba (Cardinal) on Oct 07, 2013 at 15:54 UTC
    It should be possible. Define words and sentences, though :-)
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      Define words and sentences,

      How about starting with the simplest possible definitions (taken literally in the sketch after the list):

      1. Words: whitespace delimited sequences of letters.
      2. Sentences: sets of words delimited by a full stop.
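
      A minimal sketch that takes those two definitions literally (with definition 1 read loosely: a whitespace-delimited token counts as a word if it contains at least one letter). The replies below show how quickly both definitions break down.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use open IO => ':encoding(UTF-8)', ':std';

      # Take the two definitions above literally (a naive sketch):
      #   words     = whitespace-delimited tokens containing at least one letter
      #   sentences = groups of words delimited by a full stop
      my ($words, $sentences) = (0, 0);
      while (my $line = <>) {
          $words     += grep { /\p{L}/ } split ' ', $line;  # letter-bearing tokens
          $sentences += () = $line =~ /\./g;                # full stops
      }
      print "words: $words, sentences: $sentences\n";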

        Words are usually delimited by punctuation, not only by whitespace. Therefore, the following script simply counts runs of letters, delimited by non-letters.
        #!/usr/bin/perl
        use warnings;
        use strict;
        use open IO => ':utf8', ':std';

        my ($words, $sentences);
        while (<>) {
            $words++     for m/\p{L}+/g;
            $sentences++ for m/\./g;
        }
        print "$words $sentences\n";

        Tested on the following text:

        Огонь XXII Зимних олимпийских игр в Сочи во второй раз погас в понедельник в Москве, во время этапа эстафеты олимпийского огня. После нескольких безуспешных попыток снова его зажечь, факел был заменен, передает портал Sports.ru.
        Казус произошел на Раушской набережной, недалеко от Кремля. Видно, как зрители приветствуют факелоносца, он машет в ответ, и через какое-то время факел гаснет.
        
        Output:
        59 5
        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

        How many sentences are there in these examples?

        • He said "I like pie. I also like tickles."
        • He said "The pie cost me £2.30."
        • I like pie
        • Who watches the watchers?

        Some will argue that something like "court martial" is a single word, despite having a space in it.

        And then there are all those pesky non-European languages. Apparently Chinese puts no more space between words than between characters, so word boundaries are not marked by spacing at all.

        So the answer is "probably not", in the general case.
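
        To put numbers on it (a sketch, not part of the reply above): the naive count-the-full-stops rule gives these counts for the examples listed above.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use utf8;                            # the literal £ below is UTF-8 source
        binmode STDOUT, ':encoding(UTF-8)';

        # Apply the naive "a sentence ends at every full stop" rule to the
        # examples above and compare with what a human reader would say.
        my @examples = (
            'He said "I like pie. I also like tickles."',  # one sentence quoting two
            'He said "The pie cost me £2.30."',            # the decimal point is not a break
            'I like pie',                                   # a sentence with no full stop
            'Who watches the watchers?',                    # ends with '?', not '.'
        );
        for my $text (@examples) {
            my $count = () = $text =~ /\./g;               # the naive rule
            print "naive count: $count   text: $text\n";
        }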

Re: Perl & Unicode: state of the art?
by zwon (Abbot) on Oct 07, 2013 at 16:01 UTC
    It depends on the language. In some languages it is not trivial to detect word boundaries, no matter what encoding you are using.
Re: Perl & Unicode: state of the art?
by choroba (Cardinal) on Oct 08, 2013 at 16:00 UTC
    Give me a script that works for ASCII or Latin-1, and I will show you how to adapt it to Unicode. In other words: Unicode is a character encoding, not an NLP (Natural Language Processing) framework.
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
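
    As an illustration of that point (a sketch, not choroba's code): the ASCII/Latin-1 version and the Unicode adaptation of a word counter differ only in how the input is decoded and in which character class counts as a letter; the hard part, deciding what a word is, does not change.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw(decode);

    # ASCII/Latin-1 flavour: operates on raw octets, [A-Za-z] defines a "letter".
    sub count_words_ascii {
        my ($octets) = @_;
        return scalar( () = $octets =~ /[A-Za-z]+/g );
    }

    # Unicode adaptation: decode the octets first, then use the Letter property.
    # Only the decoding step and the character class change; the notion of a
    # word as "a run of letters" stays exactly the same.
    sub count_words_unicode {
        my ($octets, $encoding) = @_;
        my $text = decode($encoding // 'UTF-8', $octets);
        return scalar( () = $text =~ /\p{L}+/g );
    }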
