Re: Perl & Unicode: state of the art?
by farang (Chaplain) on Oct 07, 2013 at 22:43 UTC
|
Is it possible to write a script that, when fed a
file containing properly formed Unicode text, will count the
number of words and sentences it contains? No!
Languages of the world are way too complex. Unicode deals with text
at the character and grapheme level, which is hard enough. It is
silent on what constitutes a word or sentence. It is certainly
possible in many cases to define "words" and "sentences" in a way
appropriate to some particular expected text format in some known
language, but even then there are usually exceptions. Take
choroba's code, which satisfies a given spec. Is Sports.ru
one word or two? Is какое-то two words, as the code
determines, or just one as Russian linguists would probably contend?
Do all other languages handle hyphenated text similarly? Almost certainly not,
as a general rule. The more text considered, the more edge cases
and ambiguities arise, even within a single language. I am
slowly but steadily working to handle Thai and Lao text in Perl.
For these languages, sentences are generally delimited by
whitespace, and individual words are not delimited at all in the
text, but instead are delimited by syntactic rules. Code can be, and
has been, written to count individual Thai words, but it is
considerably different and more complicated than counting the
number of character strings between spaces.
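To give a feel for why this differs from splitting on spaces, here is a minimal sketch of greedy longest-match segmentation against a word list. It is not the code referred to above: the segment() name, the three-word dictionary and the sample phrase are made up for illustration, and a real Thai segmenter would need a large lexicon plus some way of resolving ambiguous splits.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                               # this source contains Thai literals
binmode STDOUT, ':encoding(UTF-8)';

# Greedy longest-match segmentation against a word list (hash ref).
# Characters not covered by the dictionary are emitted one at a time.
sub segment {
    my ($text, $dict) = @_;

    # Length of the longest dictionary word, at least 1 so we always advance.
    my $max = 1;
    for my $word (keys %$dict) {
        $max = length($word) if length($word) > $max;
    }

    my @tokens;
    my $pos = 0;
    while ($pos < length($text)) {
        my $remaining = length($text) - $pos;
        my $len = $max < $remaining ? $max : $remaining;

        # Shrink the candidate until it is a known word or a single character.
        $len-- while $len > 1 && !exists $dict->{ substr($text, $pos, $len) };

        push @tokens, substr($text, $pos, $len);
        $pos += $len;
    }
    return @tokens;
}

# Toy dictionary (hypothetical): "I", "love", "you".
my %dict  = map { $_ => 1 } ('ฉัน', 'รัก', 'คุณ');
my @words = segment('ฉันรักคุณ', \%dict);
print scalar(@words), " words: @words\n";
```

Greedy matching is only one possible strategy; it picks the wrong split whenever a longer dictionary word happens to swallow the start of the next one, which is part of why the real thing is more complicated.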
| [reply] |
|
Thai and Lao text ... these languages, sentences are generally delimited by whitespace, and individual words are not delimited at all in the text, but instead are delimited by syntactic rules.
So, fair to say that the first requirement for processing Unicode 'text' is to determine the language.
So then the question becomes: given a file of Unicode text, can the language be determined?
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
|
> can the language be determined?
You know the answer: only with statistical certainty, and depending on the length of the text and the distance between the languages.
Hand and finger (en) <=> Hand und Finger (de)
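A rough sketch of why short texts give so little evidence: score a text by counting hits against small stopword lists. The language_scores() name and the four-word lists are assumptions made up for this example; a real detector would use far larger statistical models (character n-grams, for instance).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tiny, hypothetical stopword lists, just enough for the example.
my %stopwords = (
    en => [qw(the and of is)],
    de => [qw(der und ist das)],
);

# Returns a hash ref mapping language code to the number of stopword hits.
sub language_scores {
    my ($text) = @_;
    my @tokens = map { lc $_ } ($text =~ /\p{L}+/g);

    my %score;
    for my $lang (keys %stopwords) {
        my %is_stop = map { $_ => 1 } @{ $stopwords{$lang} };
        # grep in scalar context yields the number of matching tokens.
        $score{$lang} = grep { $is_stop{$_} } @tokens;
    }
    return \%score;
}

for my $text ('Hand and finger', 'Hand und Finger', 'Hand Finger') {
    my $s = language_scores($text);
    printf "%-16s en=%d de=%d\n", $text, $s->{en}, $s->{de};
}
```

The first two phrases are separated by a single word, and the third scores zero for both languages, so nothing can be decided from it at all.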
Whether the same script always implies the same delimiters can only be answered by someone who knows all 6000 languages of the world.
But Arabic words are probably already a problem, perhaps less so if transcribed. Chinese even more so.
See also Word_divider and Word#Word_boundaries
Cheers Rolf
( addicted to the Perl Programming Language)
| [reply] |
|
Again, in the general case, no. There exist texts which are in multiple languages, which may have different syntactic rules. Sometimes the two languages are in separate volumes, or at least separate halves of a volume, but sometimes you'll get the two languages on opposite pages, or in two columns on each page, or even line-by-line translations. And very occasionally you'll even see line-by-line translations in more than two languages. I have a book at home that is trilingual Greek/Latin/English, for example.
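As a rough illustration, one can at least tally words by Unicode script. The sketch below does that for a made-up trilingual line; the script_tally() name and the choice of scripts are assumptions for the example, and each token is classified by its first character only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                               # the sample line contains Greek literals

# Scripts to look for; anything else is counted as 'Other'.
my %script_re = (
    Latin    => qr/\A\p{Latin}/,
    Greek    => qr/\A\p{Greek}/,
    Cyrillic => qr/\A\p{Cyrillic}/,
);

# Returns a hash ref mapping script name to the number of tokens
# whose first character belongs to that script.
sub script_tally {
    my ($text) = @_;
    my %count;
    for my $token ($text =~ /[\p{L}\p{M}]+/g) {
        my ($script) = grep { $token =~ $script_re{$_} } sort keys %script_re;
        $count{ $script // 'Other' }++;
    }
    return \%count;
}

my $line  = 'Ἐν ἀρχῇ ἦν ὁ λόγος / In principio erat Verbum / In the beginning was the Word';
my $tally = script_tally($line);
printf "%s=%d\n", $_, $tally->{$_} for sort keys %$tally;
```

For this line the tally should come out as Greek=5 and Latin=10: the Latin and English words land in the same bucket, so script detection alone does not tell you which language's rules to apply.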
| [reply] |
Re: Perl & Unicode: state of the art?
by choroba (Cardinal) on Oct 07, 2013 at 15:54 UTC
|
It should be possible. Define words and sentences, though :-)
| [reply] |
|
Words are usually delimited by punctuation, not only whitespace. Therefore, the following script counts runs of letters, delimited by non-letters, as words, and full stops as sentence ends.
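#!/usr/bin/perl
use warnings;
use strict;
# Decode input files and STDIN, and encode STDOUT/STDERR, as UTF-8.
use open IO => ':encoding(UTF-8)', ':std';
# Start at 0 so that empty input prints "0 0" instead of warning.
my ($words, $sentences) = (0, 0);
while (<>) {
    $words++     for m/\p{L}+/g;   # each run of letters is one word
    $sentences++ for m/\./g;       # each full stop ends one sentence
}
print "$words $sentences\n";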
Tested on the following text:
Огонь XXII Зимних олимпийских игр в Сочи во второй раз погас в понедельник в Москве, во время этапа эстафеты олимпийского огня. После нескольких безуспешных попыток снова его зажечь, факел был заменен, передает портал Sports.ru.
Казус произошел на Раушской набережной, недалеко от Кремля. Видно, как зрители приветствуют факелоносца, он машет в ответ, и через какое-то время факел гаснет.
Output:
59 5
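How much the count depends on the definition can be seen by putting a second one next to it. In the sketch below, definition B (letter runs joined by an internal hyphen or period count as one word) is just an assumption for comparison, not a linguistic standard, and both sub names are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                               # the sample strings are UTF-8 literals
binmode STDOUT, ':encoding(UTF-8)';

# Definition A (as in the script above): a word is a run of letters.
sub count_letter_runs {
    my ($text) = @_;
    my $n = () = $text =~ /\p{L}+/g;
    return $n;
}

# Definition B (an assumption): letter runs joined by a single
# internal hyphen or period count as one word.
sub count_joined_words {
    my ($text) = @_;
    my $n = () = $text =~ /\p{L}+(?:[-.]\p{L}+)*/g;
    return $n;
}

for my $sample ('портал Sports.ru.', 'через какое-то время') {
    printf "%s: A=%d B=%d\n", $sample,
        count_letter_runs($sample), count_joined_words($sample);
}
```

The two samples come from the test text: A gives 3 and 4 words, B gives 2 and 3. Neither is right in general; B would also glue together a sentence end and the next word if the space after the period were missing.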
| [reply] |
|
How many sentences are there in these examples?
- He said "I like pie. I also like tickles."
- He said "The pie cost me £2.30."
- I like pie
- Who watches the watchers?
Some will argue that something like "court martial" is a single word, despite containing a space.
And then there are all those pesky non-European languages. Apparently Chinese uses the same space between words as between characters.
So the answer is "probably not", in the general case.
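To make that concrete, here is a small sketch that runs two counting rules over the four examples above. Both rules are assumptions for illustration (the sub names are made up): one counts every full stop, the other counts a run of . ! ? that is optionally followed by closing quotes or brackets and then by whitespace or the end of the text.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                               # one sample contains a pound sign
binmode STDOUT, ':encoding(UTF-8)';

# Naive rule: every full stop ends a sentence.
sub count_full_stops {
    my ($text) = @_;
    my $n = () = $text =~ /\./g;
    return $n;
}

# Heuristic rule (an assumption): terminators followed by optional
# closing quotes/brackets, then whitespace or end of text.
sub count_terminators {
    my ($text) = @_;
    my $n = () = $text =~ /[.!?]+["'\x{201D}\x{2019})]*(?=\s|\z)/g;
    return $n;
}

my @samples = (
    'He said "I like pie. I also like tickles."',
    'He said "The pie cost me £2.30."',
    'I like pie',
    'Who watches the watchers?',
);

for my $s (@samples) {
    printf "naive=%d heuristic=%d  %s\n",
        count_full_stops($s), count_terminators($s), $s;
}
```

The naive rule gives 2, 2, 0, 0 and the heuristic gives 2, 1, 0, 1. The heuristic copes with the price and the question mark, but it still reports no sentence for "I like pie" and two for the quoted speech, which many readers would call one.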
| [reply] |
Re: Perl & Unicode: state of the art?
by zwon (Abbot) on Oct 07, 2013 at 16:01 UTC
|
It depends on the language. In some languages it is not trivial to detect word boundaries, no matter what encoding you are using.
| [reply] |
Re: Perl & Unicode: state of the art?
by choroba (Cardinal) on Oct 08, 2013 at 16:00 UTC
|
Give me a script that works for ASCII or Latin-1, and I will show you how to adapt it to Unicode. In other words: Unicode is a character encoding, not an NLP (Natural Language Processing) framework.
| [reply] |