Re^2: Perl & Unicode: state of the art?

Thai and Lao text ... these languages, sentences are generally delimited by whitespace, and individual words are not delimited at all in the text, but instead are delimited by syntactic rules.

So, fair to say that the first requirement to process Unicode 'text' is to determine the language.

So then the question becomes: given a file of Unicode text, can the language be determined?
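
Determining the script is the easier half of that question. As a minimal sketch (Python, standard `unicodedata` module only), the first word of each character's Unicode name can stand in as a rough script label. That shortcut is an assumption and not the real Unicode Script property, and `script_histogram` and `dominant_script` are hypothetical names used only for illustration:

```python
import unicodedata
from collections import Counter


def script_histogram(text):
    """Count letters per rough script label.

    Assumption: the first word of a character's Unicode name
    (e.g. 'THAI', 'LAO', 'LATIN') approximates its script.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace, combining marks
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts


def dominant_script(text):
    """Return the most frequent script label, or None if no letters were found."""
    counts = script_histogram(text)
    return counts.most_common(1)[0][0] if counts else None
```

This tells Thai from Lao from Latin, but English and German text both come out as `LATIN`, so knowing the script does not yet settle the language.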

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^3: Perl & Unicode: state of the art? by LanX (Saint) on Oct 08, 2013 at 00:45 UTC
> can the language be determined?

You know the answer: only with statistical certainty, and dependent on the length of the text and the distance between the languages.

Hand and finger (en) <=> Hand und Finger (de)

Whether the same script leads to the same delimiters can only be answered by someone who knows all 6000 languages of the world. But Arabic words should already be a problem, maybe less so if transcribed. Chinese even more.

See also Word_divider and Word#Word_boundaries.

Cheers Rolf

( addicted to the Perl Programming Language)
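
To make the "statistical certainty" point concrete, here is a rough sketch in Python that scores a text against tiny function-word lists. The word lists, `score_languages` and `guess_language` are all made up for illustration; a real detector would use far larger models (character n-grams, for example) and many more languages:

```python
import re

# Tiny hand-picked function-word lists: an assumption for illustration only.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it"},
    "de": {"der", "die", "das", "und", "zu", "in", "ist", "nicht"},
}


def score_languages(text):
    """Count, per language, how many words of the text are known function words."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    return {lang: sum(w in vocab for w in words)
            for lang, vocab in STOPWORDS.items()}


def guess_language(text):
    """Return the best-scoring language, or None when there is no evidence or a tie."""
    scores = score_languages(text)
    best = max(scores.values())
    winners = [lang for lang, s in scores.items() if s == best]
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]
```

With these lists, "Hand and finger" and "Hand und Finger" are told apart by a single word each, and "Hand Finger" gives no answer at all, which is the dependence on text length in miniature.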
Re^4: Perl & Unicode: state of the art? by BrowserUk (Patriarch) on Oct 08, 2013 at 02:16 UTC
> You know the answer

Nope. If I knew, I wouldn't be asking.
Re^5: Perl & Unicode: state of the art? by Discipulus (Canon) on Oct 08, 2013 at 07:32 UTC
Welcome back to Babel, brothers.. Languages are living things; poetry is a valid form of a language. Processors are mechanical things: no way to cover all the cases. Perl is digital and my brain is analog. No hope, sorry.

there are no rules, there are no thumbs..
Re^3: Perl & Unicode: state of the art? by DrHyde (Prior) on Oct 08, 2013 at 10:35 UTC
Again, in the general case, no. There exist texts which are in multiple languages, which may have different syntactic rules. Sometimes the two languages are in separate volumes, or at least separate halves of a volume, but sometimes you'll get the two languages on opposite pages, or in two columns on each page, or even line by line translations. And very occasionally you'll even see line by line translations in more than two languages. I have a book at home that is tri-lingual Greek/Latin/English, for example.
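
A small Python sketch shows why mixed texts resist a single answer. It splits whitespace-delimited text into runs of words that share a rough script label. Both the labelling trick (first word of the Unicode character name) and the names `word_script` and `script_runs` are assumptions for illustration, and the whitespace split would not work for Thai or Lao, as noted at the top of the thread:

```python
import unicodedata
from itertools import groupby


def word_script(word):
    """Rough script label of a word, taken from its first letter's Unicode name."""
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                return name.split()[0]
    return None  # no letters found (digits, punctuation, ...)


def script_runs(text):
    """Group consecutive whitespace-delimited words that share a script label."""
    labelled = [(word_script(w), w) for w in text.split()]
    return [(script, " ".join(w for _, w in group))
            for script, group in groupby(labelled, key=lambda pair: pair[0])]
```

On a Greek/Latin/English line this separates the Greek run, but the Latin and English words land in one `LATIN` run, so telling those two apart still needs a statistical guess per run.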