johnnywang has asked for the wisdom of the Perl Monks concerning the following question:

Hi, how does one recognize the character encoding of a webpage? It's OK if the server follows any of the usual standards for declaring it. If the server does not use any of these, is there a way to recognize the encoding from the stream? Is there a Perl module for that? Thanks.
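On the module question: Encode::Guess (shipped with Perl 5.8's Encode distribution) makes a best-effort guess. A minimal sketch; with no extra suspect encodings given it only distinguishes US-ASCII, UTF-8, and BOM-marked Unicode, and the sample bytes here are my own:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode::Guess;   # part of the core Encode distribution since Perl 5.8

my $data = "caf\xC3\xA9";   # bytes that happen to be well-formed UTF-8

# guess_encoding() returns an Encode::Encoding object on success,
# or a plain error string when it cannot decide.
my $enc = guess_encoding($data);
if (ref $enc) {
    print "guessed: ", $enc->name, "\n";
} else {
    print "could not guess: $enc\n";
}
```

Note that single-byte encodings such as the ISO-8859 family are poor candidates for this kind of guessing, since almost any byte stream is "valid" in them.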

Replies are listed 'Best First'.
Re: Language recognition of web pages
by gaal (Parson) on Aug 27, 2004 at 05:10 UTC

    First, note that language identification is different from charset identification: several languages can often be expressed with the same charset. There is no single Right Thing to do here, in part because different eight-bit encodings assign overlapping byte values to different characters.

    If you want to try a heuristic, you can pick a language identification scoring function, then feed in the data and choose the best match. If the input might also be in UTF-8 (or other encodings), you have to feed it in more than once. You do have a shortcut here: some streams are not valid UTF-8, and you can tell that pretty quickly.
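    That shortcut can be sketched with Encode from the core distribution; the helper name and sample bytes below are my own, not a standard API:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Returns true if the byte string is well-formed UTF-8.
# FB_CROAK makes decode() die on the first malformed sequence,
# so we wrap the call in eval and work on a copy of the input.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;
    return eval { decode('UTF-8', $copy, FB_CROAK); 1 } ? 1 : 0;
}

print is_valid_utf8("caf\xC3\xA9") ? "valid\n" : "invalid\n";  # UTF-8 e-acute
print is_valid_utf8("caf\xE9")     ? "valid\n" : "invalid\n";  # Latin-1 e-acute, malformed as UTF-8
```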

    One way to score a language match is to look at bigram frequencies. Another is to use compression analysis.
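    A toy sketch of bigram-frequency scoring, with made-up training sentences standing in for real corpora (a serious version would use smoothed log-probabilities and far more training text):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a relative-frequency table of character bigrams from sample text.
sub bigram_freqs {
    my ($text) = @_;
    my (%count, $total);
    for my $i (0 .. length($text) - 2) {
        $count{ lc substr($text, $i, 2) }++;
        $total++;
    }
    $count{$_} /= $total for keys %count;
    return \%count;
}

# Score input against a model by summing the model's frequencies for
# each bigram in the input; higher means a better match.
sub score {
    my ($text, $model) = @_;
    my $score = 0;
    for my $i (0 .. length($text) - 2) {
        $score += $model->{ lc substr($text, $i, 2) } // 0;
    }
    return $score;
}

my %models = (
    en => bigram_freqs("the quick brown fox jumps over the lazy dog and then the end"),
    de => bigram_freqs("der schnelle braune fuchs springt ueber den faulen hund und dann"),
);

my $input = "the other thing";
my ($best) = sort { score($input, $models{$b}) <=> score($input, $models{$a}) }
             keys %models;
print "best match: $best\n";
```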

Re: Language recognition of web pages
by hardburn (Abbot) on Aug 26, 2004 at 20:58 UTC

    Not really. The typical solution is to assume it's Latin-1 and hope for the best.

    "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

Re: Language recognition of web pages
by iburrell (Chaplain) on Aug 27, 2004 at 17:08 UTC
    It is easy to determine the default charset when it isn't marked, because the defaults are specified in the standards. For MIME text content types, the default charset is US-ASCII (RFC 2046). For HTTP, the default is ISO-8859-1 (RFC 2616).
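    Pulling the declared charset out of a Content-Type header, with the HTTP default as the fallback, might look like this. The helper name, regex, and fallback policy are illustrative, not a standard API:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extract the charset parameter from a Content-Type header value,
# falling back to the HTTP default (ISO-8859-1 per RFC 2616).
sub charset_from_content_type {
    my ($content_type) = @_;
    if (defined $content_type
        && $content_type =~ /;\s*charset\s*=\s*"?([^";\s]+)"?/i) {
        return $1;
    }
    return 'ISO-8859-1';
}

print charset_from_content_type('text/html; charset=Shift_JIS'), "\n";  # Shift_JIS
print charset_from_content_type('text/html'), "\n";                     # ISO-8859-1
```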

    However, most browsers will interpret the charset based on local settings. I can set Firefox to use any encoding as the default and read unmarked Shift_JIS files if I want.