A while back I wrote a script that solved a problem, and a number of colleagues have found it useful, so I thought I'd turn it into a module and release it to CPAN. Before I do, you may be able to help me with the following questions:

Does something like this already exist on CPAN?
If not, what should I call it?

There are many use cases for the script, but describing the original problem will give you some context ...

The Problem

I had a Postgres database that contained plain ASCII data. I needed to convert the database to a Unicode encoding to support accented characters outside of the basic Latin-1 set. The process for converting the encoding of a Postgres database is:

1. write the DB out to a dump file
2. use a utility like iconv to convert the encoding of the dump file
3. create a new (empty) database, specifying the new encoding
4. restore the transcoded dump file into the new database

In my case, step 2 was not necessary since I was converting from ASCII to UTF-8 and ASCII is a subset of UTF-8. So it really just boiled down to a dump and a restore into a database that had been created with the UTF8 encoding.

This is where my problems started. It turns out that the database did not just contain plain ASCII data. The Postgres 'SQL_ASCII' encoding basically just means "take whatever bytes are given and store them in the DB", and apparently our application had been giving the database an interesting selection of bytes over the years. Originally our web frontend used the Apache default encoding of ISO-8859-1; later we fixed that so that it used UTF-8. So in the early data, accented characters mostly arrived encoded as ISO-8859-1 bytes, but the input often included characters from Windows machines using 'win-latin-1' or CP1252 (especially the so-called 'smart quote' characters and em-dashes). After we fixed the web server config, the non-ASCII data came in as UTF-8 byte sequences.

So it turns out I did need step 2, only I couldn't use iconv: it converts from one encoding on the input side to one encoding on the output side, and I had two or three encodings in my data dump.
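To illustrate why a single-source conversion fails on mixed data, here is a small sketch using the core Encode module. The sample string is made up for illustration: it holds the same word twice, once with a Latin-1 encoded e-acute and once with a UTF-8 encoded one.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Made-up sample: "cafe" with e-acute as a Latin-1 byte (E9),
# then again with e-acute as a UTF-8 byte pair (C3 A9).
my $mixed = "caf\xE9 caf\xC3\xA9";

# Treating everything as Latin-1 turns the UTF-8 pair C3 A9
# into two separate characters, U+00C3 and U+00A9.
my $as_latin1 = decode('ISO-8859-1', $mixed);

# Treating everything as UTF-8 leaves the lone E9 byte malformed;
# by default Encode substitutes U+FFFD for it.
my $as_utf8 = decode('UTF-8', $mixed);

# Print code points rather than raw characters to keep the output ASCII.
for my $str ($as_latin1, $as_utf8) {
    print join(' ', map { sprintf 'U+%04X', ord } split //, $str), "\n";
}
```

Either way, one of the two spellings comes out damaged, which is the situation a one-encoding-in, one-encoding-out tool cannot avoid.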

The Solution

So I wrote a script called 'fix_latin' which we piped our dump file through. The bytes were examined and filtered as follows:

plain ASCII characters (0x00-0x7F) were passed through untouched
well-formed UTF-8 multi-byte characters were also passed through untouched
any remaining lone bytes (0x80-0xFF) were assumed to be CP1252 (which covers all the printable characters of ISO-8859-1) and were converted to UTF-8
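The three rules above can be sketched roughly as follows. This is not the original script: the function name fix_latin_sketch and the $utf8_char pattern are hypothetical names chosen for illustration, and the sketch returns a Perl character string that the caller would still need to encode as UTF-8 on output.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One well-formed UTF-8 multi-byte character
# (no overlong forms, no surrogates, nothing above U+10FFFF).
my $utf8_char = qr{
      [\xC2-\xDF] [\x80-\xBF]
    | \xE0 [\xA0-\xBF] [\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF] [\x80-\xBF]{2}
    | \xED [\x80-\x9F] [\x80-\xBF]
    | \xF0 [\x90-\xBF] [\x80-\xBF]{2}
    | [\xF1-\xF3] [\x80-\xBF]{3}
    | \xF4 [\x80-\x8F] [\x80-\xBF]{2}
}x;

# Hypothetical sketch: takes a byte string, returns a character string.
sub fix_latin_sketch {
    my ($bytes) = @_;
    my $out = '';
    pos($bytes) = 0;
    while (pos($bytes) < length $bytes) {
        if ($bytes =~ /\G([\x00-\x7F]+)/gc) {
            # Rule 1: a run of plain ASCII passes through untouched.
            $out .= $1;
        }
        elsif ($bytes =~ /\G($utf8_char)/gc) {
            # Rule 2: a well-formed UTF-8 sequence is decoded as UTF-8.
            $out .= decode('UTF-8', "$1");
        }
        else {
            # Rule 3: any other single byte is assumed to be CP1252.
            $bytes =~ /\G(.)/gcs;
            $out .= decode('cp1252', "$1");
        }
    }
    return $out;
}

# Made-up input: CP1252 smart quotes, a Latin-1 e-acute, a UTF-8 e-acute.
my $fixed = fix_latin_sketch("\x93caf\xE9\x94 caf\xC3\xA9");
print join(' ', map { sprintf 'U+%04X', ord } split //, $fixed), "\n";
```

One limitation of this approach: a run of Latin-1 or CP1252 bytes that happens to form a valid UTF-8 sequence will be treated as UTF-8. In ordinary text such runs are uncommon, which is what makes the heuristic workable.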

The script was used in a pipeline somewhat like this:

fix_latin < dump_file | psql -d database

The Short Story

So basically, the 'fix_latin' script is a filter that takes input which may contain any mixture of ASCII, Latin-1 (ISO-8859-1), WinLatin-1 (CP1252) and UTF-8 encodings and produces UTF-8 as output.

The Proposal

So unless someone can point me at something on CPAN which already does this, I plan to rework the script into a module with essentially just one public function, fix_latin, which will take a byte string and return a UTF-8 string.

The distribution will also include a simple command-line filter script which will apply the fix_latin function to each line of input.

My initial thought on naming the module was 'Text::FixLatin'. It's possible that it might be more at home under the 'Encode' namespace (although from a Perl perspective it really 'decodes' bytes into Perl characters). I'm open to suggestions.

In reply to RFC: Text::FixLatin by grantm
