Thanks for this. You're right. I am not actually too worried about the data we already have, as it is all latin1, so converting it all to utf-8 is trivial. My concern is the company's increasing tendency to internationalize: it is just a matter of time before we start getting characters that are not valid latin1. I had thought that, until I got my database converted, I might handle this by converting the utf-8 data into something that could be stored in a latin1 database, and converting it back to utf-8 when it is subsequently displayed. That would be a temporary procedure until I finish converting my database, but I suppose it might only make the transition harder, since the encoded data would have to be decoded again before everything could finally be stored as utf-8.
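To make that stopgap concrete, here is a minimal sketch of one way it could work, using the core Encode module and its FB_HTMLCREF fallback, which writes unmappable characters as numeric references such as &#8364;. The sub names to_latin1_storage and from_latin1_storage are made up for illustration, and the choice of numeric references is an assumption, not something settled in this thread.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: turn a Perl character string into bytes that fit
# a latin1 column. Characters outside latin1 become numeric references
# such as &#8364; (assumed storage convention, for illustration only).
sub to_latin1_storage {
    my ($text) = @_;    # copy, so encode() cannot modify the caller's string
    return encode('iso-8859-1', $text, Encode::FB_HTMLCREF());
}

# Hypothetical helper: reverse the step above when reading the column back.
sub from_latin1_storage {
    my ($bytes) = @_;
    my $text = decode('iso-8859-1', $bytes);
    $text =~ s/&#(\d+);/chr($1)/ge;    # turn each reference back into a character
    return $text;
}

my $original = "caf\x{e9} \x{20ac}5 \x{4e2d}\x{6587}";
my $stored   = to_latin1_storage($original);
my $restored = from_latin1_storage($stored);

print "round trip ok\n" if $restored eq $original;
```

The weakness is visible in from_latin1_storage: if someone literally types &#233; into a form, it comes back as an accented character, so the stored data is ambiguous. Every row written this way would also have to be decoded again during the final migration, which is the extra work I was worried about.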

On the one hand, I have to change some of our forms to utf-8 encoding because, to reduce data entry errors, I have to use the locales packages to display countries and smaller administrative units in chained dropdown boxes, and these do not display correctly unless the web page is utf-8. On the other hand, some of the data comes from a feed from another company, and that feed is entirely utf-8. A colleague of mine, dealing with the same feed, dealt with it by noting which utf-8 characters in the feed at a given time were not valid latin1, and using that information to construct a regex to filter out the characters that could not be accommodated in his latin1 database. I find his approach distasteful at best, because it discards data and the user can never see exactly what he had originally entered; and his code grew increasingly ugly as it accumulated dozens of lines applying one regex filter after another. In both cases, the occurrence of a utf-8 character that is not a valid latin1 character causes the SQL that inserts it into the DB to fail. That in turn leads to hours of work to determine precisely what data didn't make it into the DB, and to 'edit' the data so that it can be inserted in some form.
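Rather than discovering the offending characters one failed insert at a time, they could be listed before the insert is attempted. The following is only a sketch: it assumes 'latin1' means ISO-8859-1 (code points U+0000 to U+00FF), that the feed arrives as utf-8 bytes, and the sub name non_latin1_chars and the sample string are invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return the distinct characters in $text that have
# no latin1 (ISO-8859-1) equivalent, in order of first appearance.
sub non_latin1_chars {
    my ($text) = @_;
    my %seen;
    return grep { !$seen{$_}++ } $text =~ /([^\x{00}-\x{FF}])/g;
}

# Made-up feed sample: the utf-8 bytes for "Zürich €10 Łódź".
my $feed_bytes = "Z\xC3\xBCrich \xE2\x82\xAC10 \xC5\x81\xC3\xB3d\xC5\xBA";

# Decode to Perl characters first; the check works on code points, not bytes.
my $text = decode('UTF-8', $feed_bytes);
my @bad  = non_latin1_chars($text);

printf "U+%04X\n", ord($_) for @bad;
# prints U+20AC, U+0141 and U+017A, one per line
```

This only reports the problem; it does not filter anything, so nothing is lost. It would at least replace the hours of detective work after a failed insert with a list of exactly which characters the latin1 database cannot hold.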

I want to do the opposite of what my colleague did, and just convert the whole thing, eventually, to utf-8, so that the db holds, and the app displays, the data exactly as entered.

Now this raises two questions: a) how do I determine the range of acceptable utf-8 characters you speak of (and express that in code), and b) how do I express such a constraint in my documentation, so that integrators who code to my API know to use only characters in the acceptable range? I'd also have to put something on my web forms telling the user not to bother entering characters outside the acceptable range; but how do I do that in a way that is readily understood by most users? The last thing I want is either for my code to die a nasty death because someone entered data I can't handle, or for a user to enter such data and get either no result or gibberish back. It is better that the user knows ahead of time not to bother entering certain sets of characters. I suppose that if the 'filter' you speak of can be used within Data::FormValidator, JavaScript::DataFormValidator could use the same rule to prevent the users of my forms from entering data I can't handle. But I'd still have to document the constraint for the users of my API.
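For question a), the kind of thing I imagine is a single compiled regex that spells out the allowed code points. This is a sketch under a loud assumption: I am guessing that 'acceptable' means printable latin1 plus tab, newline and carriage return, which may not be the range you had in mind. The names $ACCEPTABLE and is_acceptable are invented for illustration.

```perl
use strict;
use warnings;

# ASSUMPTION: acceptable = tab, LF, CR, printable ASCII (U+0020-U+007E)
# and the printable upper half of latin1 (U+00A0-U+00FF).
my $ACCEPTABLE = qr/\A[\x{09}\x{0A}\x{0D}\x{20}-\x{7E}\x{A0}-\x{FF}]*\z/;

# Hypothetical helper: true if every character of $text is in the range.
# Expects a decoded Perl character string, not raw utf-8 bytes.
sub is_acceptable {
    my ($text) = @_;
    return $text =~ $ACCEPTABLE ? 1 : 0;
}

for my $sample ("caf\x{e9}", "\x{20ac}5") {
    printf "%s\n", is_acceptable($sample) ? "accepted" : "rejected";
}
```

If Data::FormValidator accepts a compiled regex as a constraint, as I believe it does (I would need to check the module documentation), the same qr// object could go straight into the profile. The character class is also simple enough to state in API documentation as a list of code point ranges, which may be part of the answer to b).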

Thanks

Ted

In reply to Re^2: How to generate random sequence of UTF-8 characters by ted.byers
in thread How to generate random sequence of UTF-8 characters by ted.byers
