I'm trying to extract data from Excel spreadsheets. I'm pretty sure most of the data in these spreadsheets is encoded as ISO 8859-1. Because this data is going into an XML file, and from there into a MySQL database, I'm trying to do The Right Thing and coerce the data into Unicode as early as possible.
Here's the problem:
Some of the data appears to be in a different encoding. As a specific example, it sometimes contains a "—" (that's an em dash character, hopefully), which gets extracted as ^S (CTRL-S, try typing that in an xterm :) ). To my knowledge, the em dash isn't in ISO 8859-1.
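The claim that the em dash is missing from ISO 8859-1 can be checked with a short sketch. Python is an assumption here, since the post doesn't name a language, and `can_encode` is a hypothetical helper. Windows-1252 is included only as a commonly confused superset of ISO 8859-1, not as a confirmed source encoding for these spreadsheets.

```python
EM_DASH = "\u2014"  # the em dash character discussed above


def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # Compare ISO 8859-1 with Windows-1252 (the latter is an assumption,
    # chosen only because it is often mistaken for ISO 8859-1).
    for enc in ("iso-8859-1", "cp1252"):
        print(enc, "has em dash:", can_encode(EM_DASH, enc))

    # CTRL-S is byte 0x13; both encodings decode it to the same control character.
    print(repr(b"\x13".decode("iso-8859-1")), repr(b"\x13".decode("cp1252")))
```

Note that this only confirms which character sets contain an em dash. It doesn't explain why the extracted value is ^S, which depends on how the extraction tool reads the file.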
At the moment, I'm writing character-specific code that processes each character on a case-by-case basis. For example, I'm converting Excel's rendition of an em dash to "--". Is this the best way of doing it? Should I be decoding from a different character set?
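For concreteness, the case-by-case approach described above might look something like the following sketch. It is written in Python as an assumption (the post doesn't show its code), and `REPLACEMENTS`, `clean`, and `to_unicode` are hypothetical names. The single table entry comes from the example in the post; any further entries would have to be found by inspecting the data.

```python
# Hypothetical per-character table: each troublesome character maps to
# the plain-text replacement chosen for it.
REPLACEMENTS = {
    "\x13": "--",  # ^S, which is how the em dash arrives in the extracted data
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(REPLACEMENTS)


def clean(text):
    """Apply the case-by-case replacements to an already decoded string."""
    return text.translate(_TABLE)


def to_unicode(raw):
    """Decode bytes as ISO 8859-1 (the assumed source encoding), then clean."""
    return clean(raw.decode("iso-8859-1"))


if __name__ == "__main__":
    print(to_unicode(b"caf\xe9 \x13 bar"))
```

Keeping the replacements in one table at least makes each special case visible in a single place, though it still relies on spotting every odd character by hand.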
In reply to What to do when converting Excel-supplied data to Unicode by davis