| [reply] |
Thanks, I will look into that. I recently found a crude workaround: using the hex code equivalent in a regular expression. However, a 4-digit hex code does not work.
| [reply] |
I think you are better off finding a way to store and retrieve Unicode with your database. You could use a specific code page, but this is very limiting and non-standard. Moreover, it is something that every other application which uses the database will have to know about.
| [reply] |
Unfortunately that is not an option, as it has already been in production for over five years.
| [reply] |
I guess one solution would be to remove all characters that cannot be represented in Latin-1. See this recent thread on HTML entities converted to Non-Latin-1 format.
Otherwise, let's assume then that you can only store 8-bit character data in your database, and that you are currently only storing ASCII data (i.e. characters from 0-127). Then you could do something along these lines:
- When passing data to the database, use encode to encode the data to utf8.
- When reading data from the database, use decode to decode to Unicode code-points.
Some example code:
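# Instead of:
$sth->execute(@data);
# use (encode each value to UTF-8 bytes before it goes to the database):
use Encode;
$sth->execute(map { Encode::encode('utf8', $_) } @data);
# and in place of:
my @row = $sth->fetchrow_array;
# use (decode the UTF-8 bytes back into Perl characters after fetching):
my @row = map { Encode::decode('utf8', $_) } $sth->fetchrow_array;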
Unfortunately, one really thorny issue is that there are just too many ways to get data out of a database using DBI, i.e. fetchrow_*, select*_*, etc. If there were a way to install a data transformation filter in DBI, then that might be a reasonable approach. Perhaps someone else knows if this is possible.
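To show what the encode/decode round trip does to the data, here is a minimal sketch that runs without a database. The helpers to_db and from_db are hypothetical names of my own, not part of DBI, and the sample strings are made up; the only real calls are encode and decode from the Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helpers (not part of DBI): turn Perl characters into
# UTF-8 bytes before storing, and back into characters after fetching.
# undef (SQL NULL) is passed through unchanged.
sub to_db   { map { defined $_ ? encode('UTF-8', $_) : undef } @_ }
sub from_db { map { defined $_ ? decode('UTF-8', $_) : undef } @_ }

# Sample data: an accented letter, a pair of smart quotes, and a NULL.
my @data   = ("caf\x{e9}", "\x{201C}quoted\x{201D}", undef);
my @stored = to_db(@data);      # what you would pass to $sth->execute
my @row    = from_db(@stored);  # what you would do with a fetched row

printf "%d characters became %d bytes\n",
    length($data[1]), length($stored[1]);
print "round trip ok\n" if $row[1] eq $data[1];
```

The stored values contain only 8-bit bytes, so they should fit in a column that cannot hold wide characters, but note that the byte length grows, which matters if your columns have tight length limits.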
| [reply] [d/l] [select] |
Could you define exactly what "Microsoft Word's special characters" are, e.g. by listing them?
You can match a set of characters with a bracketed character class [ ]. So if A, T, Y and I were the special characters (which of course they aren't; this is just an example), you could remove them from a string with the following simple regex:
$string =~ s/[ATYI]//g;
If you have the special characters as octal codes, you could write it like this:
$string =~ s/[\123\124\145]//g;
(The numbers here are just random example numbers).
Please tell me if I missed something or misunderstood you.
| [reply] [d/l] [select] |
Yes, I believe that you did misunderstand my question a bit, as my question is basically the same as yours. I know that I can use a regular expression just as you mentioned. My problem is finding the correct representation for the MS Word characters such as smart quotes. For now I am trying to substitute them using hex codes in a regular expression, but it does not work when I use a 4-digit hex code.
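One likely cause of the 4-digit problem: in a Perl regex, \x without braces reads at most two hex digits, so \x2019 means the character \x20 followed by the literal text "19". The braced form \x{2019} takes the full code point. A minimal sketch, assuming the string already holds decoded characters rather than raw bytes; the sample text and the choice of ASCII replacements are my own:

```perl
use strict;
use warnings;

# Sample text containing Word-style punctuation, written as code points.
# Assumes $string holds decoded characters, not raw bytes.
my $string = "\x{201C}Hello\x{201D} it\x{2019}s \x{2013} fine\x{2026}";

$string =~ s/[\x{2018}\x{2019}]/'/g;   # single smart quotes -> '
$string =~ s/[\x{201C}\x{201D}]/"/g;   # double smart quotes -> "
$string =~ s/[\x{2013}\x{2014}]/-/g;   # en and em dashes    -> -
$string =~ s/\x{2026}/.../g;           # ellipsis            -> three dots

print "$string\n";   # prints: "Hello" it's - fine...
```

If the data is still raw Windows-1252 bytes, the same characters are single bytes (\x91 to \x94 for the quotes), which would explain why the 2-digit codes appeared to work.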
| [reply] |
Put together a small code sample derived from the code you are having trouble with and some sample text. Show us what you get after running the sample text through the sample code and what you expected to get. We may then be able to help.
Perl is environmentally friendly - it saves trees
| [reply] |