in reply to Re: Regex for MS Word Special Characters
in thread Regex for MS Word Special Characters

Unfortunately, that is not an option, as it has already been in production for over five years.

Re^3: Regex for MS Word Special Characters
by pc88mxer (Vicar) on Apr 21, 2008 at 23:09 UTC
    I guess one solution would be to remove all characters that cannot be represented in Latin-1. See this recent thread on HTML entities converted to Non-Latin-1 format....

    Otherwise, let's assume then that you can only store 8-bit character data in your database, and that you are currently only storing ASCII data (i.e. characters from 0-127). Then you could do something along these lines:

    1. When passing data to the database, use Encode::encode to encode it as UTF-8 octets.
    2. When reading data from the database, use Encode::decode to decode the octets back into Unicode code points.
    Some example code:
    # Instead of:
    $sth->execute(@data);
    # use:
    use Encode;
    $sth->execute(map { Encode::encode('utf8', $_) } @data);

    # and in place of:
    my @row = $sth->fetchrow;
    # use:
    my @row = map { Encode::decode('utf8', $_) } $sth->fetchrow;
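To see why this fits an 8-bit column, here is a minimal round-trip sketch that runs without a database. The sample string is an assumption chosen for illustration (Word-style curly quotes, an en dash, and an accented letter); only the Encode calls come from the approach described above.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample: Word-style "smart" punctuation plus an accented letter.
my $text = "\x{201C}quoted\x{201D} \x{2013} caf\x{E9}";

# Step 1: encode to UTF-8 octets before handing the value to the database.
my $octets = encode('utf8', $text);

# After encoding, every element of the string is in the range 0-255,
# so it can be stored as plain 8-bit data.
my $all_8bit = !grep { ord($_) > 255 } split //, $octets;

# Step 2: decode the octets back into Perl characters after fetching.
my $round_trip = decode('utf8', $octets);

printf "chars: %d, octets: %d, 8-bit only: %s, round trip ok: %s\n",
    length($text), length($octets),
    ($all_8bit ? 'yes' : 'no'),
    ($round_trip eq $text ? 'yes' : 'no');
```

The octet count is larger than the character count because each non-ASCII character becomes two or three bytes, which matters if the column has a fixed maximum width.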
    Unfortunately, one really thorny issue is that there are just too many ways to get data out of a database using DBI, i.e. fetchrow_*, select*_*, etc. If there were a way to install a data transformation filter in DBI, then that might be a reasonable approach. Perhaps someone else knows if this is possible.
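Short of a built-in filter, one possible workaround is to funnel the calls you actually use through a pair of small helpers, so the encode/decode step lives in one place. This is only a sketch: execute_encoded and fetchrow_decoded are hypothetical names, not part of DBI, and MockSth is a stand-in for a real statement handle so that the example runs without a database. It also passes undef (NULL) through untouched, which the inline map above does not do.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper (not part of DBI): encode bind values, leave NULLs alone.
sub execute_encoded {
    my ($sth, @data) = @_;
    return $sth->execute(map { defined $_ ? encode('utf8', $_) : undef } @data);
}

# Hypothetical helper (not part of DBI): decode one fetched row.
sub fetchrow_decoded {
    my ($sth) = @_;
    return map { defined $_ ? decode('utf8', $_) : undef } $sth->fetchrow_array;
}

# Stand-in for a real statement handle, only so the sketch is runnable.
{
    package MockSth;
    sub new            { return bless { rows => [] }, shift }
    sub execute        { my $self = shift; push @{ $self->{rows} }, [@_]; return 1 }
    sub fetchrow_array { my $self = shift; return @{ shift(@{ $self->{rows} }) || [] } }
}

my $sth = MockSth->new;
execute_encoded($sth, "na\x{EF}ve \x{2026}", undef);

my @stored = @{ $sth->{rows}[0] };   # what the "database" holds: UTF-8 octets
my @row    = fetchrow_decoded($sth); # what the application sees: characters

printf "stored %d octets, fetched %d characters\n",
    length($stored[0]), length($row[0]);
```

Each additional fetch style in use (fetchrow_hashref, selectall_arrayref, and so on) would need its own wrapper, which is exactly the drawback described above.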