Regex for MS Word Special Characters

omg_wtf_lol has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Regex for MS Word Special Characters by wfsp (Abbot) on Apr 21, 2008 at 16:06 UTC
If you want to convert cp1252 to utf8, you could use Encode as described by graff.
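A minimal sketch of that conversion, assuming the text arrives as raw cp1252 bytes. The sample string and variable names are made up for illustration; only `decode` and `encode` come from the Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: cp1252 bytes for a word in "smart" double quotes
# (0x93 and 0x94 are the left and right double quotes in cp1252).
my $cp1252_bytes = "\x93quoted\x94";

# Decode the bytes into Perl characters (Unicode code points) ...
my $text = decode('cp1252', $cp1252_bytes);

# ... then encode those characters as UTF-8 bytes for output or storage.
my $utf8_bytes = encode('UTF-8', $text);

printf "%d characters, %d UTF-8 bytes\n", length($text), length($utf8_bytes);
```

Each smart quote is a single byte in cp1252 but three bytes in UTF-8, which is why the byte count grows after the conversion.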
Re^2: Regex for MS Word Special Characters by omg_wtf_lol (Initiate) on Apr 21, 2008 at 19:38 UTC
Thanks, I will look into that. I recently found a crude sort of workaround by using the hex code equivalent in a regular expression. However, a 4-digit hex code will not work.
Re: Regex for MS Word Special Characters by pc88mxer (Vicar) on Apr 21, 2008 at 17:20 UTC
I think you are better off finding a way to store and retrieve Unicode with your database. You could use a specific code page, but this is very limiting and non-standard. Moreover, it is something that every other application which uses the database will have to know about.
Re^2: Regex for MS Word Special Characters by Anonymous Monk on Apr 21, 2008 at 19:31 UTC
Unfortunately, that is not an option, as it has already been in production for over five years.
Re^3: Regex for MS Word Special Characters by pc88mxer (Vicar) on Apr 21, 2008 at 23:09 UTC
I guess one solution would be to remove all characters that cannot be represented in Latin-1. See this recent thread on HTML entities converted to Non-Latin-1 format.... Otherwise, let's assume that you can only store 8-bit character data in your database, and that you are currently only storing ASCII data (i.e. characters from 0-127). Then you could do something along these lines: when passing data to the database, use `encode` to encode the data to utf8; when reading data from the database, use `decode` to decode to Unicode code points. Some example code:

```perl
use Encode;

# Instead of:
$sth->execute(@data);
# use:
$sth->execute(map { Encode::encode('utf8', $_) } @data);

# and in place of:
my @row = $sth->fetchrow;
# use:
my @row = map { Encode::decode('utf8', $_) } $sth->fetchrow;
```

Unfortunately, one really thorny issue is that there are just too many ways to get data out of a database using DBI, i.e. `fetchrow_*`, `select_*`, etc. If there were a way to install a data transformation filter in DBI, then that might be a reasonable approach. Perhaps someone else knows if this is possible.
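The encode-on-write, decode-on-read idea can be tried out without a live database. The sketch below is only an illustration under stated assumptions: `to_db` and `from_db` are hypothetical helper names (not part of DBI), and a plain array stands in for a column that can hold only 8-bit data.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helpers that centralize the transformation, so each
# execute/fetch call site needs only one extra function call.
# undef (SQL NULL) is passed through untouched.
sub to_db   { return map { defined $_ ? encode('utf8', $_) : undef } @_ }
sub from_db { return map { defined $_ ? decode('utf8', $_) : undef } @_ }

# Sample values: pure ASCII, a string with smart quotes, and a NULL.
my @data = ("plain ASCII", "\x{201C}smart\x{201D}", undef);

# Stand-in for the database: what would be written to the 8-bit column.
my @stored = to_db(@data);

# Stand-in for a fetch: decode the stored bytes back into characters.
my @row = from_db(@stored);

# ASCII is a subset of UTF-8, so existing ASCII rows are unaffected.
print "ASCII unchanged: ", ($stored[0] eq $data[0] ? "yes" : "no"), "\n";
print "Round trip ok:   ", ($row[1] eq $data[1] ? "yes" : "no"), "\n";
```

With DBI this would read `$sth->execute(to_db(@data))` and `my @row = from_db($sth->fetchrow_array)`. It does not solve the problem of the many other fetch methods, but it keeps the encoding choice in one place.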
Re: Regex for MS Word Special Characters by mscharrer (Hermit) on Apr 21, 2008 at 16:04 UTC
Could you define exactly what "Microsoft Word's special characters" are, e.g. by listing them? You can match a list of characters with a character class `[ ]`, so if A, T, Y and I were special characters (which of course they aren't, this is just an example), then you could remove them from a string with the following simple regex: `$string =~ s/[ATYI]//g;` If you have the special characters as octal codes, you could write it like this: `$string =~ s/[\123\124\145]//g;` (The numbers here are just random example numbers.) Please tell me if I missed something or misunderstood you.
Re^2: Regex for MS Word Special Characters by omg_wtf_lol (Initiate) on Apr 21, 2008 at 19:43 UTC
Yes, I believe that you did misunderstand my question a bit, as my question is basically the same as yours. I know that I can use a regular expression just as you mentioned. My problem is using the correct representation for the MS Word characters such as smart quotes. For now I am trying to use hex codes to substitute them, but it is not working when I use a 4-digit hex code.
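One possible cause, offered as a guess since the failing code is not shown: in a Perl regex, `\x` without braces consumes at most two hex digits, so `\x201C` means a space (`\x20`) followed by the literal characters `1C`. Code points above 0xFF need the braced form `\x{201C}`, and the string has to hold decoded characters rather than raw bytes. A sketch, assuming cp1252 input and a made-up sample string:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: cp1252 bytes as they might arrive from a Word paste.
my $bytes = "\x93It\x92s here\x94 \x96 done\x85";

# Decode first, so each smart quote becomes one character
# with a 4-digit code point.
my $string = decode('cp1252', $bytes);

# The braced form \x{....} is required for code points above 0xFF.
$string =~ s/[\x{2018}\x{2019}]/'/g;    # left/right single quotes
$string =~ s/[\x{201C}\x{201D}]/"/g;    # left/right double quotes
$string =~ s/[\x{2013}\x{2014}]/-/g;    # en and em dashes
$string =~ s/\x{2026}/.../g;            # horizontal ellipsis

print "$string\n";
```

If the data stays as raw cp1252 bytes instead, the same characters are single bytes, so the two-digit forms (`\x91` to `\x94` for the quotes) are the ones that match.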
Re: Regex for MS Word Special Characters by GrandFather (Saint) on Apr 21, 2008 at 20:40 UTC
Put together a small code sample derived from the code you are having trouble with and some sample text. Show us what you get after running the sample text through the sample code and what you expected to get. We may then be able to help.
Perl is environmentally friendly - it saves trees