in reply to Mass regsub on High-bit chars.
If it's a static set of data and you just need a one-shot transform to replace non-ASCII with ASCII, it wouldn't hurt to do a little diagnosis up front to see what you need to cover:

    # concatenate all your records together into one data stream
    # and pipe it all through this perl command line:
    perl -ne 'tr/\x00-\x7f//d; $ch{$_}++ for (split//); END{printf("%x %d\n",ord,$ch{$_}) for (sort keys %ch)}'
    # this prints a histogram of non-ascii byte values

Sometimes this sort of diagnosis can reveal some unexpected properties (e.g. mistakes) in the data, especially for stuff that has been manually created in (and extracted from) proprietary file formats.
Example: if 0x93 and 0x94 are supposed to open and close double-quotes, do you get the same quantity of each? If not, maybe some of them mean something else, or maybe some records just happen to have unbalanced quotes (and then you need to decide or be told whether that matters...)
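If the totals for 0x93 and 0x94 don't match, a per-record check can help narrow down where the imbalance comes from. What follows is only a rough sketch: the sub names `quote_balance` and `unbalanced_records` and the sample records are made up for illustration, and it assumes each record is already available as a separate byte string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how often the "open" and "close" bytes occur in one record.
# Defaults are 0x93 and 0x94, the byte values discussed above.
sub quote_balance {
    my ($record, $open, $close) = @_;
    $open  //= "\x93";
    $close //= "\x94";
    my $n_open  = () = $record =~ /\Q$open\E/g;
    my $n_close = () = $record =~ /\Q$close\E/g;
    return ($n_open, $n_close);
}

# Return [record_number, n_open, n_close] for every record whose
# counts differ. Record numbers start at 1.
sub unbalanced_records {
    my @records = @_;
    my @bad;
    my $n = 0;
    for my $record (@records) {
        $n++;
        my ($o, $c) = quote_balance($record);
        push @bad, [$n, $o, $c] if $o != $c;
    }
    return @bad;
}

# Hypothetical sample data, just to show the output format.
my @sample = (
    "\x93balanced\x94",
    "\x93missing close",
    "plain ascii",
);
printf "record %d: %d open, %d close\n", @$_
    for unbalanced_records(@sample);
```

On the sample data this prints `record 2: 1 open, 0 close`. Whether a record flagged this way is a real error or just legitimately unbalanced text is still something you have to decide (or be told).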