So how many bytes is that? 19,000 records doesn't seem like such a big number, really (unless each record is like a whole megabyte or something).

If it's a static set of data and you just need a one-shot transform to replace non-ASCII with ASCII, it wouldn't hurt to do a little diagnosis up front to see what you need to cover:

# concatenate all your records together into one data stream
# and pipe it all through this perl command line:

perl -ne 'tr/\x00-\x7f//d; $ch{$_}++ for (split//); END{printf("%x %d\n",ord,$ch{$_}) for (sort keys %ch)}'

# this prints a histogram of non-ascii byte values

Sometimes this sort of diagnosis can reveal some unexpected properties (e.g. mistakes) in the data, especially for stuff that has been manually created in (and extracted from) proprietary file formats.

Example: if 0x93 and 0x94 are supposed to open and close double-quotes, do you get the same quantity of each? If not, maybe some of them mean something else, or maybe some records just happen to have unbalanced quotes (and then you need to decide or be told whether that matters...)
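
If the totals for 0x93 and 0x94 don't match, a per-record check can narrow down where the mismatch is. This is only a rough sketch: the sub names and sample records are made up for illustration, and it assumes each record is already in memory as a raw byte string (not decoded to characters).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count 0x93 (open quote) and 0x94 (close quote) bytes in one record.
# tr/// with an empty replacement list and no /d just counts matches.
sub quote_counts {
    my ($record) = @_;
    my $open  = ($record =~ tr/\x93//);
    my $close = ($record =~ tr/\x94//);
    return ($open, $close);
}

# Return the indices of the records whose open and close counts differ.
sub unbalanced_records {
    my @records = @_;
    my @bad;
    for my $i (0 .. $#records) {
        my ($open, $close) = quote_counts($records[$i]);
        push @bad, $i if $open != $close;
    }
    return @bad;
}

# Hypothetical sample data: only the second record (index 1) is unbalanced.
my @sample = (
    "He said \x93hello\x94 and left",
    "An \x93unclosed quote",
    "no quotes at all",
);
print "unbalanced record: $_\n" for unbalanced_records(@sample);
```

With the sample data this reports index 1 only. On the real data you'd feed in the actual records, then eyeball the flagged ones to decide whether they're typos or bytes that mean something else.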


In reply to Re: Mass regsub on High-bit chars. by graff
in thread Mass regsub on High-bit chars. by abaxaba
