Final Report

Thanks for the responses, which fell into three groups:

The histogram is a classic recipe. When I ran kennethk's implementation against my big file, I added a printout showing all the character counts as well as the unused characters I'd been looking for. Although pipe occurred 43 times and tilde occurred once, there were in fact three printable ASCII characters that were never used.
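For readers who haven't seen the recipe, here is a minimal sketch of the histogram approach. It is not kennethk's code; the sub names and the in-memory sample are made up for illustration, and a real run would open the big file instead.

```perl
use strict;
use warnings;

# Count every character read from a filehandle, one line at a time.
sub char_histogram {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        $count{$_}++ for split //, $line;
    }
    return \%count;
}

# Printable ASCII characters (space through tilde) that were never counted.
sub unused_printable {
    my ($count) = @_;
    return grep { !$count->{$_} } map { chr($_) } 0x20 .. 0x7E;
}

# Demo on an in-memory file (hypothetical sample data).
my $sample = "a|b|c~\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my $count  = char_histogram($fh);
my @unused = unused_printable($count);
close $fh;

# Print the counts for the non-whitespace characters, then the summary.
printf "%s: %d\n", $_, $count->{$_} for grep { /\S/ } sort keys %$count;
print scalar(@unused), " printable characters never appear\n";
```

Every character of every line gets touched, which is why this approach is slow on a large file.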

The job ended up taking 79 minutes. Having heard that hash lookups are expensive, I was attracted by almut's suggestion to put the histogram in an array instead of a hash. That modification ran in 77 minutes.

Either the hash mechanism isn't that expensive after all, or a hash whose keys are single ASCII characters somehow achieves the same performance as an array.
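The array variant might look roughly like the sketch below. This is my own reconstruction of the idea, not almut's code, and it assumes the input is byte data, so every character code fits in 0..255.

```perl
use strict;
use warnings;

# Same histogram, but indexed by character code in an array, not a hash.
# Assumes byte data: unpack 'C*' yields one unsigned value per byte.
sub char_histogram_array {
    my ($fh) = @_;
    my @count = (0) x 256;
    while (my $line = <$fh>) {
        $count[$_]++ for unpack 'C*', $line;
    }
    return \@count;
}

# Demo on an in-memory file (hypothetical sample data).
my $sample = "a|b|c~\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my $count = char_histogram_array($fh);
close $fh;

# Printable ASCII codes whose slot is still zero.
my @unused = map { chr($_) } grep { !$count->[$_] } 0x20 .. 0x7E;
print scalar(@unused), " printable characters never appear\n";
```

Both versions still do one increment per input character, which may be why swapping the container made so little difference.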

The way to do this job fast is to quit looking at characters that have already been seen. I ran kennethk's correction (using quotemeta) to almut's illustration of how to dynamically generate a character class from a list, and it took only a couple of minutes (I didn't bother to put it in a harness to get an exact timing).
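A rough sketch of that idea follows. It is not the code from the thread; the sub names and sample data are hypothetical. The character class holds only the characters not yet seen, and it is rebuilt (with quotemeta protecting characters such as `]`, `^`, `-` and backslash) each time a new character turns up, so the regex engine skips everything already accounted for.

```perl
use strict;
use warnings;

# Build a character class matching any of the given characters.
# quotemeta escapes the ones that are special inside [...].
sub build_class {
    my $chars = join '', map { quotemeta($_) } @_;
    return qr/[$chars]/;
}

# Return the printable ASCII characters that never occur in the input.
sub unused_printable_regex {
    my ($fh) = @_;
    my %unseen = map { chr($_) => 1 } 0x20 .. 0x7E;
    my $class  = build_class(keys %unseen);
    LINE: while (my $line = <$fh>) {
        # /g resumes from pos($line), even after $class has changed.
        while ($line =~ /($class)/g) {
            delete $unseen{$1};
            last LINE unless %unseen;    # everything seen: stop reading
            $class = build_class(keys %unseen);
        }
    }
    return sort keys %unseen;
}

# Demo: every printable character except three (hypothetical sample).
my $sample = join '', grep { !/[|~^]/ } map { chr($_) } 0x20 .. 0x7E;
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my @unused = unused_printable_regex($fh);
close $fh;

print "never seen: @unused\n";
```

Once the common characters have been seen, most lines contain no match at all, so nearly all the work is a single failed scan per line.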

Thanks, finally, to all who pointed out that the solution to this puzzle has no business value. What I didn't mention was that we're writing a file to be read by Microsoft SQL Server Integration Services (SSIS). So one of the CSV formats is probably the way to go. My own preference had been to just use pack and generate a fixed-width file, but our SSIS developers think reading fixed-width data is too much trouble. I'm planning to spend the rest of the weekend Googling for ways in which SSIS might learn to read a configuration spec and unpack fixed-width data as easily as I know Perl can.
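For anyone wondering what I mean by "as easily as I know Perl can", here is a small sketch. The three-field layout and its widths are made up for illustration; the point is that one template string serves as the whole configuration spec for both writing and reading.

```perl
use strict;
use warnings;

# Hypothetical layout: name (10 chars), quantity (5), date (8).
my $template = 'A10 A5 A8';

# 'A' pads each field with spaces on pack...
my $record = pack $template, 'widget', 42, '20080315';

# ...and strips the trailing spaces again on unpack.
my ($name, $qty, $date) = unpack $template, $record;

print "[$record]\n";
print "$name / $qty / $date\n";
```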


In reply to Re: Find what characters never appear by Narveson
in thread Find what characters never appear by Narveson
