I've got a potato and I want to turn it into a tomato. This should be possible given that "tomato" looks so much like "potato". Please advise how to proceed.

If you can correct this mess by hand, it may be possible to go through all the possible variations (in millions of records!) and develop heuristics that let you define a set of regexes for a series of substitutions, or perhaps a set of parsing rules. The critical problem, and I would think the chief sink of time and effort in this quixotic quest, will be developing a robust unit testing framework that lets you prove you really can spin straw into gold.
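To make the regex-heuristic idea concrete, here is a minimal sketch in Python (not the OP's code; the pattern, the function names `requote` and `parses`, and both sample records are made up for illustration). It applies one plausible substitution rule and shows how quickly such a rule runs out of road:

```python
import json
import re

# Hypothetical heuristic: treat every '...' run as a string and re-quote it
# with double quotes. This is an assumption, not a rule from the thread.
SINGLE_QUOTED = re.compile(r"'([^']*)'")

def requote(record: str) -> str:
    """Replace each single-quoted run with a double-quoted one."""
    return SINGLE_QUOTED.sub(r'"\1"', record)

def parses(record: str) -> bool:
    """Return True if the record is valid JSON."""
    try:
        json.loads(record)
    except json.JSONDecodeError:
        return False
    return True

easy = "{'firstName' : 'Peter'}"    # made-up well-behaved record
hard = "{'lastName' : 'O'Toole'}"   # made-up record with an embedded apostrophe

print(requote(easy), parses(requote(easy)))
print(requote(hard), parses(requote(hard)))
```

The first record comes out as valid JSON. The second does not: the rule cannot tell the apostrophe in the name from a closing quote, so every heuristic of this kind needs another heuristic to patch it.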

Your best bet: Whoever's sending you this junk, tell them you know where they live and they'd better start sending you valid data or else!

Please forgive the snarky tone of this post. I just want to make the point that there are some jobs best left undone, and even un-begun!

BTW: Does the final part of your example data

{'firstNameC' : 'Peter', 'lastName' : 'O'Toole', 'text' : "More text with diacritics' ]}
actually represent something you would see in your real-world data or is it just a cut/paste typo? If it's real, good luck!

Update: WRT the unit testing framework: Remember that you must cover both false-negative cases (records that need to be fixed and are missed) and false-positive cases (records that get "fixed" even though they were just fine to begin with, thus screwing them up). Remember also that if your fixer-upper script is 99.9% effective, you will still have, out of millions of records, thousands that are missed and still need fixing, perhaps by hand. Also: please ponder the notions "tarbaby", "quagmire" and "death march".
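A rough sketch of what such a test harness might count, again in Python and again under loud assumptions: `naive_fixer`, `evaluate`, `expected_leftover`, the three test cases and the 5,000,000 record count are all hypothetical, chosen only to illustrate the two failure modes and the arithmetic.

```python
def naive_fixer(record: str) -> str:
    """Hypothetical fixer: blindly turn every single quote into a double quote."""
    return record.replace("'", '"')

# Each case is (raw input, what a correct fixer should produce).
CASES = [
    ("{'a': 'b'}", '{"a": "b"}'),                    # needs fixing, easy
    ('{"name": "O\'Toole"}', '{"name": "O\'Toole"}'),  # already valid, leave alone
    ("{'name': 'O'Toole'}", '{"name": "O\'Toole"}'),   # needs fixing, hard
]

def evaluate(fixer, cases):
    """Count false negatives and false positives for a fixer."""
    false_neg = false_pos = 0
    for raw, expected in cases:
        result = fixer(raw)
        if raw == expected and result != raw:
            false_pos += 1   # record was fine, but the fixer changed it
        elif raw != expected and result != expected:
            false_neg += 1   # record needed fixing and did not come out right
    return false_neg, false_pos

def expected_leftover(total_records: int, success_rate: float) -> int:
    """Records still broken after a fixer with the given success rate."""
    return round(total_records * (1 - success_rate))

print(evaluate(naive_fixer, CASES))        # (false negatives, false positives)
print(expected_leftover(5_000_000, 0.999)) # assumed record count
```

With these three cases the naive fixer scores one false negative and one false positive, and at an assumed 5,000,000 records a 99.9% success rate still leaves 5000 records to chase down.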


Give a man a fish:  <%-{-{-{-<


In reply to Re: Escaping quotes in JSON string by AnomalousMonk
in thread Escaping quotes in JSON string by HeadScratcher
