I see. I'll just add my 2 cents for your perusal.

Batch processing with loaded data is almost always better in the database simply because it is made for it. It ends up being faster because systems are not switched, the database can do things in batches (as opposed to line by line) and character issues are unlikely to be coming up. So, before you even start, you're already ahead.

I'm relatively sure you can do it a few statements too, if done correctly. I have done things like this before, and on one job, i created a batch table to make record (and time) the different steps, allowing multiple batches to be done at the same time (because the steps do not have anything to do with the batches, per se).

If there are common misspellings, i would likely use a translation table to change misspellings to the used spelling, or move a country to the next step. A small query would do that (perhaps as part of a larger query). Anything not found would be flagged for human perusal, which would happen less and less with each run. Having the translation table in the database is likely better for managing it anyway, and you can even keep fun numbers like how often each shows up.

To remove spaces and new lines, postgres has replace functions, including regex. That's one very simple query.

In my experience, the sql code is smaller, not larger, if written correctly. However, if you already have it written in perl, there's no reason to reinvent the wheel, unless a large problem comes up, or you want/need processing to take (what might be) a fraction of the time.


In reply to Re^5: Matching alphabetic diacritics with Perl and Postgresql by chacham
in thread Matching alphabetic diacritics with Perl and Postgresql by anonymized user 468275

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.