in reply to Matching alphabetic diacritics with Perl and Postgresql

Hi, I'm not a Pg user but:
What's the character encoding used for the DB table? Does it match what the client_encoding variable is set to? Is your CSV data read in as UTF-8? It sounds like part of your system is not set up to handle high unicode characters (which is what I think you mean by "alphabetic diacritics").

Also, why are you checking for dupes in the Perl code? The database should handle that, with a clause like "if not exists" or something like that (I don't know if Pg, like MySQL, offers "insert ... on duplicate key update ..." syntax).


The way forward always starts with a minimal test.

Re^2: Matching alphabetic diacritics with Perl and Postgresql
by anonymized user 468275 (Curate) on Jun 04, 2017 at 11:29 UTC
    And you are the winner! open my $fh, "<encode(UTF8)", $csvFile fixed it so that the queries now work. The owners of the original data were using UTF8 to put apostrophes in their database or perhaps to write them in the CSV file. Writing them to my own database as ASCII was OK, but subsequently RSE's would only work if they are also constructed using UTF8. So provided Perl knows it's UTF8 from the outset, DBI constructs the queries correctly.

    One world, one people

      Two points.

      • That suggestion is not what you meant. The correct syntax includes a colon and is spelled differently: open my $fh, "<:encoding(utf-8)", $csvFile
      • Use a CSV parser that handles UTF-8, like Text::CSV_XS: my $aoh = csv (in => "file.csv", encoding => "utf-8");
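A minimal sketch of why the layer matters, assuming the troublesome apostrophe is U+2019 (RIGHT SINGLE QUOTATION MARK); the thread does not say which character it actually is. The temporary file, the sample line and the helper read_first_line are made up for illustration. The same file is read once as raw bytes and once through :encoding(UTF-8):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small sample file containing U+2019 encoded as UTF-8.
# File name and contents are illustrative only (assumption, not the real data).
my ($out, $csvFile) = tempfile(UNLINK => 1);
binmode $out, ":encoding(UTF-8)";
print {$out} qq{"1-6 Chapman\x{2019}s End","Management"\n};
close $out or die "close: $!";

# Read the first line through the given PerlIO layer, minus its line ending.
sub read_first_line {
    my ($path, $layer) = @_;
    open my $fh, "<$layer", $path or die "Cannot open $path: $!";
    my $line = <$fh>;
    close $fh;
    $line =~ s/\r?\n\z//;
    return $line;
}

my $raw     = read_first_line($csvFile, ":raw");              # undecoded bytes
my $decoded = read_first_line($csvFile, ":encoding(UTF-8)");  # decoded characters

# U+2019 occupies three bytes in UTF-8 but is one character once decoded,
# so the raw string is two units longer than the decoded one.
printf "raw length:     %d\n", length $raw;
printf "decoded length: %d\n", length $decoded;
print  "decoded line contains U+2019\n" if $decoded =~ /\x{2019}/;
```

Only the decoded string contains the single character U+2019, so only it can match a pattern or query value written in terms of that character.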

      Enjoy, Have FUN! H.Merijn
        You are right: I coded it correctly in the .pl, but not in the post here (I mis-typed it from memory). Re Text::CSV, that's what I did at first, but I switched to a plain open and read so that I could debug my draft code with clarity. There is no reason to switch back to Text::CSV, given that the CSV file is predictable enough to remove the first and last characters and then split /\"\,\s*\"/. You could argue that this is a "not invented here" approach, but I am even more loth to use CPAN sledgehammers to crack tiny little nuts where a few characters of code are all that is needed to avoid loading a module. Think: performance! In some cases it is less obvious whether to use the CPAN module, but this one seems clear enough, although I will move it to a utility module where it can be readily replaced with a use of Text::CSV if circumstances change.
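The hand-rolled split described above might look roughly like this. It is a sketch under the stated assumption that every field is double-quoted and that no field contains an embedded quote, a quote-comma-quote sequence or a newline. The helper name split_quoted_line and the sample line are hypothetical:

```perl
use strict;
use warnings;

# Split one line of a predictable, fully quoted CSV file into its fields.
# Not a general CSV parser: embedded quotes or newlines will break it.
sub split_quoted_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;            # drop the line ending
    return () if length($line) < 2;  # nothing between the outer quotes
    $line = substr $line, 1, -1;     # remove first and last characters (outer quotes)
    return split /\"\,\s*\"/, $line, -1;   # -1 keeps trailing empty fields
}

# Made-up sample line, with and without a space after the comma.
my @fields = split_quoted_line(qq{"1-6 Chapman's End", "Management","UK"\n});
print join(" | ", @fields), "\n";
```

Keeping this in one small sub is what makes it easy to swap for Text::CSV later if the input stops being so regular.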

        One world, one people

Re^2: Matching alphabetic diacritics with Perl and Postgresql
by anonymized user 468275 (Curate) on Jun 03, 2017 at 19:48 UTC
    Update: the data arrives correctly in Postgres as '1-6 Chapman's End Management. So now it's a Perl-only problem, with the encoding probably at fault as you suggest. But I can't go changing to "where not exists" until I fix it, of course.

    One world, one people