anonymized user 468275 has asked for the wisdom of the Perl Monks concerning the following question:

I have a CSV file which contains a few million lines of business registration data. In PostgreSQL 9.6 I created a table legal_entity with a unique index comprising: entity subtype (a numeric foreign key to a table of subtypes), country of registration (another numeric foreign key to my country table) and the name of the business. The index is there to prevent duplicate insertions of legal entities with the same combination of country of registration, business type and business name.

Then in Perl, I check for the duplicate before insertion thus:

$incname =~ s/\'/\'\'/g;   # change syntax for embedded quotes in Postgres
$sql = sprintf(
      'select count(le_id) '
    . 'from legal_entity '
    . 'where le_corid = 586 '   # this routine is for data from a particular registrar
    . 'and le_lesid = %s '
    . 'and le_inc_name = %s',
    $lesid,
    "'" . $incname . "'"
);
$sth = $dbh->prepare($sql);
$sth->execute;
my ($count_le) = $sth->fetchrow_array;
if ($count_le) {
    print "$. - existing incname\n";
    next LINE;
}
This works for most cases, including where there are apostrophes in the business name. But the first time it goes wrong is when the business name is: "1-6 CHAPMAN’S END MANAGEMENT COMPANY LIMITED" (note this data is from Companies House, so it is the official business name and I can't modify it to make my life easier; basically they will type in any old rubbish when registering a business and it's cast in tablets of stone).

The business name was inserted in an earlier run of the Perl script, but the above select does not match the string in the database. Control therefore passes beyond the above code, the database rejects the duplicate business name, and I have to crash and roll back.

I can't simply handle the database error, because I will need to match such business names when I write more code. So now is the time to figure out how to ensure I can match strings containing alphabetic diacritics. Clearly, Postgres has stored the name differently from how the select matches it when there are alphabetic diacritics in the string (the column is actually character varying of length 256 in Postgres).

Any suggestions of how to match such strings after insertion (when they are exactly what I inserted!) will be most welcome!!!

One world, one people

Replies are listed 'Best First'.
Re: Matching alphabetic diacritics with Perl and Postgresql
by poj (Abbot) on Jun 03, 2017 at 20:43 UTC

    Have you tried using placeholders ?

    my $sql = 'SELECT COUNT(le_id)
                 FROM legal_entity
                WHERE le_corid = ?
                  AND le_lesid = ?
                  AND le_inc_name = ?';
    my $sth = $dbh->prepare($sql);
    $sth->execute(586, $lesid, $incname);
    my ($count_le) = $sth->fetchrow_array;
    poj
Re: Matching alphabetic diacritics with Perl and Postgresql
by 1nickt (Canon) on Jun 03, 2017 at 19:12 UTC

    Hi, I'm not a Pg user but:
    What's the character encoding used for the DB table? Does it match what the client_encoding variable is set to? Is your CSV data read in as UTF-8? It sounds like part of your system is not set up to handle high unicode characters (which is what I think you mean by "alphabetic diacritics").
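
    One quick way to see whether bytes and characters are being confused on the Perl side is to read the same data back with and without an :encoding layer and compare lengths. A self-contained sketch (the file name is made up, and \x{2019} is the curly apostrophe from the problem name):

```perl
use strict;
use warnings;

# Write a name containing a curly apostrophe (U+2019) to a temp file as UTF-8
my $name = "1-6 CHAPMAN\x{2019}S END";
my $file = "pm_demo.csv";
open my $out, '>:encoding(UTF-8)', $file or die $!;
print {$out} $name, "\n";
close $out;

# Read it back WITHOUT an encoding layer: we get raw bytes
# (the apostrophe is three bytes of UTF-8)
open my $raw, '<', $file or die $!;
chomp( my $bytes = <$raw> );
close $raw;

# Read it back WITH the layer: we get characters (the apostrophe is one char)
open my $dec, '<:encoding(UTF-8)', $file or die $!;
chomp( my $chars = <$dec> );
close $dec;

printf "raw length: %d, decoded length: %d\n", length($bytes), length($chars);
# raw length: 19, decoded length: 17 -- an eq comparison of the two fails
unlink $file;
```

    If the lengths differ like this, the string that went into the database and the string in the later query were not the same Perl string, even though they look identical when printed.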

    Also, why are you checking for dupes in the Perl code? The database should handle that, with a clause like "if not exists" or something like that (I don't know if Pg, like MySQL, offers "insert ... on duplicate key update ..." syntax).
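
    For what it's worth, Postgres does offer an equivalent since 9.5 (so the OP's 9.6 qualifies): INSERT ... ON CONFLICT. An untested sketch, using the column names from the post; the conflict target must list the columns of the unique index:

```perl
# Untested sketch: relies on Postgres 9.5+ ON CONFLICT; column names are
# taken from the post, everything else is assumed.
my $sth = $dbh->prepare(q{
    INSERT INTO legal_entity (le_corid, le_lesid, le_inc_name)
    VALUES (?, ?, ?)
    ON CONFLICT (le_corid, le_lesid, le_inc_name) DO NOTHING
});
my $rows = $sth->execute( 586, $lesid, $incname );

# DBI returns "0E0" (true, but numerically zero) when no row was inserted,
# i.e. when the name was already there
print "$. - existing incname\n" if $rows == 0;
```

    This replaces the select-then-insert pair with one atomic statement, so it also closes the race window mentioned elsewhere in the thread.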


    The way forward always starts with a minimal test.
      Update: the data arrives correctly in Postgres as "1-6 Chapman’s End Management…". So now it's a Perl-only problem, with the encoding probably as you suggest. But I can't change over to where not exists until I fix it, of course.

      One world, one people

      And you are the winner! open my $fh, "<encode(UTF8)", $csvFile fixed it so that the queries now work. The owners of the original data were using UTF8 to put apostrophes in their database or perhaps to write them in the CSV file. Writing them to my own database as ASCII was OK, but subsequently RSE's would only work if they are also constructed using UTF8. So provided Perl knows it's UTF8 from the outset, DBI constructs the queries correctly.

      One world, one people

        Two points.

        • That suggestion is not quite what you mean; the correct syntax includes a colon and the spelling is encoding: open my $fh, "<:encoding(utf-8)", $csvFile
        • Use a CSV parser that handles UTF-8, like Text::CSV_XS: my $aoh = csv (in => "file.csv", encoding => "utf-8");

        Enjoy, Have FUN! H.Merijn
Re: Matching alphabetic diacritics with Perl and Postgresql
by chacham (Prior) on Jun 03, 2017 at 19:15 UTC

    Aside from using dynamic SQL and incurring its ills, running a query to check for duplicates when the actual insertion is a separate query allows a duplicate to be inserted between the two statements, unless the table is locked. Hence it is generally recommended to use an atomic statement, that is, insert into ... where not exists (...); this guarantees the record is not there at the time of the insertion itself.

    Also, if you have a CSV file with millions of records, it would likely be best to import the data using something like COPY, and either put it directly into the table you want, or use a staging table with no constraints, and then a simple insert into target(cola, colb, ...) select cola, colb, ... from staging where not exists(select * from target...).
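
    An untested sketch of that flow through DBD::Pg's COPY interface; only legal_entity and its columns come from the post, the staging table and the open $csv_fh handle are assumed:

```perl
# Stage the raw CSV rows in an unconstrained temp table
$dbh->do(q{
    CREATE TEMP TABLE staging (
        le_lesid    integer,
        le_inc_name varchar(256)
    )
});

# Bulk-load via the COPY protocol (DBD::Pg's pg_putcopydata/pg_putcopyend)
$dbh->do(q{COPY staging (le_lesid, le_inc_name) FROM STDIN WITH (FORMAT csv)});
while ( my $line = <$csv_fh> ) {
    $dbh->pg_putcopydata($line);
}
$dbh->pg_putcopyend;

# One atomic, de-duplicating insert into the real table
$dbh->do(q{
    INSERT INTO legal_entity (le_corid, le_lesid, le_inc_name)
    SELECT 586, s.le_lesid, s.le_inc_name
      FROM staging s
     WHERE NOT EXISTS (
           SELECT 1
             FROM legal_entity t
            WHERE t.le_corid    = 586
              AND t.le_lesid    = s.le_lesid
              AND t.le_inc_name = s.le_inc_name )
});
```

    The comparison then happens entirely inside Postgres, in whatever encoding the database uses, so the Perl-side encoding question never arises for the duplicate check.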

      Chicken and egg: I first have to ensure what I insert can be matched before I can allow other users in, at which point yes, I am then in a position to use a where not exists clause. Until then, the where not exists will not help me.

      Regarding COPY - I have constraints from other relations than the one with the issue:

      Location -> lelo <- legal_entity

      lelo_type -> lelo

      and some others that have to be dealt with afterwards. So I have to generate unique location ids and unique le_ids, and join them in lelo, which has unique ids per type, location and entity (at this point the only type of lelo is registered office, although other types will come as the database develops). COPY cannot be used when building related data in several tables simultaneously, needing a single transaction for all the related data, especially where fk constraints are involved.

      One world, one people

        I see you have thought this one out. I would still like to suggest that you add a staging table and copy the data into it before processing, and then do all your processing in sql. This should avoid the issues, as the database itself will be doing the comparisons and, with where not exists, at the same time as the insertion.

        If there are multiple steps, a stored procedure can be used, or a series of stored procedures called from another one, should you want separate entry points.

Re: Matching alphabetic diacritics with Perl and Postgresql
by marinersk (Priest) on Jun 04, 2017 at 10:10 UTC

    What is the column definition in Postgres?

    It's a long shot, but I ask because:

    1. But the first time it goes wrong is when the business name is: "1-6 CHAPMAN’S END MANAGEMENT COMPANY LIMITED"
    2. the data arrives correctly in Postgres as "1-6 Chapman’s End Management…"
    3. Clearly, Postgres has stored it differently from how it matches it with select when there are alphabetic diacritics in the string (actually it's character varying of length 256 in Postgres).

    Which immediately makes me think of VARCHAR vs. NVARCHAR from the SQL Server space. Is there something in Postgres that needs doing to handle Unicode?

Re: Matching alphabetic diacritics with Perl and Postgresql
by erix (Prior) on Jun 04, 2017 at 05:17 UTC

    Add an indexed array column (of type text[]) to contain approved variants?

    Or add an indexed array column (of type text[]) of md5's to contain the md5's of approved variants? ( postgres md5 documentation ).
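
    (If you go the md5 route, note that Postgres's md5() hashes the bytes of its argument, so the Perl side has to hash the UTF-8 encoded string to get the same digest. A minimal sketch with core modules; the name is the one from the post:)

```perl
use strict;
use warnings;
use Encode qw(encode);
use Digest::MD5 qw(md5_hex);

# Postgres's select md5(le_inc_name) hashes the column's bytes (UTF-8 in a
# UTF8 database), so encode to UTF-8 before hashing on the Perl side
my $name   = "1-6 CHAPMAN\x{2019}S END MANAGEMENT COMPANY LIMITED";
my $digest = md5_hex( encode( 'UTF-8', $name ) );
print "$digest\n";   # 32 hex characters, comparable to Postgres's md5()
```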

    (Admittedly, I don't quite get the problem: the code you posted just works for me, with the mangled company name)