[OT] SQL "on duplicate key" custom action

by haukex (Archbishop)
on Oct 21, 2019 at 07:18 UTC

haukex has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

My SQL-fu is too weak for this one and Google hasn't helped yet. I'm inserting data records into a table, and there will be lots of duplicates: if the primary key exists, it should silently ignore the record if all the other columns match, but throw an error if they don't. Basically, of the following, the second should be ignored and the third should throw an error. I feel like there should be an easy solution but I'm not seeing it yet.

INSERT INTO records (PriKey, Foo, Bar) VALUES (1, 'Hello', 'World');
INSERT INTO records (PriKey, Foo, Bar) VALUES (1, 'Hello', 'World');
INSERT INTO records (PriKey, Foo, Bar) VALUES (1, 'Hello', 'Quz');

Also, related: Since I can choose the DB, what is your opinion on MySQL vs. Postgres? I'm thinking I'll probably use the latter but would like to hear any opinions.

Re: [OT] SQL "on duplicate key" custom action
by Corion (Patriarch) on Oct 21, 2019 at 07:36 UTC

    I think the "best" approach (if it is supported by your DBMS) is ON UPDATE triggers, or, if that import is a one-time thing only, creating a unique key across all columns:

    create or replace function skip_insert() returns trigger language plpgsql as
    $$ begin return null; end $$;

    create or replace function raise_exception() returns trigger language plpgsql as
    $$ begin RAISE EXCEPTION 'UPDATE changing row values not allowed'; end $$;

    CREATE TRIGGER mytrigger ON records INSTEAD OF UPDATE
        WHEN old.PriKey = new.PriKey and old.Foo=new.Foo and old.Bar=new.Bar
        FOR EACH ROW EXECUTE PROCEDURE skip_insert()

    CREATE TRIGGER mytrigger ON records INSTEAD OF UPDATE
        WHEN old.PriKey = new.PriKey and (old.Foo!=new.Foo or old.Bar!=new.Bar)
        FOR EACH ROW EXECUTE PROCEDURE raise_exception()

    ... but this approach does not handle NULL values.
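
    For the record, the reason is that a plain = comparison yields NULL (not TRUE) when either operand is NULL, so a WHEN condition like old.Foo=new.Foo would silently never match rows containing NULLs. Postgres's IS NOT DISTINCT FROM treats two NULLs as equal; a minimal (untested) sketch:

    -- Plain equality is not NULL-safe: the WHEN condition evaluates to NULL, not TRUE.
    SELECT NULL = NULL;                     -- yields NULL
    -- IS NOT DISTINCT FROM treats two NULLs as equal:
    SELECT NULL IS NOT DISTINCT FROM NULL;  -- yields true
    -- so a NULL-safe WHEN clause would look like:
    --   WHEN (old.Foo IS NOT DISTINCT FROM new.Foo AND old.Bar IS NOT DISTINCT FROM new.Bar)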

      Here's the solution I've ended up with, based on your suggestion:

      CREATE FUNCTION skip_update() RETURNS trigger AS $$
          BEGIN RETURN NULL; END;
      $$ LANGUAGE plpgsql;
      CREATE FUNCTION fail_update() RETURNS trigger AS $$
          BEGIN RAISE EXCEPTION 'can''t modify row values for existing primary key'; END;
      $$ LANGUAGE plpgsql;
      CREATE TABLE Dummy_Test (
          Timestamp TIMESTAMP WITH TIME ZONE PRIMARY KEY,
          Foo TEXT,
          Bar TEXT
      );
      CREATE TRIGGER Dummy_Test_dupe BEFORE UPDATE ON Dummy_Test FOR EACH ROW
          WHEN ( OLD.Timestamp IS NOT DISTINCT FROM NEW.Timestamp
                 AND OLD.Foo IS NOT DISTINCT FROM NEW.Foo
                 AND OLD.Bar IS NOT DISTINCT FROM NEW.Bar )
          EXECUTE FUNCTION skip_update();
      CREATE TRIGGER Dummy_Test_modify BEFORE UPDATE ON Dummy_Test FOR EACH ROW
          WHEN ( OLD.Foo IS DISTINCT FROM NEW.Foo
                 OR OLD.Bar IS DISTINCT FROM NEW.Bar )
          EXECUTE FUNCTION fail_update();
      INSERT INTO Dummy_Test (Timestamp,Foo,Bar) VALUES ('2020-01-02 12:34','Hello','World')
          ON CONFLICT ON CONSTRAINT Dummy_Test_pkey DO UPDATE SET Foo=EXCLUDED.Foo, Bar=EXCLUDED.Bar;
      INSERT INTO Dummy_Test (Timestamp,Foo,Bar) VALUES ('2020-01-02 12:34','Hello','World')
          ON CONFLICT ON CONSTRAINT Dummy_Test_pkey DO UPDATE SET Foo=EXCLUDED.Foo, Bar=EXCLUDED.Bar;
      INSERT INTO Dummy_Test (Timestamp,Foo,Bar) VALUES ('2020-01-02 12:34','Hello','abcde')
          ON CONFLICT ON CONSTRAINT Dummy_Test_pkey DO UPDATE SET Foo=EXCLUDED.Foo, Bar=EXCLUDED.Bar;

      The last statement will fail, which is exactly what I wanted. The FUNCTION skip_update() and TRIGGER Dummy_Test_dupe really only prevent the UPDATE from happening, so they can be omitted - I haven't yet tested whether it's more performant to have the TRIGGER or the UPDATE fire. In any case, thank you very much!

        Just for the record, here's the same thing for SQLite. As above, the TRIGGER Dummy_Test_dupe is optional and I haven't tested the performance difference yet.

        use warnings;
        use strict;
        use DBI;

        my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", undef, undef,
            { RaiseError=>1, AutoCommit=>1 } );

        $dbh->do(<<'ENDSQL');
        CREATE TABLE Dummy_Test (
            Timestamp INTEGER PRIMARY KEY,
            Foo TEXT,
            Bar TEXT
        );
        ENDSQL
        $dbh->do(<<'ENDSQL');
        CREATE TRIGGER Dummy_Test_dupe BEFORE UPDATE ON Dummy_Test FOR EACH ROW
            WHEN ( OLD.Timestamp IS NEW.Timestamp AND OLD.Foo IS NEW.Foo
                   AND OLD.Bar IS NEW.Bar )
            BEGIN SELECT RAISE(IGNORE); END;
        ENDSQL
        $dbh->do(<<'ENDSQL');
        CREATE TRIGGER Dummy_Test_modify BEFORE UPDATE ON Dummy_Test FOR EACH ROW
            WHEN ( OLD.Foo IS NOT NEW.Foo OR OLD.Bar IS NOT NEW.Bar )
            BEGIN SELECT RAISE(ABORT, "same Timestamp but different values"); END;
        ENDSQL

        my $in = $dbh->prepare(<<'ENDSQL');
        INSERT INTO Dummy_Test (Timestamp,Foo,Bar) VALUES (?,?,?)
            ON CONFLICT (Timestamp) DO UPDATE SET Foo=EXCLUDED.Foo, Bar=EXCLUDED.Bar;
        ENDSQL

        $in->execute(12345,'Hello','World');
        $in->execute(12345,'Hello','World');
        my $e;
        eval { $in->execute(12345,'Hello','abcde'); 1 } or do { $e = $@ };
        if ( defined $e ) { print "Third insert failed as expected: $e" }
        else              { die "Third insert didn't fail as expected" }
        The FUNCTION skip_update() and TRIGGER Dummy_Test_dupe really only prevent the UPDATE from happening, so they can be omitted - I haven't yet tested whether it's more performant to have the TRIGGER or the UPDATE fire.

        Interestingly, it made a small but consistent difference. I imported the same dataset (73613 INSERTs) three times into a previously populated database: without the TRIGGER, i.e. with the UPDATE firing, it took an average of 91.540s, and with the TRIGGER, i.e. preventing the UPDATE, it took 99.368s (each of these only +/- about 2s), i.e. a difference of roughly 0.1ms per record. Not a big deal here, and not a rigorous test, but still interesting.

        Update: Importing into a previously empty database also revealed only a tiny difference: overall, the UPDATE variant was roughly 10 seconds faster than the TRIGGER variant, a difference of 0.17ms per record. Again, not really significant.

      Do you perhaps have a tested version? I can't get this to run.

      The trigger functions are all right, it seems, but I don't see how to create the triggers themselves onto the table 'records' so that they react in the desired way. (For one thing, you can't do an 'INSTEAD OF UPDATE' on a table, and when you use a BEFORE INSERT or an AFTER INSERT trigger, this old. and new. business is not going to work.)

        No, sorry - I don't have a tested version, and I'm not on the machine where I did the searches.

        I worked from the Postgres 12 documentation and some Stack Overflow responses, but I can't find them now :( I think it should be a before update trigger, and the statement would be an insert-or-update, maybe. But again, I didn't test any of this, sorry.

      Thanks very much! It's not just a single import; it'll be a large import roughly once a month. So I understand you'd suggest a trigger over a unique key; may I ask why? (Performance?)

        An index across the complete table will eat up lots of disk space, and also will make each insert slow(er). Think of an index as (somewhat structured) hash keys, where you can also quickly look up strings starting with a specific string.

        If your table is largely read-only and you have the disk space to spare, the index will speed up queries maybe a bit more than an index only on the primary keys, because the DB can satisfy the query completely from the index without hitting the table storage.
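
        To illustrate that last point (an untested sketch, using the table and column names from the original question and assuming a unique index across all three columns): Postgres can then often answer such a query from the index alone, which EXPLAIN reports as an "Index Only Scan":

        -- Assuming: CREATE UNIQUE INDEX records_all_idx ON records (PriKey, Foo, Bar);
        -- A query touching only indexed columns can often be answered without
        -- reading the table heap at all ("Index Only Scan" in the plan):
        EXPLAIN SELECT PriKey, Foo, Bar FROM records WHERE Foo = 'Hello';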

        The triggers will slow down all UPDATE statements but leave SELECT untouched and also will not use additional storage.

Re: [OT] SQL "on duplicate key" custom action
by daxim (Curate) on Oct 21, 2019 at 09:30 UTC

      Thanks for all the references! Yeah, I'll go with Postgres then :-)

Re: [OT] SQL "on duplicate key" custom action
by erix (Prior) on Oct 21, 2019 at 09:43 UTC

    if the primary key exists, it should silently ignore the record if all the other columns match, but throw an error if they don't.

    I'd think a unique index including any extra columns would make this version of your INSERT work:

    drop table if exists testtable;
    create table testtable (pk int primary key, foo text, bar text);
    create unique index foobar_idx on testtable (pk, foo, bar);
    INSERT INTO testtable (pk, foo, bar) VALUES (1, 'Hello', 'World') ON CONFLICT (pk,foo,bar) DO NOTHING;
    INSERT INTO testtable (pk, foo, bar) VALUES (1, 'Hello', 'World') ON CONFLICT (pk,foo,bar) DO NOTHING;
    INSERT INTO testtable (pk, foo, bar) VALUES (1, 'Hello', 'Quz' ) ON CONFLICT (pk,foo,bar) DO NOTHING;

    which yields:

    drop table if exists testtable;
    DROP TABLE
    create table testtable (pk int primary key, foo text, bar text);
    CREATE TABLE
    create unique index foobar_idx on testtable (pk, foo, bar);
    CREATE INDEX
    INSERT INTO testtable (pk, foo, bar) VALUES (1, 'Hello', 'World') ON CONFLICT (pk,foo,bar) DO NOTHING;
    INSERT 0 1
    INSERT INTO testtable (pk, foo, bar) VALUES (1, 'Hello', 'World') ON CONFLICT (pk,foo,bar) DO NOTHING;
    INSERT 0 0
    INSERT INTO testtable (pk, foo, bar) VALUES (1, 'Hello', 'Quz' ) ON CONFLICT (pk,foo,bar) DO NOTHING;
    ERROR:  duplicate key value violates unique constraint "testtable_pkey"
    DETAIL:  Key (pk)=(1) already exists.

    Postgres 12.1 (but this works from 9.5.x onward)

    Manual PostgreSQL 12.1: INSERT # SQL-ON-CONFLICT

    update: The first INSERT should be the same, of course, fixed. Also, removed vestiges of the original bash-version.

    update 2: Re 'MySQL vs. Postgres': I think open-sourced and completely-free PostgreSQL is a better/safer choice than proprietary Oracle MySQL. For one thing, over time Pg will get better -- even better than Oracle -- but how could Oracle ever allow cheapo MySQL to overtake its venerable and expensive Oracle database product?

      Thank you very much for your suggestion; it turns out that some of my tables have a lot of columns, and I ran into "ERROR: cannot use more than 32 columns in an index". Since Corion's suggestion should use less disk space I think, I'll be going with my version of that.
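
      In case anyone else hits that limit: one workaround that should work, though I haven't tested it, is to build the unique index over the primary key plus a hash of the remaining columns instead of the columns themselves; a hypothetical sketch with this thread's example columns:

      -- Untested sketch: index the primary key plus a hash of the remaining
      -- columns, sidestepping the 32-column limit. COALESCE makes NULLs
      -- hashable (note it conflates NULL with the empty string), and the '|'
      -- delimiter keeps concatenations unambiguous. md5 collisions are
      -- astronomically unlikely, but this is a heuristic, not a hard guarantee.
      CREATE UNIQUE INDEX records_rowhash_idx ON records
          (PriKey, (md5(COALESCE(Foo,'') || '|' || COALESCE(Bar,''))));
      -- The INSERT's ON CONFLICT target must then repeat the index expression:
      --   ON CONFLICT (PriKey, (md5(COALESCE(Foo,'') || '|' || COALESCE(Bar,'')))) DO NOTHING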

      Thank you for this, I figured there was probably a fairly easy way to do it that I was missing! I'll try and compare it with Corion's trigger suggestion in terms of disk space and speed.

        I tried to get that trigger-version working but I could not (well, not in a satisfactory way). I'm beginning to think it's not possible -- so if you hit upon a trigger-solution, I'd very much like to know. Thanks!

Re: [OT] SQL "on duplicate key" custom action
by cavac (Parson) on Oct 21, 2019 at 15:12 UTC

    If you have the choice, choose PostgreSQL. It might not matter in the beginning, but projects tend to grow in complexity and PostgreSQL is a much better fit when it comes to complex data handling tasks and keeping your data integrity intact.

    Also, if you like PerlMonks, you will most likely enjoy working with the PostgreSQL community.

      Also, if you like PerlMonks, you will most likely enjoy working with the PostgreSQL community.

      That sounds nice :-) What's the Postgres equivalent to PerlMonks?

        There are a lot of places to look at; https://www.postgresql.org/community/ is as good a place to start as any. The mailing lists (yup, those things still exist) are especially helpful; in my opinion they are pretty much the equivalent of PerlMonks.

        If you need help semi-urgently or just want to hang out, #postgresql on irc.freenode.net is also a good place. It has saved my bacon on a couple of occasions.

        Last, but not least, if you need commercial PostgreSQL help for a big project, PostgreSQL got you covered: https://www.postgresql.org/support/professional_support/


        There's IRC: the #postgresql channel on libera.chat [1], if you like that kind of thing. It's especially like latter-day PerlMonks in that there are sometimes long silences, but then suddenly also intensely interesting conversations, use cases, problem cases -- and, even when silent, there is always much expertise.

        [1] https://www.postgresql.org/community/irc/

        Q> What's the Postgres equivalent to PerlMonks?

        A> EXPLAIN
Re: [OT] SQL "on duplicate key" custom action
by perlfan (Vicar) on Oct 13, 2021 at 01:02 UTC
    It depends on what DB, but it's basically the following to "ignore" it:
    INSERT INTO records (PriKey, Foo, Bar) VALUES (1, 'Hello', 'World') ON DUPLICATE KEY UPDATE Foo=Foo;
    Some track the attempts, though, using the equivalent of a column named Attempts; I am not sure how useful that is in practice, tbh:
    INSERT INTO records (PriKey, Foo, Bar) VALUES (1, 'Hello', 'World') ON DUPLICATE KEY UPDATE Attempts=Attempts+1;
    > Since I can choose the DB, what is your opinion on MySQL vs. Postgres?

    It really depends; frankly I don't do anything these days that even requires a DB server, so usually I go with DBD::SQLite. That has its own things to deal with (forks, etc.) but it makes my life a lot easier not to have to maintain DB servers and all that stuff. YMMV based on the application though.

    Most relevant from the YMMV link,

    SQLite does not compete with client/server databases. SQLite competes with fopen().

      It depends on what DB, but it's basically the following to "ignore" it: INSERT INTO records (PriKey, Foo, Bar) VALUES (1, 'Hello', 'World') ON DUPLICATE KEY UPDATE Foo=Foo;

      How does that address my question?
