in reply to Documentation of REGEXP support in DBD::SQLite?

See REGEXP-function of DBD::SQLite, emphasis added.

Update

Short version: Perl regexes are hooked into sqlite using a callback mechanism. But filtering row-wise. That's much slower than a native implementation.

Cheers Rolf
(addicted to the Perl Programming Language :)
see Wikisyntax for the Monastery

Replies are listed 'Best First'.
Re^2: Documentation of REGEXP support in DBD::SQLite?
by ibm1620 (Hermit) on Nov 14, 2024 at 20:16 UTC
    Thank you for finding that for me!

    Perl regexes are hooked into sqlite using a callback mechanism. But filtering row-wise. That's much slower than a native implementation.

    I'm not quite clear on what you mean by "native implementation" - would linking in libpcre2 (or another) as a custom function be expected to perform better? Or would I be better off using a different DBMS with built-in REGEXP support such as mySQL?

    (It does appear that both mySQL and SQLite REGEXP can't use the index and have to search the data itself.)

      I don't know if they perform better. You have to test it.

      Native means for me without call overhead and efficiently compiled (subset of) regexes.

      Regarding indexes, I can only imagine a small subset of regexes capable to profit from them, unless a lot of special case optimisation was implemented¹.

      It should be quite complicated to achieve this with a pluggable extension...

      But again I don't know ... This you should better ask at a DBM-board.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      see Wikisyntax for the Monastery

      ¹) I'm not even sure a substring search with LIKE %substr% can take advantage from the index, the differences in your benchmark are not in magnitudes, this could be easily explained with "call overhead and efficiently compiled code".

      A real "index search" should be dramatically faster than just factor 4.

      And, as a side note, your regexes were much more complicated than a substr search. Apples and oranges...

        Regular expressions can be implemented in databases with quite a bit of success. I use them for some stuff in PostgreSQL and it is blazingly fast. It shouldn't matter for complicated the regular expressions are, the simple fact that the database software only has to read a (relatively) small index file from disk instead of everything in the table (full table scan) should still speed up the search. And many databases also keep frequently used indexes in RAM.

        Generally, i think of SQLite more of a desktop-type database that is integrated into a single application with a relatively small amount of data. For serious data crunching, i always choose a serious standalone database like PostgreSQL (hey, it's the same price as SQLite).

        PerlMonks XP is useless? Not anymore: XPD - Do more with your PerlMonks XP
        Also check out my sisters artwork and my weekly webcomics
      mysql or something with builtin regexps would surely be at least somewhat faster. Btw, leading wildcards like LIKE '%batman%' don't use indices either, however, even with SQLite you can probably speed things up by first doing a rough pass LIKE %batman% then refining that with a regexp either as part of the statement like LIKE '%batman%' AND REGEXP '(?i)\bbatman\b' or just parsing the output of LIKE '%batman%' yourself.

        SCNR - have to show off postgres' regex indexing on a quickly generated 10M row table.

        -- table size: 1117 MB -- regex index size: 1727 MB (big index = disadvantage) where txt like '%batman%'; --> 'like' means: search without regex; +uses the trgm index txt + ---------------------------------------------------------------------- +----------- zdfrcj batman igihbky bpwz hzzfmq plyazufcawovvjrlwow xkllfiovgktraw + bmnuyjelj Time: 9.911 ms where txt ~ '\ybatman\y'; --> '~' means: regex search; \y: word b +oundary txt + ---------------------------------------------------------------------- +----------- zdfrcj batman igihbky bpwz hzzfmq plyazufcawovvjrlwow xkllfiovgktraw + bmnuyjelj Time: 13.179 ms

        (performance of these regex-indexes (via trigram matching from the pg_trgm extension) is good but the size can be reason not to use them.)