in reply to Re^2: Documentation of REGEXP support in DBD::SQLite?
in thread Documentation of REGEXP support in DBD::SQLite?

mysql or something with builtin regexps would surely be at least somewhat faster. Btw, leading wildcards like LIKE '%batman%' don't use indices either, however, even with SQLite you can probably speed things up by first doing a rough pass LIKE %batman% then refining that with a regexp either as part of the statement like LIKE '%batman%' AND REGEXP '(?i)\bbatman\b' or just parsing the output of LIKE '%batman%' yourself.

Replies are listed 'Best First'.
Re^4: Documentation of REGEXP support in DBD::SQLite?
by erix (Prior) on Nov 15, 2024 at 06:11 UTC

    SCNR - have to show off postgres' regex indexing on a quickly generated 10M row table.

    -- table size: 1117 MB -- regex index size: 1727 MB (big index = disadvantage) where txt like '%batman%'; --> 'like' means: search without regex; +uses the trgm index txt + ---------------------------------------------------------------------- +----------- zdfrcj batman igihbky bpwz hzzfmq plyazufcawovvjrlwow xkllfiovgktraw + bmnuyjelj Time: 9.911 ms where txt ~ '\ybatman\y'; --> '~' means: regex search; \y: word b +oundary txt + ---------------------------------------------------------------------- +----------- zdfrcj batman igihbky bpwz hzzfmq plyazufcawovvjrlwow xkllfiovgktraw + bmnuyjelj Time: 13.179 ms

    (performance of these regex-indexes (via trigram matching from the pg_trgm extension) is good but the size can be reason not to use them.)

      Thanks for this, I had something like trigrams in mind.

      But for full automatic regex support one would need to analyse the regex and identify the substrings which are in all execution paths ...

      Quite a task.

      Though... Perl does some optimisation when compiling its regexes and it's possible to introspect the result. It might be possible to build on this.

      On a side note: true, the index adds 150% to the datas size. But the speed gain is dramatic enough to justify it in many use cases.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      see Wikisyntax for the Monastery

      Question: how are substrings of size 1 or 2 handled?

      Are they just ignored?

      Or are trigrams-indeces also indexed, such that "ab" is efficiently found in "cab" and "abs" ?

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      see Wikisyntax for the Monastery

        Question: how are substrings of size 1 or 2 handled? Are they just ignored?
        explain analyze select * from azjunk7n where txt ~ 'ba'; -- '~' means: consider regex +index QUERY PLAN + ---------------------------------------------------------------------- +----------------------------------------------- Seq Scan on azjunk7n (cost=0.00..267879.16 rows=707066 width=85) (ac +tual time=5.413..9163.252 rows=897633 loops=1) Filter: (txt ~ 'ba'::text) Rows Removed by Filter: 9102367 Planning Time: 0.360 ms Execution Time: 9189.173 ms (5 rows) Time: 9190.029 ms (00:09.190)

        Nine seconds. Because, of course, if there are too many hits (here: 897633), the system switches to SeqScan - after all, a sequential scan is the fastest way to access many rows. Faster would've been: where position('ba' in txt) > 0 which would SeqScan in 3 seconds; but position() doesn't allow regexen.