I'm building a text search app using DBD::SQLite and FTS4. I'm completely new to databases, but I got the basic things working well. Now there are a couple of things I need advice on.
(1) I would like to compress the databases. A lot of people will be using dbs that contain 10 million records or more (upwards of 10GB or even 50GB uncompressed). This is all UTF-8 text data, some of it very repetitive so it compresses well. How do I go about this? The DBD::SQLite cpan page mentions compression briefly, and there is a bit on this at http://www.sqlite.org/fts3.html#section_6_1 but I can find no actual working sample code that I could use. The sqlite.org page seems to assume that I will write the compress and uncompress functions myself, which is way above my pay grade. Is there a ready-made solution somewhere that I am missing?
(2) Currently, my app can handle one database file at a time. It does a search and returns the results ordered by length. I would like to add support for multiple database files, allowing the user to set a ranking and displaying hits in that order (first, all hits from 'tier 1' dbs ordered by length, then all hits from 'tier 2' dbs ordered by length etc.). Is this feasible? Would I need to ATTACH each of the dbs to the same connection? How can I run the same query on all the dbs? Create a foreach my $db (@databases) loop and iterate through them, executing the query in each one in turn?
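To make the loop idea in (2) concrete, here is a minimal sketch of what I have in mind, not a finished implementation. The table name docs, the column name content, the sub name search_tiers, and the layout of the tier list are all hypothetical placeholders for illustration. It opens each file in turn rather than using ATTACH.

```perl
use strict;
use warnings;
use DBI;

# Run the same FTS4 query against every database file, tier by tier.
# ASSUMPTIONS (hypothetical names): each file holds an FTS4 table "docs"
# with one text column "content"; $tiers is an array ref of array refs of
# file names, with the highest-ranked tier first.
sub search_tiers {
    my ($query, $tiers) = @_;
    my @results;
    for my $tier (@$tiers) {
        my @hits;
        for my $file (@$tier) {
            my $dbh = DBI->connect("dbi:SQLite:dbname=$file", "", "",
                { RaiseError => 1, sqlite_unicode => 1 });
            # Bind the search term instead of interpolating it into the SQL.
            my $rows = $dbh->selectcol_arrayref(
                "SELECT content FROM docs WHERE docs MATCH ?", undef, $query);
            push @hits, @$rows;
            $dbh->disconnect;
        }
        # Within a tier, shortest hits first; tiers stay in ranking order.
        push @results, sort { length($a) <=> length($b) } @hits;
    }
    return @results;
}
```

It would be called as search_tiers('cat', [ ['tier1.db'], ['tier2_a.db', 'tier2_b.db'] ]). Every hit from a tier is held in memory before sorting, which may or may not be acceptable with databases of the size described in (1).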
(3) I would like to add match highlighting but the DBD::SQLite page says "The current FTS implementation in SQLite is far from complete with respect to utf8 handling : in particular, variable-length characters are not treated correctly by the builtin functions offsets() and snippet()." Has that been fixed since this was written or should I forget about offsets() and try to write my own code that tries to analyze each string returned from the database and find the search terms in it so I can highlight them?
I can post my current (working) code if anyone's interested.