in reply to Re^6: Writing a database lookup tool
in thread Writing a database lookup tool
Re^8: Writing a database lookup tool
by elef (Friar) on Jan 04, 2013 at 20:05 UTC
"let's assume a there are 15 million records with a 100 characters in each (in the field that we're searching). I look up a 10-character string. There are 1000 hits. How much time would it take for those 1000 hits to be found if the database design and implementation is not particulary well optimized? 0.01 second? 1 second? 5 seconds?" Finding the records in which a 10-character string occurs in a given field would qualify as a "simple" query, I would think. | [reply] |
by erix (Prior) on Jan 05, 2013 at 08:31 UTC
(It sounds to me like you need a word indexer such as Lucene or SOLR -- I don't know much about those. I'll only talk about RDBMS, and then mainly PostgreSQL; I hope that's still interesting.)

"Finding the records in which a 10-character string occurs in a given field would qualify as a 'simple' query, I would think."

A query may look simple but still be slow, because it has to do a lot of work to retrieve the correct result. The usual performance complication is a query that cannot use an index, or that for some other reason has to read the whole of the table data (a.k.a. a sequential scan, or seqscan), and/or that cannot keep the whole of the data (or index data) in memory.

For most (or all) databases, searching for an arbitrary substring of a field precludes the use of an index. (Word search is of course easier, and is what Lucene and similar tools handle.) If such a query is a main or important use in your application, you can estimate the expected response time from disk speed and database size, and consider whether it's acceptable. Or simply try it out.

I happen to have a 10M row table handy in Pg (postgres), called azjunk7, which is roughly similar to what you would have. It is somewhat smaller (10M rows) and narrower (1 column, each row filled with 80 random characters) than what you mentioned. Searching it for 'erix', which happens to occur 1008 times, takes about 5 seconds in postgres; grep over the same data takes about 1 second.
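For illustration (the original timing output isn't reproduced above), a query of that shape might look like the sketch below. The column name txt is my assumption, not something stated in the post. EXPLAIN makes the cost visible: an unanchored LIKE '%...%' pattern cannot use a plain btree index, so the plan is a sequential scan over all 10M rows.

```sql
-- Hedged sketch, assuming the single text column of azjunk7 is named "txt".
-- An unanchored substring pattern forces a full sequential scan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM   azjunk7
WHERE  txt LIKE '%erix%';
```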
(Interestingly (but admittedly not of much use to you right now), there is an improvement in the works for postgresql 9.3 (unreleased) that enables indexing on regular expressions. On my copy of that version this same query takes just 15 ms (milliseconds) :-) )

I for one will be interested to hear what you eventually choose, and how it performs.

(All the above on PostgreSQL 9.3devel, on a simple 4 GB desktop, AMD FX8120, with a single SATA/7200 disk.)
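That 9.3 work is presumably the trigram (pg_trgm) regex-index support; whether it is exactly what the poster benchmarked isn't stated. A hedged sketch of such a setup, again assuming a column named txt:

```sql
-- Hedged sketch: a trigram index via the pg_trgm extension.
-- pg_trgm has been able to back LIKE/ILIKE substring searches since 9.1;
-- the 9.3 work referred to above extends this to regular expressions.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX azjunk7_txt_trgm ON azjunk7 USING gin (txt gin_trgm_ops);

-- With the index in place, the planner can replace the sequential scan
-- with a bitmap index scan for the same kind of query:
EXPLAIN ANALYZE SELECT count(*) FROM azjunk7 WHERE txt ~ 'erix';
```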
by elef (Friar) on Jan 18, 2013 at 11:45 UTC
Your query times give me pause. Ideally, I'd be shooting for about 1 sec on most searches. If 5+ second query times are the norm, that kind of kills it, because this thing would be used for quite frequent lookups. Obviously, I don't want to spend weeks building this thing only to find out that it's too slow to be practical when I load it up with all my data, so I'll have to test any tool I consider with all the data before I go ahead.

Now, many searches could be done as full-word searches. Would those be radically faster than general substring searches if I were to use SQL via DBI or something similar, or would I need to use something that was optimised for text search (e.g. Lucene/Solr) to get that performance benefit? I posted a question on FTS in reply to NetWallah's post further down that's kind of related.

At this point the whole thing is just an idea I'm considering. I guess I will play with some of the tools suggested here, and maybe look into completely different approaches like trying to use LibreOffice Base.
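On the full-word question: the usual PostgreSQL answer is built-in full-text search with a GIN index over a tsvector, which lets word lookups hit an index instead of scanning every row. A hedged sketch (column name txt assumed; other RDBMSs have their own FTS equivalents):

```sql
-- Hedged sketch of whole-word search with PostgreSQL full-text search.
-- The expression in the index must match the expression used in the query.
CREATE INDEX azjunk7_txt_fts
    ON azjunk7 USING gin (to_tsvector('simple', txt));

-- Word lookups can now use the GIN index rather than a sequential scan.
SELECT count(*)
FROM   azjunk7
WHERE  to_tsvector('simple', txt) @@ plainto_tsquery('simple', 'erix');
```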
by choroba (Cardinal) on Jan 05, 2013 at 07:35 UTC
لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ