BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:
I have an application where I need to be able to map some strings to some integers and then be able to lookup the integers from their strings or vice versa. (NOTE: both strings and integers are unique!)
In Perl this can be (and currently is) achieved by using two hashes:
my %intByStr = populate(); my %strByInts; $strByInt{ intByStr{ $_ } } = $_ for keys %intByStr; ...
But the penalty for this is that every key and value is stored twice. Which means for a fairly modest dataset of just 450,000 key/value pairs, the total storage requirements are 95MB:
$i=1; $intByStr{ $_ } = $i++ for 'aaaa'..'zzzz';; $strByInt{ intByStr{ $_ } } = $_ for keys %intByStr;; print total_size( $_ ) for \(%intByStr, %strByInt);; 50805928 44297159
As my dataset can be somewhat larger, I'd like to minimise the space requirement. A modest improvement can be achieved by storing both mappings in a single hash:
undef %AbyB; $i=1; $AbyB{ $_ } = $i, $AbyB{ $i++ } = $_ for 'aaaa'..' +zzzz';; print total_size( \%AbyB );; 80479783
The potential space requirement for my dataset, which might grow to 10x (or 20x or more) the size of the examples above, is such that I'm looking for an alternative, even if it means coding the solution in (Inline::)C, but simply using Perl's hashes from C won't gain me anything.
I could use a more space efficient hash implementation, which I found a few of on the net; that usually come with a penalty of less flexibility which I could live with for this purpose.
Somewhere in the back of my mind I seem to recall a bidirectional lookup mechanism/algorithm that didn't require the duplicate storage, but for the life of me I cannot remember anything about how it worked; and my attempts to search for "bidirectional search algorithm" all turn up variations on this which is a shortest path between two node of a graph algorithm and thus not useful for me.
The strings are usually fairly short, 1 to 8 characters, but sometimes can extend further upto 32 or occasionally more. (Think: typical variable names!)
The integers are (unique) 64-bit values. Think memory addresses for 64-bit processors.
The data is loaded once. The two lookup mechanisms (hashes/arrays/indexes/whatever) would be built once as, or after, the data is read; so build/insert speed is not a great concern.
The vast majority of the uses will be lookups, many millions, so the priority is the lookup speed -- preferably O(1) though O(logN) might be acceptable if the memory reduction was sufficient.
The data is discarded at the end of each run. There are no deletes.
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Bidirectional lookup algorithm? (judy trie?)
by Anonymous Monk on Jan 04, 2015 at 23:47 UTC | |
| [reply] |
by BrowserUk (Patriarch) on Jan 05, 2015 at 00:18 UTC | |
Have you considered a Trie like a Judy array? specifically Judy::L (int to pointer) and Judy::HS (string to int/pointer) I hadn't. In the past; I've avoided Judy arrays for the simple reason that I hate to use stuff I don't understand. And whilst I 'understand' the principles behind Judy arrays, the code is so complex that I doubt my ability to maintain it; and hate to get stuck needing other people to do things for me. And I just rediscovered another reason I've not used them:
Recurse: see recurse! With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
by Anonymous Monk on Jan 05, 2015 at 02:34 UTC | |
<Reveal this spoiler or all in this thread>
Forget it, total nonsensical Module::Build maze | [reply] [d/l] |
|
Re: Bidirectional lookup algorithm? (Updated: further info.)
by ikegami (Patriarch) on Jan 05, 2015 at 16:04 UTC | |
So you have something like the following:
A copy of the address and of the identifier are stored in the elements of each hash. (The key of an element is stored in the element to distinguish elements in collisions.) You can cut the number of scalars (or is it string buffers?) in half as follows:
Of course, constructing is going to be much slower. I think you can avoid iterating through the hash (and avoid using Data::Alias) by dropping to C, but if you can do that, you could use a C-based data structure and use strings instead of scalars. Update: It seems that only the string buffers are getting shared, not the entire scalar. | [reply] [d/l] [select] |
|
Re: Bidirectional lookup algorithm? (Solution.)
by BrowserUk (Patriarch) on Jan 11, 2015 at 18:15 UTC | |
Okay. I have my first pass solution completed, and here (as promised) is the code (in the form of an Inline::C script for RAD and testing), and a brief description. Description:The code presents a class: BiMap; which supports the following methods: The basic structure of the BiMap contains pointers to two parallel arrays: byInt[] of PAIR structures; byStr[] of U32 integers;
The PAIRs are inserted into byInt[] hashed by the (U64)addr in the PAIR. The byStr[] holds indexes (+1 to allow 0 to represent empty), of the PAIRs, hashed by the (char*)name in the PAIR. Using indexes into byInt[] rather than address of the pairs themselves saves half the space at the cost of an extra indirection. As a large majority of the names will be less than 8 characters; when that is the case the char* name component of the PAIR is used to hold the string directly, rather than a pointer to it, thus saving another 8-bytes per compliant PAIR. The high bit of the 8th byte is set to distinguish between directly stored strings and addresses of strings. I tested several combinations of: a) various hash functions; and b) various probing algorithms; and in my (simplistic) testing; no other combination beat the performance of the (Jenkins) one-at-a-time hash function erstwhile used by Perl; and the very simple, +1 probing mechanism. Results:Using a single, combined Perl hash to look up 11 million strings from their associated 64-bit integer address, and vice versa, takes a total of 15.9 seconds and requires 4.17 GB:
Using BiMap on the same dataset using a table presized to accomodate 16 million pairs, and a 0.71 fill factor. takes 22.3 seconds and requires just over 1/2GB:
By moving to a fill factor of 0.70 and pre-sizing the arrays to accommodate 33 million pairs, takes 19.8 seconds and requires just over 3/4GB:
Conclusion:Using a BiMap in preference to a Perl hash will allow 8 times as much data to be processed per run; thus both reducing the number of runs required whilst increasing the statistical validity of the results by well over 8 times. Size for size, the total runtime of the tests will increase by approximately 40%. A fair penalty for the substantial increase in statistical validity. Other options:
The code: <Reveal this spoiler or all in this thread>
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
| [reply] [d/l] [select] |
by tye (Sage) on Jan 11, 2015 at 22:08 UTC | |
It seems possible that Judy arrays would reduce the space requirement further and (possibly) increase the performance. Just a shame that the Build mechanism is broken. Since Judy seemed the most promising approach in my personal opinion, I did take a quick look at the complexities of getting the Perl module working under Windows. There seems to be at least 3 ways to build the Judy library itself under Windows:
The last choice seems, by far, the easiest. The code included in Alien::Judy even includes the *.bat file. But since I didn't already have Visual Studio, I stopped looking at that point. I didn't reply as what I discovered was found easily so I guessed you'd either find the same thing soon enough or wouldn't be that interested in using Judy. Your node above (thanks for sharing your implementation) makes me think you may still be interested in Judy and also didn't yet find the *.bat file. So that little bit of information may encourage you and/or just save you some small bit of research. The next steps (getting Judy to not fail to build just because Alien::Judy isn't installed and getting the build process for Judy to find the needed include and library files) seem to me to likely be fairly simple. I may still eventually get around to jumping through those hoops (though, to be honest, there are a lot of higher-priority items on my to-do list right now). If I do, I'll post more details. If you do, please also post what you find. Thanks. - tye | [reply] |
by BrowserUk (Patriarch) on Jan 14, 2015 at 22:02 UTC | |
I finally managed to build a 64-bit version of Judy. I started with this one-file/1250 line version and hacked out all the -DSTANDALONE and -DASKITIS stuff along with all the BIG_ENDIAN stuff; extracted a Judy.h; and got the filesize down to 965 lines and built it into a dll:
I then wrote a C program to us it to create two Judy arrays and stored my test data 'aaaaa'..'zzzzz' paired with a 64-bit integer: built it against the dll:
A run:
Then I built it as an Inline::C module, adding method wrappers for the important functions: Unfortunately, in this form, the runtime increase -- mostly I think due to the perl->C->perl transitions -- from 6.5 seconds to over 25s:
So, whilst it does use somewhat less memory than my BiMap version; is also somewhat slower. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
| [reply] [d/l] [select] |
by syphilis (Archbishop) on Jan 15, 2015 at 00:49 UTC | |
by BrowserUk (Patriarch) on Jan 15, 2015 at 06:33 UTC | |
| |
by Anonymous Monk on Jan 12, 2015 at 00:07 UTC | |
Read more... (935 Bytes)
The Judy-0.41-new.patch Read more... (5 kB)
All tests pass except for the odd Judy-0.41\t\41typemap.t which tries to compile Judy-0.41/t/MAGIC/lib/MAGIC.xs ... which would have to be patched similarly but I didn't bother | [reply] [d/l] [select] |
by syphilis (Archbishop) on Jan 12, 2015 at 11:01 UTC | |
by BrowserUk (Patriarch) on Jan 12, 2015 at 11:53 UTC | |
| |
by BrowserUk (Patriarch) on Jan 12, 2015 at 00:34 UTC | |
by Anonymous Monk on Jan 12, 2015 at 02:04 UTC | |
by bitingduck (Deacon) on Jan 24, 2015 at 06:50 UTC | |
This may not be as horrible as it sounds if your code is going to get a lot of use. Minimal Perfect Hashing looks like the only way you're going to get O(1) on lookups, and if you're going to be getting bigger and bigger data sets, the time up front starts to look more cost effective. A bit of digging around found a real algorithm that someone actually implemented in C and tested, designed for use with both integer keys and character keys. Unfortunately I've only found the original paper and a later one with pseudocode, nothing quite canned. The paper with pseudocode is here: Czech, Havas, Majewski, and an earlier paper that's a little denser to interpret is here Havas, Majewski. This page: Schnell gives a pretty readable interpretation of it as well. It doesn't look too painful to implement. I also found what looks like a similar algorithm that appears to include bi-directionality implemented in a python package call pyDAWG that you could either call or port. A little more digging around on the relevant names might find you some c code that does about the same EDIT: allegedly there's a Java package floating around called "ggperf" that's an implementation of the CHM algorithm (and is supposed to be faster than a "gperf" minimal perfect hash generator that's part of libg++) but I couldn't find source, just a paper that amounts to documentation for ggperf | [reply] |
by BrowserUk (Patriarch) on Jan 24, 2015 at 13:15 UTC | |
bitingduck thankyou for your research. The links provided for fascinating, (if at times somewhat bewildering :) reading. In particular the Schnell article made the CHM algorithm understandable; and the pyDAWG code is relatively concise and understandable. I think I understand the process (if not the implementation) enough to see a problem (for my application). Given that my datasets are generated at runtime, and the goal is to be able to make them as large as available memory will allow (but no larger), it would be necessary to run the CHM algorithm within the same process, and after the dataset has been generated. The problem is, that even using the most parsimonious implementation of a DAG, the combined size of the node and edge structures (with their required pointers) require substantially more memory per string-number pair than the data itself. Using the pyDAWG implementation as an example; each pairing requires two DAWGNodes at 18bytes each; and each word required one DAWGEdge struct per letter (say ave. 5*9 + 8*9 = 117bytes ). Thus a quick estimate is that for each 13 bytes of data (5-byte string + 8-byte integer); the algorithm would require a minimum of 117+38 = 155 bytes of structs * no_of_pairs, in order to build the DAG. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
| [reply] |
by bitingduck (Deacon) on Jan 24, 2015 at 17:14 UTC | |
by BrowserUk (Patriarch) on Jan 24, 2015 at 19:22 UTC | |
by shmem (Chancellor) on Jan 11, 2015 at 20:27 UTC | |
That line below your signature is somewhat enigmatic: The code present a class:BiMap, which supports the following methods: 7 ) { strncpy( (char*)( iHash script for RAD and testing), and a brief description. , CLEAN_AFTER_BUILD =used, newSize ); for( i = 0; i = pair; bm--1 I've tried setting up a bidirectional hash using XS code. It makes both key and values into keys and misuses the IV slot of SV of a hash entry to store the pointer to the crossover hash key. However, the benchmarks are disappointing:
Improvement of 22% in memory and more than double time for lookup? Why? Is it due to method call/XS overhead? XS code below. <Reveal this spoiler or all in this thread>
perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
| [reply] [d/l] [select] |
by BrowserUk (Patriarch) on Jan 11, 2015 at 21:48 UTC | |
That line below your signature is somewhat enigmatic: It's an automatic appending of a collection of random snippets from the body of the post. It happens quite often when I post here. I try to remember to check for them and remove them; but sometimes I forget. I'm told (by the PMDevs) that it is due to the browser I use -- Opera -- but the fact that it never happens on any other site I use makes me think it is more to do with PM than Opera; but its always easier to blame someone else's tools than fix your own code. more than double time for lookup? Why? Is it due to method call/XS overhead? I would assume so. Any XS code is always going to be at a disadvantage compared to built-ins; that's inevitable. Which means you really have to parr everything down to even begin to compete. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
| [reply] |
by Anonymous Monk on Jan 11, 2015 at 23:00 UTC | |
by BrowserUk (Patriarch) on Jan 12, 2015 at 00:22 UTC | |
|
Re: Bidirectional lookup algorithm? (Updated: further info.)
by soonix (Chancellor) on Jan 05, 2015 at 10:22 UTC | |
| [reply] [d/l] |
by BrowserUk (Patriarch) on Jan 05, 2015 at 13:15 UTC | |
That certainly is worth considering as it reduces the memory requirement to ~1/6th compared to the single hash method. However, the penalty is that the lookups run ~30x more slowly, so a job that currently takes ~8hrs would require 10 days! Hence, it will be a last resort for when I cannot find anything quicker. Single hash:
Sqlite & :memory:
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
by Anonymous Monk on Jan 05, 2015 at 11:36 UTC | |
| [reply] |
|
Re: Bidirectional lookup algorithm? (Updated: further info.)
by RichardK (Parson) on Jan 05, 2015 at 13:25 UTC | |
Sorry, I don't have any pointers on bidirectional lookups, but have you considered using DB_File to tie your hash to a database file?. It looks like your data will map nicely to a BTREE and lookups should be fast. You could run the database completely in memory or tweak the cache size to suit your requirements. It will be easy to try with your existing code if you want to give it a whirl. I tried a single hash btree from your example above ( 'aaaa' - 'zzzz') and it created a ~11Mb file, so it seems to pack pretty well. | [reply] |
|
Re: Bidirectional lookup algorithm? (Updated: further info.)
by roboticus (Chancellor) on Jan 05, 2015 at 19:11 UTC | |
I have some couple questions: There are likely other questions I could/should ask, if only I could think of them... ...roboticus When your only tool is a hammer, all problems look like your thumb. | [reply] |
by BrowserUk (Patriarch) on Jan 05, 2015 at 20:02 UTC | |
I'm seeking a way to reduce the memory footprint without too much loss of performance, in order that I might increase the statistical validity of the simulation. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
by roboticus (Chancellor) on Jan 08, 2015 at 13:11 UTC | |
| [reply] | |
|
Re: Bidirectional lookup algorithm? (try perfect hashing)
by oiskuu (Hermit) on Jan 07, 2015 at 17:09 UTC | |
You could lump longer symbols together a la join qq(\0), grep length() > 8, @syms; access via ptr or offset. Keep short ones directly in the 8-byte slots of the symbol table. (union). This should get you down to ~10 bytes per key. To index the tables, use minimal perfect hashing. There's the CMPH library, which appears quite simple to use. Though you'll need to work on the glue — cmph bindings via Perfect::Hash are incomplete, and it's not on CPAN. Anyway, with perfect hashing you need but a few extra bits to do the indexing. Various methods are available. CHM needs ~8 bytes per key, but preserves order. BRZ is about ~8 bits per key, uses external memory while building the tables so you're not mem-constrained. BDZ is just under 3 bits per key. Storing the 8-byte keys as you are already, these few extra bits aren't much. Performance estimates: about 3 sec per million keys when constructing the mph structures; about .3 sec per million lookups. Compared to .15 usec lookups with direct hashes. So expect roughly 3 times slower performance with about 10 times the memory density. If the tradeoff is worth the effort, I'll leave for you to decide. | [reply] [d/l] [select] |
by oiskuu (Hermit) on Jan 11, 2015 at 02:52 UTC | |
Well, I experimented a little further with the CMPH. It might be rough around the edges (e.g. temp. files aren't cleaned up or created safely), but the good thing is it basically works. First, I generated some data as follows.... that took awhile. In case many longer keys are present, it might be better to go with (4-byte) offset records. Simple and compact, but there's one more indirection, hence slower access.
Then a small test. You can see I didn't bother mapping value indexes but used order-preserving hash instead. Memory usage comes to about 26 bytes/pair. Double that while building the hashes.
Edit. Same test with updated code; higher -O used.
<Reveal this spoiler or all in this thread>
| [reply] [d/l] [select] |
by BrowserUk (Patriarch) on Jan 11, 2015 at 09:46 UTC | |
oiskuu. Thanks for pursuing this and posting your findings. Unfortunately, I do not understand what I am reading? With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
| [reply] |
by oiskuu (Hermit) on Jan 13, 2015 at 02:21 UTC | |
Ok. Here's a crude demo in case someone is interested in doing benchmarking comparisons or whatever. The program is pure C, tested on linux 64-bit. Give two arguments: Nkeys and Filename. File should contain sym-value pairs, both unique in their domains. (Space-separated, one pair per line).
Update. Some additional notes regarding the program. Read more... (1139 Bytes)
| [reply] [d/l] [select] |
|
Re: Bidirectional lookup algorithm? (Updated: further info.)
by Eily (Monsignor) on Jan 07, 2015 at 19:28 UTC | |
I don't know if you have found what you were looking for or not, but here is another way to win a little space: hashes use the stringified value of keys, so most 64 bits ints will be written with 18 bytes or more*. But if you pack them before, it will only use 8 bytes. This shoudln't change much for values, though it would avoid the scalars from having both a string and int value. On my computer:
(numbers past a certain point are stringified to the same thing) Think: typical variable names!if that's \w+, that's 63 possible values per byte. So there probably is a possibility of compression here as well, even maybe huffman coding. But if most of those strings are only up to 8 bytes long, I'm not sure there's going to be an actual improvement (though in this case, this will save space both in the keys and the values). * Numbers below 2^27 may have less than 4 digits in base 10, but that's one number every 2^37. And numbers above 2^30 are all shorters once packed than stringified, and there's 2^34 more numbers above that point than below. | [reply] [d/l] [select] |
by graff (Chancellor) on Jan 07, 2015 at 23:06 UTC | |
Apart from that, there's a potential risk that some 64-bit int values, when packed into an 8-byte string, might collide with actual strings encountered in the data. There's a better way to pack integers into smaller strings. The following "encode/decode" functions will always return a string of characters in the range "\x80" - "\xff", and the number of characters will be roughly half the number of decimal digits. (I'm assuming that BrowserUK's strings will always be in the ASCII range, so I'm using non-ASCII characters that happen to be stored internally as single bytes in perl): (There might be more efficient ways to implement that.) Even so, the overall space saved seems a bit disappointing - on my (macports, 5.12) version of perl, encoding the numbers like that over the ~914K entries in BrowserUK's single-hash example (strings 'aaaa' .. 'zzzz', numbers incrementing from 1) yielded just 4.1% reduction in the total_size of the hash. (If I use random integers ranging up to 15 digits, it edges closer to 6% reduction.) | [reply] [d/l] |
by BrowserUk (Patriarch) on Jan 07, 2015 at 20:30 UTC | |
(numbers past a certain point are stringified to the same thing) I have 64-bit ints here, so that isn't a problem for me. I did consider packing the keys, but the space saved is only around 15% to 20%; but the performance penalty is considerable. Thanks. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
by oiskuu (Hermit) on Jan 09, 2015 at 05:05 UTC | |
Exponentiation is implemented via pow internally, so the result is forced to floating point. Better use the shift operator: $a = (1<<53)+1; ... Set size of 35e6 has been mentioned, and the desire to push it further. Elements are unique. In decimal notation, certainly more than half of the values are going to be at least eight digits. Similar concerns hold for the symbols. 35e6 is just about 25 bits, but there would be no problem if the coding weren't sparse to begin with, so let's add a few bits for good measure. Base64 is 6 bits per char, [a-z] is 4.7 bits. In those cases, most of the symbols are going to be no less than five or six characters, respectively. | [reply] [d/l] [select] |
|
Re: Bidirectional lookup algorithm? (Updated: further info.)
by shmem (Chancellor) on Jan 05, 2015 at 18:02 UTC | |
Since your strings and integers are unique - why don't you use a hash for strings and an array for integers
and look up the strings by index into the array, and the integers by key into the hash?
perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
| [reply] [d/l] |
by Laurent_R (Canon) on Jan 05, 2015 at 19:00 UTC | |
| [reply] |
by shmem (Chancellor) on Jan 05, 2015 at 22:34 UTC | |
Right you are.
Having 8 GB RAM. 2**28 doesn't die.
perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
| [reply] [d/l] |
|
Re: Bidirectional lookup algorithm? (Updated: further info.)
by shmem (Chancellor) on Jan 06, 2015 at 16:28 UTC | |
It seems to me that you need a simplified hash datastructure, where adding a tuple (hk,he) to the hash stores the he as a new hk in the hash table and makes the entries of both keys into pointers to each other. That structure would in fact be a hash consisting only of keys and pointers to keys. This hash structure probably would consume considerably less memory space than a full featured perl hash, whilst being maybe even faster at lookup. Maybe it is feasible to set up such a thing via XS and (possibly misusing) perl macros. Looks interesting enough to try it... update: I've made some meek attempts to cobble together such a thing and wormed myself through macros... but I didn't even scratch the surface, I'm afraid. For such a thing to be possible, a HE should be able to hold a single PV -and nothing more. But AFICS a hash value must be a SV, which means that even for undef hash values, the whole struct sv is allocated. Populating a hash with undef values makes no difference in size to populating it with integers. Which leads me to think about a huge optimization possibility regarding memory consumption. Why do we need a whole struct SV for empty value fields? Shouldn't they be subject to autovivification?
perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
| [reply] [d/l] |
|
Re: Bidirectional lookup algorithm? (Updated: further info.)
by salva (Canon) on Jan 06, 2015 at 01:10 UTC | |
I have not investigated it deeply enough to known if it is usable for your purposes. | [reply] |
|
Re: Bidirectional lookup algorithm? (Updated: further info.)
by flexvault (Monsignor) on Jan 06, 2015 at 01:53 UTC | |
Just a thought about my attempt about 'Bidirectional Lookup'. A technique used with 'Big Data' databases is 'database sharding'. It is usually used to split a very large databases between multiple computers. But for your requirements, it could be fulfilled using an array of multiple scalars. Depending on the 'key/address' you could select where the start and end would be for each element of the array, so you could use 'index' on a sub-set of the data. Obviously, the larger the scalar, the longer the 'index' function will take. To use 'sharding' you have to know a lot about your data, but it may improve the lookup results dramatically, which is what you seem to need to satisfy your requirements. Just a thought! Regards...Ed "Well done is better than well said." - Benjamin Franklin | [reply] |
|
Re: Bidirectional lookup algorithm? (Updated: further info.)
by andal (Hermit) on Jan 05, 2015 at 15:57 UTC | |
Well, in C I would keep the array with structures (string + integer) and then add 2 more arrays with integer offsets into array with structures. The latter 2 arrays are sorted appropriately. Then lookup speed is O(logN). The above approach is simple, saves space and still provides descent speed for lookups in both directions. One can also use random records instead of array with all records, but then "indexes" maybe be larger on 64-bit systems since addresses are 8 bytes and offsets can be kept 4 bytes. | [reply] |
by oiskuu (Hermit) on Jan 05, 2015 at 17:14 UTC | |
I was thinking pretty much along the same lines. The OP, however, seems to want it both compact and fast. Binary search will likely not satisfy the latter requirement. Couple links I googled up that might be of interest here: | [reply] |
|
Re: Bidirectional lookup algorithm? (Updated: further info.)
by jmacloue (Beadle) on Jan 05, 2015 at 16:26 UTC | |
It seems you are asking a wrong question - if you want to do fast lookups you need to have a pair of sorted lists (e.g., two hashes) which need lots of RAM, if you want to use as little RAM as possible - you will need to search over an unsorted list during at least one lookup which of course is very slow. I'm all with the guys suggesting using a ready-made solution like SQLite (and, yes, it is slow but not as slow as walking through an unsorted array in Perl). Unless, of course, your data allows some kind of optimization - e.g., if your keys and values are sorted so that the order is the same (like pairs (1,3), (2,4), (3,5) ...). In this case you can implement, for example, a binary search algorithm over a part of array and beat SQLite performance. | [reply] |
|
Re: Bidirectional lookup algorithm? (Updated: further info.)
by herveus (Prior) on Jan 05, 2015 at 19:20 UTC | |
Can you afford the memory footprint? If you have the RAM to spare, you should be getting O(1) lookups. On a Solaris box, I took your sample code and added another letter to get to most of 12 million entries for a memory footprint that was up in the 4 gigabyte class (3738M). I'd expect that the hash implementation will be pretty time efficient compared to the alternatives, and it sounds like that's the main driver (assuming the RAM needs are not limiting).
yours, Michael | [reply] |
by BrowserUk (Patriarch) on Jan 05, 2015 at 20:08 UTC | |
The current hash lookups are very fast -- possibly as fast as I could hope to get -- the problem I'm trying to address is the memory consumed by the duplication of keys & values. As I said in the OP: "so the priority is the lookup speed -- preferably O(1) though O(logN) might be acceptable if the memory reduction was sufficient." With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
|
Re: Bidirectional lookup algorithm? (Updated: further info.)
by polettix (Vicar) on Jan 05, 2015 at 22:37 UTC | |
And yes, I might be totally off track! perl -ple'$_=reverse' <<<ti.xittelop@oivalf Io ho capito... ma tu che hai detto? | [reply] |
by Anonymous Monk on Jan 05, 2015 at 23:15 UTC | |
A simple perl program adds 1mil key/value pairs in a hash in in 2.6875 seconds on a laptop from 2006 OP doesn't need concurrency , so redis not needed | [reply] [d/l] |
by polettix (Vicar) on Jan 10, 2015 at 22:15 UTC | |
Any benchmark should probably consider both aspects. perl -ple'$_=reverse' <<<ti.xittelop@oivalf Io ho capito... ma tu che hai detto? | [reply] |
by BrowserUk (Patriarch) on Jan 10, 2015 at 23:07 UTC | |
by polettix (Vicar) on Jan 12, 2015 at 09:17 UTC | |
by locked_user sundialsvc4 (Abbot) on Jan 05, 2015 at 23:06 UTC | |
With due deference to Brian, I must be candid in expressing my extreme doubts as to whether “an entirely separate server” could possibly be apropos in the present case. (I understand the situation that he is speaking to, and I don’t think that the present situation qualifies.) To my admittedly simple-minded way of thinking, the present problem is a problem that quite-obviously could quite-readily be solved by the use of two in-memory indexes to an in-memory data structure ... a-n-d / o-r by two indexes on an SQL table. To arrive at an appropriate solution to this problem, we do not need to climb the mountain of exotica. We can satisfy ourselves just as well with the perfectly mundane. The only decision that we must now arrive-at is this: “exactly what are the ‘ruling constraints’ to this problem?” Then, “what is the simplest solution that will satisfy it?” The only constraint that has so-far presented itself is ... the availability of RAM. However, the validity of this constraint as-presented rests entirely upon the supposition that “the entire data-structure must reside completely in (physical(!) ...) RAM.” If this were the case, then the answer would consist of simple arithmetic. However, I perceive that this is not the only practicable solution. This problem can be solved in a satisfactory manner regardless of memory-size. (We’ve been managing such things for fifty-plus years.) “In-memory” approaches are unquestionably fast ... i-f “as much memory as one might desire” is, in fact, available ... on production machines as well as developer boxes ... “without delay.” In the real world, alas, this is not the case, and therefore we are obliged to resort more-or-less to files. (And, once we do that, our physical-RAM working-set requirements are dramatically reduced ... thereby re-defining the problem entirely.) The [only (!) ... ] “in-memory” solution to this problem is actually trivial: “you must define two indexes to the data-structure, in addition to the data-structure itself, and therefore you must possess enough RAM to simultaneously hold all three and to access all three of them without any fear of virtual-memory paging delays. Either you do have such luxury ... o-r ... you don’t. Nothing more need be debated nor discussed. If “an embarrassment of RAM-riches” cannot be 100% counted-upon, then one is pushed rather-decisively toward the use of a database ... and thus upon an approach that must regard every query to that structure as being “more or less costly.” The “cost-equals-zero proposition” of a pure-RAM solution is eliminated ... as are all of ts theoretical advantages. “Welcome to the Real World ...” | |
by BrowserUk (Patriarch) on Jan 06, 2015 at 00:46 UTC | |
“exactly what are the ‘ruling constraints’ to this problem?” The ruling constraints are clearly and concisely laid out in the OP; you're either too lazy to read them properly; or too dumb to understand them. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
| |
|
Re: Bidirectional lookup algorithm? (Updated: further info.)
by flexvault (Monsignor) on Jan 06, 2015 at 00:27 UTC | |
I came into this late, but I think using Perl's great 'index' function may give you the desired results. I copied your scripts to get a start, and then used a simple scalar to build the 'Bidirectional lookup' functions. I'm sure they can be improved! Here goes: I didn't build my $Both scalar from the raw input, so that I could verify I did it correctly, but $Both doesn't need to be in any order for this to work. Please note the subs return 'undef' if the key or address is not in $Both. Also I would use 'pack' on the address, but that may or may not save space depending on the actual data supplied. Regards...Ed "Well done is better than well said." - Benjamin Franklin | [reply] [d/l] |
by BrowserUk (Patriarch) on Jan 06, 2015 at 02:17 UTC | |
If you look at my code in Re^2: Bidirectional lookup algorithm? (Updated: further info.), using a hash does lookups by name in: 0.976562/3655808 = 0.000000267 secs; and by addr in 1.065514/3655808 = 0.000000291.
And using an in-memory sqlite DB does them ~30 times slower at: by name in: 30.339751/3655808 = 0.0000083; and by addr in: 30.587817/3655808 = 0.0000084 for a memory reduction of 5/6ths.
And using a disk-based sqllite db does them ~57 times slower at: by name in: 6.867170/456976 = 0.0000150; and by addr in: 6.949115/456976 = 0.0000152 seconds for a memory reduction of ~3/5ths:
The (single, and possible non representative depending upon where in the strings the chosen keys are (beginning middle or end)) timings you've posted are 800x slower again than the memoryDB, or nearly 5000 times slower than the hashes. The net result would be that my current 8 hour runs would require 4.5 years. And that's before I added the extra data that the space reduction would permit. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
by flexvault (Monsignor) on Jan 06, 2015 at 11:10 UTC | |
When programming my database engine, I found that 'index' did best with smaller scalars ( reason for mentioning 'database sharding' ). My times were close to the worst case ( worst case is 'not found' ). I found that with a 8-byte key, the 'index' technique I used was 4.7 times faster searching 512 byte records than searching 1,024 byte records. Naturally, 'index' searching a 5MM+ byte record is going to be real slow! My attempt was to give you another approach, not a final solution. Good Luck...Ed "Well done is better than well said." - Benjamin Franklin | [reply] |
|
Re: Bidirectional lookup algorithm?
by QM (Parson) on Jan 08, 2015 at 13:31 UTC | |
Does it buy anything to create a hash with keys of some known hash function, and values as indexes pointing into an array? There's one more lookup, but into an array, which should be much cheaper than another hash lookup.
Without a real hash function, this gives:
...which is only slightly smaller than your original 2-hash result. Mind you, I'm on a 32-bit perl, so that could be the difference. (Caveats: I've not used a real hash function. I've not accounted for hash collisions.) -QM | [reply] [d/l] [select] |
by BrowserUk (Patriarch) on Jan 08, 2015 at 15:26 UTC | |
Essentially, that is what I am doing. One array of structs contains the paired data. This is ordered by hashing one of the values. A parallel array of ints contains indexes into the first array and are ordered by hashing the second value, Once I've fixed a couple of edge cases I know about, tested it some more, and probably moved it from linear to hopscotch probing, I'll post the code in this thread. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
|
Re: Bidirectional lookup algorithm? (Updated: further info.)
by locked_user sundialsvc4 (Abbot) on Jan 05, 2015 at 13:34 UTC | |
I’m not sure that I am hearing all of the potential decision factors here. How much memory does this machine have; how much memory can your application be assured of having access to without paging; how many keys will there be? I would, as soon as possible, construct a worst-case test case and run it on a production machine. Random data, realistic volume, and most importantly, realistic request-distributions. “In-memory” solutions look very attractive on developer machines and at less-than-production loads. But virtual memory really should be looked-upon as a disk resource, as well as a preemptive-priority demand for another limited resource (RAM). Solutions that were “very efficient” on developer machines can become 250,000-kilo elephants with enormous working-set sizes. They apply excessive pressure on other processes, and suffer the most from the pressure that they, in effect, exert most upon themselves. When they fall, they fall hard. On the other hand, a “file-based” solutions are steady. All disk-I/O is routinely buffered, and operating systems will stuff otherwise-unused memory with caches. When memory pressure appears, they are the first to go, but unless it does appear, they’re nearly as efficient as memory, especially when a file is memory-mapped. (Then, the data is mapped using page/segment tables but it is not treated as high-priority backing store.) When you treat the resource as you would treat a file, e.g. requesting data in chunks, they hold-up even better under stress. My gut tells me that a traditional database, e.g. SQLite, without specifying a memory-mapped file, will turn out to be the most efficient solution for you ... especially if you can request the data that you need in groups of a few hundred keys at a time. While it might not be “fastest” in the short run on your dev box, “steady, steady wins the race.” | |
| A reply falls below the community's threshold of quality. You may see it by logging in. | |