Re: Bidirectional lookup algorithm? (Solution.)

by BrowserUk (Patriarch)
on Jan 11, 2015 at 18:15 UTC


in reply to Bidirectional lookup algorithm? (Updated: further info.)

Okay. I have my first pass solution completed, and here (as promised) is the code (in the form of an Inline::C script for RAD and testing), and a brief description.

Description:

The code presents a class, BiMap, which supports the following methods:

  • my $bm = BiMap::new( (U32)$initSize, (double)$factor );

    initSize is the initial size the tables are allocated to.

    (Fill) factor is a double representing the proportion of the table that will fill before it is resized to double its current size.

  • $bm->add( (U64)$addr, (char*)$name );

    Add a paired integer and string to the BiMap.

    (Currently) returns the number of (additional) probes required to add the pair to both tables. (For debugging and tuning purposes.)

  • my $addr = $bm->findByStr( (char*)$name );

    Lookup an integer from its paired string.

  • my $name = $bm->findByInt( (U64)$i );

    Lookup a string from its paired integer.

  • Some utility/debug methods that may or may not persist into the final version:
    • my $size = $bm->size;

      (Getter only!)

    • my $used = $bm->used;

      (Getter only!)

    • my $factor = $bm->factor;

      (Getter only!)

    • $bm->dump( (BOOL)$flag );

      Dumps the BiMap to stdout. (Should probably be to stderr, but attempting to use stderr from I::C causes segfaults.)

      If flag is false only dumps the header rather than the full table.

The basic structure of the BiMap contains pointers to two parallel arrays: byInt[], of PAIR structures; and byStr[], of U32 integers:

typedef struct {
    U64    addr;
    char  *name;
} PAIR;

typedef struct {
    PAIR  **byInt;
    U32    *byStr;
    U32     size;
    U32     used;
    double  factor;
} BIMAP;

The PAIRs are inserted into byInt[] hashed by the (U64)addr in the PAIR.

The byStr[] holds indexes (+1 to allow 0 to represent empty), of the PAIRs, hashed by the (char*)name in the PAIR. Using indexes into byInt[] rather than address of the pairs themselves saves half the space at the cost of an extra indirection.

As the large majority of the names will be less than 8 characters, in that case the char* name component of the PAIR is used to hold the string directly, rather than a pointer to it, thus saving another 8 bytes per compliant PAIR. The high bit of the 8th byte is set to distinguish between directly stored strings and addresses of strings.
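
To make that packing concrete, here is a minimal sketch of the idea, assuming a little-endian 64-bit build (as in the testing here); the helper names are illustrative and are not taken from the actual BiMap source:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint64_t U64;
typedef struct { U64 addr; char *name; } PAIR;

/* Names short enough to fit are stored in the bytes of the 'name'
 * field itself; the high bit of the 8th byte flags the inline case
 * (user-space heap pointers leave that bit clear on x64). For
 * simplicity this sketch only inlines up to 6 characters so the NUL
 * terminator always fits; the real code squeezes in 7 by treating
 * the flag byte specially. */
enum { INLINE_FLAG = 0x80 };

static void pair_set_name( PAIR *p, const char *s ) {
    size_t len = strlen( s );
    if( len <= 6 ) {
        unsigned char buf[8] = { 0 };     /* zero-filled => NUL-terminated */
        memcpy( buf, s, len );
        buf[7] |= INLINE_FLAG;            /* mark as inline                */
        memcpy( &p->name, buf, sizeof buf );
    }
    else {
        p->name = strdup( s );            /* longer names live on the heap */
    }
}

static const char *pair_get_name( const PAIR *p ) {
    const unsigned char *b = (const unsigned char *)&p->name;
    return ( b[7] & INLINE_FLAG ) ? (const char *)&p->name : p->name;
}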

I tested several combinations of a) various hash functions and b) various probing algorithms; and in my (simplistic) testing, no combination beat the performance of the (Jenkins) one-at-a-time hash function formerly used by Perl, together with the very simple +1 probing mechanism.
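
For reference, the one-at-a-time hash is tiny. A sketch of it, together with the +1 (linear) probing loop over a power-of-two table, is shown below; this is illustrative only, and the slot layout differs from the real byStr[] table, which stores 1-based indexes into byInt[]:

#include <stdint.h>
#include <string.h>

typedef uint32_t U32;

/* Bob Jenkins' one-at-a-time hash, as formerly used by Perl. */
static U32 oaat_hash( const char *key, size_t len ) {
    U32 h = 0;
    size_t i;
    for( i = 0; i < len; ++i ) {
        h += (unsigned char)key[i];
        h += h << 10;
        h ^= h >> 6;
    }
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h;
}

/* +1 (linear) probing over a power-of-two table: step to the next slot
 * until the key is found or an empty slot is reached. */
static U32 find_slot( char **table, U32 size, const char *key ) {
    U32 mask = size - 1;                        /* size is a power of two     */
    U32 i = oaat_hash( key, strlen( key ) ) & mask;
    while( table[i] != NULL && strcmp( table[i], key ) != 0 )
        i = ( i + 1 ) & mask;                   /* +1 probe, wrap at the end  */
    return i;                                   /* slot holding key, or empty */
}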

Results:

Using a single, combined Perl hash to look up 11 million strings from their associated 64-bit integer address, and vice versa, takes a total of 15.9 seconds and requires 4.17 GB:

C:\test>1112165-hash
Start: mem: 7,920 K
23762752
Hash built: mem: 4,382,828 K
Fetch by Name took 7.539046 seconds
Fetch by Addr took 8.349479 seconds
Hash queried: check mem: 4,382,852 K

Using a BiMap on the same dataset, with a table presized to accommodate 16 million pairs and a 0.71 fill factor, takes 22.3 seconds and requires just over 1/2 GB:

C:\test>BiMap -S=5 -F=0.71 -I=12000000 -D=0
bless(do{\(my $o = 52848976)}, "BiMap")
Fetch by Addr & Fetch by Name took 22.257803 seconds for 11881376 items in a 16777216 sized BiMap
Mem: 580,552 K

Moving to a fill factor of 0.70 and pre-sizing the arrays to accommodate 33 million pairs takes 19.8 seconds and requires just over 3/4 GB:

C:\test>BiMap -S=5 -F=0.70 -I=20000000 -D=0
bless(do{\(my $o = 52848976)}, "BiMap")
Fetch by Addr & Fetch by Name took 19.896479 seconds for 11881376 items in a 33554432 sized BiMap
Mem: 777,532 K

Conclusion:

Using a BiMap in preference to a Perl hash will allow 8 times as much data to be processed per run, thus reducing the number of runs required whilst increasing the statistical validity of the results by well over 8 times.

Size for size, the total runtime of the tests will increase by approximately 40%. A fair penalty for the substantial increase in statistical validity.

Other options:

  • It seems possible that Judy arrays would reduce the space requirement further and (possibly) increase the performance. Just a shame that the Build mechanism is broken.

    I'm also considering trying a minimal radix tree/Trie implementation to see how it compares (see the sketch after this list). It won't achieve the space reduction of a Judy array, but as it automatically 'compresses' any prefix overlap in the key sets, it might be competitive and should be fast, as it avoids both hashing and some indirection.

  • Perfect hashing/Minimal perfect hashing seems like it might both reduce the space requirement and speed up the lookups.

    However, generating such functions is non-trivial if you start from scratch; and life's too short to try to extract useful information from the SourceForge website for the CMPH project.

    The examples suggest that it will only take its inputs from disk, which is a pain: in this case the PAIRs are generated entirely in memory, and writing them out to disk only to have them read back in would negate any gains.

    Of course, it could be that there is an API for supplying the keys directly from memory, but as there doesn't appear to be any API documentation, nor a way to view the individual files via SF, who knows.

    In addition, I work in the Windows world, and experience tells me that it would probably take longer to work out how to build the library on Windows than to re-write it from scratch. All of these OSS projects seem to have interminable dependencies upon autoconf or cmake or any of a dozen other variations of build tools that either don't work on Windows, or if they do, themselves have half a dozen more dependencies, some or all of which don't.

  • Other/"better" hashing functions and/or probing algorithms.

    Having tried several of each, there seems to be no particular benefit for my application data without moving to separate chaining, which incurs the dual penalties of extra complexity and extra memory, taking things back in the direction of the well-tuned generality of the Perl implementation.
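
As a rough illustration of the trie idea mentioned in the first bullet above: a plain 26-way trie (rather than a compressed radix tree), handling only the name-to-addr direction, and assuming lowercase a-z keys like the 26^5 test set. Shared prefixes are stored only once:

#include <stdint.h>
#include <stdlib.h>

typedef uint64_t U64;

/* One node per distinct prefix; keys sharing a prefix share nodes. */
typedef struct TrieNode {
    struct TrieNode *child[26];     /* one slot per letter 'a'..'z' */
    U64              addr;          /* value, valid when terminal   */
    int              terminal;
} TrieNode;

static TrieNode *node_new( void ) {
    return (TrieNode *)calloc( 1, sizeof(TrieNode) );
}

static void trie_insert( TrieNode *root, const char *key, U64 addr ) {
    TrieNode *n = root;
    for( ; *key; ++key ) {
        int c = *key - 'a';
        if( !n->child[c] ) n->child[c] = node_new();
        n = n->child[c];
    }
    n->terminal = 1;
    n->addr     = addr;
}

static int trie_find( const TrieNode *root, const char *key, U64 *addr ) {
    const TrieNode *n = root;
    for( ; *key; ++key ) {
        n = n->child[*key - 'a'];
        if( !n ) return 0;
    }
    if( !n->terminal ) return 0;
    *addr = n->addr;
    return 1;
}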

The code:


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

Re^2: Bidirectional lookup algorithm? (Judy)
by tye (Sage) on Jan 11, 2015 at 22:08 UTC
    It seems possible that Judy arrays would reduce the space requirement further and (possibly) increase the performance. Just a shame that the Build mechanism is broken.

    Since Judy seemed the most promising approach in my personal opinion, I did take a quick look at the complexities of getting the Perl module working under Windows. There seem to be at least 3 ways to build the Judy library itself under Windows:

    1. Cross-compile it on Unix.
    2. Configure it by hand and install GNU Make (and maybe gcc) to build it.
    3. Use the included *.bat file and build it using Visual Studio.

    The last choice seems, by far, the easiest. The code included in Alien::Judy even includes the *.bat file. But since I didn't already have Visual Studio, I stopped looking at that point.

    I didn't reply earlier, as what I discovered was easily found, so I guessed you'd either find the same thing soon enough or wouldn't be that interested in using Judy.

    Your node above (thanks for sharing your implementation) makes me think you may still be interested in Judy and may not yet have found the *.bat file. So that little bit of information may encourage you and/or just save you some small bit of research.

    The next steps (getting Judy to not fail to build just because Alien::Judy isn't installed and getting the build process for Judy to find the needed include and library files) seem to me to likely be fairly simple.

    I may still eventually get around to jumping through those hoops (though, to be honest, there are a lot of higher-priority items on my to-do list right now). If I do, I'll post more details. If you do, please also post what you find.

    Thanks.

    - tye        

      I finally managed to build a 64-bit version of Judy.

      I started with this one-file/1250 line version and hacked out all the -DSTANDALONE and -DASKITIS stuff along with all the BIG_ENDIAN stuff; extracted a Judy.h; and got the file down to 965 lines and built it into a dll:

      C:\test\Judy>cl /W3 /Ot /favor:INTEL64 /MT /LD Judy.c
      Microsoft (R) C/C++ Optimizing Compiler Version 15.00.21022.08 for x64
      Copyright (C) Microsoft Corporation.  All rights reserved.
      Judy.c
      Microsoft (R) Incremental Linker Version 9.00.21022.08
      Copyright (C) Microsoft Corporation.  All rights reserved.
      /out:Judy.dll
      /dll
      /implib:Judy.lib
      Judy.obj
         Creating library Judy.lib and object Judy.exp

      I then wrote a C program to use it to create two Judy arrays and store my test data 'aaaaa'..'zzzzz' paired with 64-bit integers:
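
      A minimal sketch of the two-array approach (not the actual test program), using Judy's documented JSLI/JSLG (string-keyed) and JLI/JLG (word-keyed) macros:

      #include <stdio.h>
      #include <stdint.h>
      #include <string.h>
      #include <Judy.h>

      int main( void ) {
          Pvoid_t byName = (Pvoid_t) NULL;   /* JudySL: name -> addr */
          Pvoid_t byAddr = (Pvoid_t) NULL;   /* JudyL : addr -> name */
          PWord_t PValue;
          uint8_t key[8];
          Word_t  addr = 12345;

          /* Insert one pair; the real test looped over 'aaaaa'..'zzzzz'. */
          strcpy( (char *)key, "aaaaa" );
          JSLI( PValue, byName, key );              /* name -> addr */
          *PValue = addr;
          JLI( PValue, byAddr, addr );              /* addr -> name */
          *PValue = (Word_t)strdup( (char *)key );

          /* Look up in both directions. */
          JSLG( PValue, byName, key );
          if( PValue ) printf( "%s -> %lu\n", (char *)key, (unsigned long)*PValue );
          JLG( PValue, byAddr, addr );
          if( PValue ) printf( "%lu -> %s\n", (unsigned long)addr, (char *)*PValue );

          return 0;
      }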

      built it against the dll:

      C:\test\Judy>cl /W3 /Ot JudyTest.c Judy.lib
      Microsoft (R) C/C++ Optimizing Compiler Version 15.00.21022.08 for x64
      Copyright (C) Microsoft Corporation.  All rights reserved.
      JudyTest.c
      Microsoft (R) Incremental Linker Version 9.00.21022.08
      Copyright (C) Microsoft Corporation.  All rights reserved.
      /out:JudyTest.exe
      JudyTest.obj
      Judy.lib

      A run:

      C:\test\Judy>JudyTest.exe aaaaa
      Check memory:
      Bidi lookup of 11881376 pairs took: 6.325 seconds
      Check memory: 524,332k

      Then I built it as an Inline::C module, adding method wrappers for the important functions:

      Unfortunately, in this form, the runtime increased -- mostly, I think, due to the perl->C->perl transitions -- from 6.5 seconds to over 25s:

      C:\test\Judy>perl Judy.pm -ODO=aaaaa
      Memory before building Judy:  10,760 K
      Memory after building Judy:  347,660 K
      Bidi lookups of 11881376 pairs took: 25.197204113 seconds
      Memory after lookups: 347,680 K

      So, whilst it does use somewhat less memory than my BiMap version, it is also somewhat slower.


        I finally managed to build a 64-bit version of Judy

        For the record, I finally managed to build a static 64-bit library, too. (Credit to BrowserUk and oiskuu.)
        oiskuu suggested that my problems with non-constant expressions arose because the compiler chose not to fold constant expressions where undefined behaviour is involved - and that I should use (Word_t)0x100 instead.
        Doing that allowed the build to proceed, and 'make check' passed its tests.
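
        For context, the sort of construct involved looks like this (illustrative only, not the actual Judy source): a shift count that reaches the width of the type is undefined behaviour, so GCC both warns and refuses to fold the expression into a compile-time constant, which is exactly the pairing of "left shift count >= width of type" warnings and "initializer element is not constant" errors quoted elsewhere in this thread.

        #include <stdint.h>
        typedef uint64_t Word_t;

        /* Masks built by shifting; fine while the shift count stays below
         * the width of Word_t, undefined behaviour once it reaches it.   */
        #define MASK(n)  ( ( (Word_t)1 << ( 8 * (n) ) ) - 1 )

        static Word_t masks[] = {
            MASK(1), MASK(2), MASK(7),    /* shift counts 8..56: constant-folded  */
         /* MASK(8), */                   /* shift count 64: UB, not folded, so a
                                             static initializer fails to compile  */
        };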

        However, I was still seeing a number of "left shift count >= width of type" warnings, and I don't know how thorough the test suite (which completes very quickly) is.
        I therefore don't have a lot of confidence that the library is going to behave reliably in the real world.

        With the Judy-0.41 patches that anonymonk posted I was able to successfully build the module on 32-bit perls.
        However, the Judy module that I built against the 64-bit library loaded ok, but kept crashing during the running of its test suite.
        In Judy.xs, I changed the hard coded "long" casts to "long long" casts. That got rid of lots of warnings during the build, but the crashing persisted.
        The generated Judy.c file still contained some casts to "unsigned long int", but I couldn't immediately find what was generating those casts - at which point I totally lost interest.

        Cheers,
        Rob
        C:\citrusperl\mingw\msys\bin\sh.EXE configure --enable-static --disable-shared prefix=C:/dev/judyjudy/judyprefix
        C:\citrusperl\mingw\msys\bin\make.EXE install


        Have you built a 64-bit Windows build of Judy-1.0.5 ?
        I had no trouble with 32-bit Windows builds, but with the 64-bit build I get lots of the following warnings:
        warning: right shift count >= width of type [enabled by default]
        and
        warning: left shift count >= width of type [enabled by default]
        Pretty early on, make reports the following errors:
        JudyLInsArray.c:137:5: error: initializer element is not constant
        JudyLInsArray.c:137:5: error: (near initialization for 'subexp_mask[5]')
        JudyLInsArray.c:138:5: warning: left shift count >= width of type [enabled by default]
             ~cJU_POP0MASK(6),
             ^
        JudyLInsArray.c:138:5: error: initializer element is not constant
        JudyLInsArray.c:138:5: error: (near initialization for 'subexp_mask[6]')
        JudyLInsArray.c:139:5: warning: left shift count >= width of type [enabled by default]
             ~cJU_POP0MASK(7),
             ^
        JudyLInsArray.c:139:5: error: initializer element is not constant
        JudyLInsArray.c:139:5: error: (near initialization for 'subexp_mask[7]')
        which, of course, soon leads to:
        make[3]: *** [JudyLInsArray.lo] Error 1
        make[3]: Leaving directory `/c/_64/comp/judy-1.0.5/src/JudyL'
        make[2]: *** [all-recursive] Error 1
        make[2]: Leaving directory `/c/_64/comp/judy-1.0.5/src'
        make[1]: *** [all-recursive] Error 1
        make[1]: Leaving directory `/c/_64/comp/judy-1.0.5'
        make: *** [all] Error 2
        I had a quick but unsuccessful attempt to google up something about these errors.
        Do you happen to know what the fix is ?

        Cheers,
        Rob
        ... ought to work on strawberry provided you have msys ...

        Thanks for your efforts.

        Unfortunately, getting a completely different version and flavour of perl + mingw + msys into my customer's set up is a fight I just don't need.


Re^2: Bidirectional lookup algorithm? (Solution.)
by bitingduck (Chaplain) on Jan 24, 2015 at 06:50 UTC

    Perfect hashing/Minimal perfect hashing seems like it might both reduce the space requirement and speed up the lookups.

    However, generating such functions is non-trivial if you start from scratch...

    This may not be as horrible as it sounds if your code is going to get a lot of use. Minimal Perfect Hashing looks like the only way you're going to get O(1) on lookups, and if you're going to be getting bigger and bigger data sets, the time up front starts to look more cost effective. A bit of digging around found a real algorithm that someone actually implemented in C and tested, designed for use with both integer keys and character keys. Unfortunately I've only found the original paper and a later one with pseudocode, nothing quite canned.

    The paper with pseudocode is here: Czech, Havas, Majewski, and an earlier paper that's a little denser to interpret is here: Havas, Majewski. This page: Schnell gives a pretty readable interpretation of it as well. It doesn't look too painful to implement.
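
    For a flavour of what CHM produces, here is a sketch of the lookup side only (all names are illustrative). The real work is the construction step, which picks hash functions at random until the graph whose edges are (h1(key), h2(key)) is acyclic, then walks that graph assigning g[] so each key's two values sum to its unique index:

    #include <stdint.h>

    #define NVERT 123u                /* ~2.09 * number_of_keys, per the papers  */

    static uint32_t g[NVERT];         /* filled in by the (omitted) construction */

    static uint32_t h( uint32_t seed, const char *key ) {
        uint32_t v = seed;
        while( *key ) v = v * 31u + (unsigned char)*key++;   /* any simple hash  */
        return v % NVERT;
    }

    /* Maps each key of the original set to a distinct slot in 0..n-1,
     * with no collisions and no empty slots. */
    static uint32_t mph( const char *key, uint32_t n ) {
        return ( g[ h( 0x9747b28cu, key ) ] + g[ h( 0x85ebca6bu, key ) ] ) % n;
    }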

    I also found what looks like a similar algorithm, which appears to include bi-directionality, implemented in a Python package called pyDAWG that you could either call or port.

    A little more digging around on the relevant names might find you some C code that does about the same.

    EDIT: allegedly there's a Java package floating around called "ggperf" that's an implementation of the CHM algorithm (and is supposed to be faster than the "gperf" minimal perfect hash generator that's part of libg++), but I couldn't find source, just a paper that amounts to documentation for ggperf.

      bitingduck, thank you for your research. The links provided fascinating (if at times somewhat bewildering :) reading.

      In particular the Schnell article made the CHM algorithm understandable; and the pyDAWG code is relatively concise and understandable.

      I think I understand the process (if not the implementation) enough to see a problem (for my application).

      Given that my datasets are generated at runtime, and the goal is to be able to make them as large as available memory will allow (but no larger), it would be necessary to run the CHM algorithm within the same process, and after the dataset has been generated.

      The problem is that, even using the most parsimonious implementation of a DAG, the combined size of the node and edge structures (with their required pointers) requires substantially more memory per string-number pair than the data itself. Using the pyDAWG implementation as an example: each pairing requires two DAWGNodes at 18 bytes each; and each word requires one DAWGEdge struct per letter (say, on average, 5*9 + 8*9 = 117 bytes).

      Thus a quick estimate is that for each 13 bytes of data (5-byte string + 8-byte integer), the algorithm would require a minimum of 117+38 = 155 bytes of structs * no_of_pairs in order to build the DAG.



        It's interesting how much extra space it takes to build the structure, and possibly part of why such algorithms aren't already coded up in a lot of libraries, given that the CHM algorithm dates back to 1992. I was also curious whether the setup time would be too much - it wasn't clear that it really would do the generation in O(M+N) time with real data sets.

        It was interesting research -- I read through the whole thread and got curious why there aren't more canned packages that do minimal perfect hashing, given what the value could be for some modern applications. I'm still not sure I completely understand why there aren't (though it's probably the memory cost vs. time cost). Then it kind of took me down a rabbit hole of interesting reading. My first inclination was to do something with trees and indexed lists of child nodes -- it's more space-intensive than a pair of hashes, but probably a little faster than a binary search.

Re^2: Bidirectional lookup algorithm? (poor XS solution)
by shmem (Chancellor) on Jan 11, 2015 at 20:27 UTC

    That line below your signature is somewhat enigmatic:

    The code present a class:BiMap, which supports the following methods: 7 ) { strncpy( (char*)( iHash script for RAD and testing), and a brief description. , CLEAN_AFTER_BUILD =used, newSize ); for( i = 0; i = pair; bm--1

    I've tried setting up a bidirectional hash using XS code. It makes both keys and values into keys, and misuses the IV slot of a hash entry's value SV to store a pointer to the crossover hash key.

    However, the benchmarks are disappointing:

    use blib;
    use Hash::Bidirectional;
    use Benchmark qw(cmpthese);
    use Devel::Size qw(total_size);
    use strict;
    use warnings;

    my $hn = {};
    my $hb = bless {}, 'Hash::Bidirectional';
    my $i  = 0;

    for ( 'aaaaz' .. 'zzzzz' ) {
        $i++;
        $hn->{$_} = $i;       # - normal hash
        $hn->{$i} = $_;       # /
        $hb->pair( $_, $i );  # - bidirectional hash
    }

    print "size normal hash: ", total_size($hn), "\n";
    print "size bi_dir hash: ", total_size($hb), "\n";

    cmpthese( 10000000, {
        normal => sub { $hn->{ $hn->{shmem} } },
        bi_dir => sub { $hb->get( $hb->get('shmem') ) },
    });

    __END__
    size normal hash: 3025679168
    size bi_dir hash: 2360323536
                Rate bi_dir normal
    bi_dir 2801120/s     --   -62%
    normal 7407407/s   164%     --

    An improvement of 22% in memory, but more than double the time for lookup? Why? Is it due to method call/XS overhead? XS code below.

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
      That line below your signature is somewhat enigmatic:

      It's an automatic appending of a collection of random snippets from the body of the post. It happens quite often when I post here. I try to remember to check for them and remove them; but sometimes I forget.

      I'm told (by the PMDevs) that it is due to the browser I use -- Opera -- but the fact that it never happens on any other site I use makes me think it is more to do with PM than Opera; but it's always easier to blame someone else's tools than to fix your own code.

      more than double the time for lookup? Why? Is it due to method call/XS overhead?

      I would assume so. Any XS code is always going to be at a disadvantage compared to built-ins; that's inevitable. Which means you really have to pare everything down to even begin to compete.



        I'm told (by the PMDevs) that it is due to the browser I use -- Opera -- but the fact that it never happens on any other site I use makes me think it is more to do with PM than Opera; but it's always easier to blame someone else's tools than to fix your own code.

        :) do you have any JS in your free nodelet?

        I've been using PerlMonks since 2001 and this appending of random snippets hasn't happened to me, but then I don't use Opera :/
