comment on

Introduction

It's commonly known that a hash is an appropriate datastructure if you want to store and modify a set of strings, and you want to perform matching queries, that is, quickly answer the question whether something is an element of the stored set.

There are some drawbacks about hashes. Accessing a key takes time, as we need to calculate the hash function. Furthermore, many elements could hash to the same address, which causes chains, and which slow down queries (and deletes). Furthermore, 'clearing' a hash takes time linear in the size of the stored set. (Some of the problems have been fixed in recent perls, but with some overhead when inserting elements). Implementation is also non-trivial.

If we know something about the set we are storing, we can sometimes use datastructures that are faster. For instance, suppose we know that all possible elements of the set are non-negative integers smaller than a fixed number U. We can then build an array (or a vec string) with U elements, say A, and we let i be a member of our set if and only if A [i] == 1. Otherwise, A [i] == 0. Using simple code, we have something like:

    use constant U => ....;
    sub init   {@A = (0) x U}
    sub insert {$A [shift] = 1}
    sub delete {$A [shift] = 0}
    sub query  {$A [shift]}
    sub clear  {@A = (0) x U}
[download]

It's easy to see that inserting, deleting and querying all take O (1) time. Unfortunally initializing or clearing the structure takes time &Theta (U).

A special datastructure

But we can do better. Consider the following arrays A and B, and an integer C:

       A              B

    +------+       +------+
U-1 |  ?   |       |  ?   |
    +------+       +------+
    .      .       .      .
    .      .       .      .
    .      .       .      .
    .      .       .      .
    .      .       .      .
    +------+       +------+
 4  |  0   |       |  3   |
    +------+       +------+
 3  |  4   |       |  1   |  <---- C (== 3)
    +------+       +------+  
 2  |  1   |       |  1   |
    +------+       +------+
 1  |  2   |       |  2   |
    +------+       +------+  
 0  |  2   |       |  4   |
    +------+       +------+
[download]

Now, a number n is in our query set, if and only if, all of the following conditions hold:

0 <= A [n] < C
B [A [n]] == n

So, the above set contains 1, 2 and 4. It doesn't contain 0 because A [0] == 2, and B [2] == 1. It doesn't contain 3, because A [3] >= C.

Now, how do we insert? Suppose we want to insert a number i. If it's already in the set (that is, the conditions mentioned above are fullfilled), nothing needs to be done. Otherwise, we store i in B [C], we store C in A [i], and we increment C.

Deletion is a bit trickerier, but still straightforward. Suppose we want to delete i. We do that by moving B [C - 1] to B [A [i]], as follows: first B [A [i]] = B [C - 1], then A [B [C - 1]] = A [i]. And finally we decrement C.

In code:

    int query (int i)   {0 <= A [i] && A [i] < C && B [A [i]] == i}
    void insert (int i) {if (!query (i)) {B [C] = i; A [i] = C; C ++}}
    void delete (int i) {if (query (i)) {
                             B [A [ i]] = B [C - 1];
                             A [B [ C - 1]] = A [i];
                             C --}}
[download]

Clearing the set is simple, all one needs to do is to set C to 0.

So, we can do querying, inserting, deleting, and clearing in O (1) time. Not bad. But what about initializing? Well, that can be done in constant time as well. At least, in low level languages. The key is that we start off with C being 0, so no matter what is already in A or B, it never causes elements to be present. So, we do not need to initialize the arrays. As long as we can claim memory of appropriate size.

Hence, we can do all five operation in constant time - as long as the programming language doesn't insist on initializing.

Challenge

Can we create a similar datastructure in Perl? XS of course doesn't count.

Abigail

In reply to Data structure challenge by Abigail-II

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.