baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I'll go strait to the point. I am looking for a solution (not the code) to encode 1000 integers (positive), spanning from 1-10^9 into integers from 1-1000 so I can place them into a simple array data structure. Therefore, in order to avoid reinventing a wheel I was wondering does anyone know a solution to this problem and is willing to share?

An obvious approach would be sorting the numbers first and then assigning them to array positions according to their rank values. However, sorting only can be an option if the key size (like in counting sort) is less or equal to the array size, or sorting itself is linear in time, proportional to the array size. I cannot think of any such solution, so here I am?

thank you

Replies are listed 'Best First'.
Re: Encoding function needed
by atcroft (Abbot) on Feb 08, 2015 at 10:53 UTC

    My first thought would be the log function (log(1) = 0, and log(1e9) = 20.7232658369464). Otherwise, you have 1000 ranges of size n .. n + ( 999_999 ), so you could just divide by 1_000_000.

    Hope that helps.

    Update: 2015-02-09
    Thanks to choroba for catching my error in linking to the log function documentation.

Re: Encoding function needed
by roboticus (Chancellor) on Feb 09, 2015 at 12:05 UTC

    baxy77bax:

    You might consider simply using a hash. You don't specify any requirements on how the mapping is done, so it's possible you don't actually care about the actual number you're mapping to. If so, then there's little difference between:

    my ($idx, @array, $a_big_num, $some_data); # Store the data $idx = map_func($a_big_num); $array[$idx] = $some_data; # Fetch the data $idx = map_func($a_big_num); $some_data = $array[$idx];

    and

    my (%hash, $a_big_num, $some_data); # Store $hash{$a_big_num} = $some_data; # Fetch $some_data = $hash{$a_big_num};

    The only real differences between them pop up if (a) you need to retain the order of insertion/deletion or (b) if you actually need to know the slot number ($idx) somewhere in your code. I'm guessing you don't need either, though, because you have absolutely no stated requirements on your mapping function that indicate it.

    In fact, you have *so* *few) requirements specified on your mapping function, that this one would meet all stated requirements:

    sub map_func { return 7 }

    Before you complain about the fact that there are collisions, you'd have to deal with that in any case because when you map a billion different integers into a thousand, your mapping function will always return collisions unless it's something like:

    sub map_func { state %indexes; state $idx; my $a_big_num = shift; return $indexes{$a_big_num} if exists $indexes{$a_big_num}; die "Ran out of slots!" if keys %indexes > 1000; return $indexes{$a_big_num}=$idx++; }

    But in this case, we're using a hash anyway, so there's little to be gained over just using the hash directly. (Again, assuming you don't need the aforementioned conditions (a) or (b).)

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Encoding function needed
by wollmers (Scribe) on Feb 09, 2015 at 13:00 UTC

    If your logic can work on a hash, I also suggest using a hash as a "sparse array". This also has the advantage that your indices are not restricted to positive integers. You can use anything that stringifies. Hashes are maybe a little slower than arrays, but with hashes you do not need the mapping and reverse mapping. In most cases the hash will be not (much) slower than map+array+remap.

    An example of sparse vector cosine calculation:
    Bag::Similarity::Cosine.

    An example of a matrix with mixed negative and positive indices:
    Align::Sequence::Ukkonen. See the "matrix" $F implemented as a hashref.

    Helmut Wollmersdorfer

Re: Encoding function needed
by Anonymous Monk on Feb 08, 2015 at 19:49 UTC

    What is the intended application? Are you willing to share?