Re: Data structure challenge
by flyingmoose (Priest) on Mar 17, 2004 at 18:26 UTC
Seems interesting, but you kind of lost me. I'm easy to lose sometimes, so don't take this as a knock on your post.
IMHO, this would be more understandable if you used more of the terms found in the data structure references. Initially, your data structure looks like it's just pigeon-holing data (unrelated, but see bucket sort), but your example of the 'special' structure is a little confusing.
Terms like stack, queue, deque, trie, red-black tree, list, etc. are all there to facilitate easier communication. Anyhow, as it stands, you've lost me. Does this 'special structure' have a name? What are you trying to emulate? I can sort of grok the intent, but an explanation of the workings and the principle behind it is probably in order.
As it stands, making all classical operations O(1) is only doable when your data structure is as large as the domain space, right? In many cases, this means working on small domains or with huge amounts of memory. Result? It's not used very frequently. I understand this is just a question to see if it's possible, but where would this be used, and is there a niche for it? If one data structure fit everywhere, we wouldn't have the cool montage of data structures, of course, so I guess I've made your point :)
Also, a hash function that maps every key to a completely unique value (as noted in my previous paragraph) is expensive in both time and space. Maybe Perl is clever and doesn't make the underlying array as wide as the domain, leaving it sparse... but can it really? Not sure. This is why hashes collide and tend to be a little slow: allowing collisions keeps the hash function cheap (not speaking of Perl specifically, but hashing in general), and collapsing keys down to a smaller range is important.
Re: Data structure challenge
by gjb (Vicar) on Mar 17, 2004 at 18:11 UTC
The trouble is caused by initialisation, so what data structure can we use that is initialised in constant time? The answer may be to use simple strings for A and B. The length should be U*log_10(U), so that each group of log_10(U) characters represents an "array element". Initialisation can be done using pack, which should be constant time; array access can be simulated using substr, which is also constant time.
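gjb's string-as-array idea can be sketched roughly like this (my reconstruction, not gjb's code; the slot width `$W`, the example value of `$U`, and the helper names are made up for illustration):

```perl
use strict;
use warnings;

my $U = 1000;                   # domain size (assumed for illustration)
my $W = length $U;              # decimal digits per simulated element
my $A = '0' x ( $U * $W );      # one string holding U fixed-width slots

# Simulated array access via substr(); both operations are O(1).
sub set_elem {
    my ( $i, $v ) = @_;
    substr( $A, $i * $W, $W ) = sprintf '%0*d', $W, $v;
}

sub get_elem {
    my ( $i ) = @_;
    return 0 + substr( $A, $i * $W, $W );
}

set_elem( 7, 42 );
print get_elem( 7 ), "\n";   # 42
```

The catch, as the rest of the thread works out, is the first step: building the string with `x` (or pack) still touches all U*W bytes, so the initialisation itself is not O(1).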
Just my 2 cents, might be totally wrong, -gjb-
Initialisation can be done using pack that should be constant time
Well, that would be interesting. I'm not a pack() wizard, and I can't see how I would do it with pack. What format would I use?
Abigail
Wouldn't a pack(a[U*log_10(U)], 'z') do the trick?
Thanks Aristotle, I was too lazy/hurried to check the correct syntax. The length of the string we want to allocate is N = U*ceil(log_10(U)), so I hoped pack(aN, '') would create a string of N '\0' characters using some low-level memory allocation function. Unfortunately that doesn't seem to happen; the resulting string consists of N spaces, so there goes constant time... :(
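For what it's worth, the padding byte depends on the template letter (a small demo; either way perl writes every byte of the result, so the initialisation is still O(N)):

```perl
use strict;
use warnings;

# Lowercase 'a' pads with NUL bytes; uppercase 'A' pads with spaces.
my $nuls   = pack 'a5', '';
my $spaces = pack 'A5', '';

print $nuls   eq "\0" x 5 ? "a5 pads with NULs\n"   : "a5: something else\n";
print $spaces eq ' '  x 5 ? "A5 pads with spaces\n" : "A5: something else\n";
```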
Just my 2 cents, -gjb-
Re: Data structure challenge
by BrowserUk (Patriarch) on Mar 17, 2004 at 18:36 UTC
If amortizing the initialization cost over the lifetime of the structure were allowed, then simple Perl arrays could be used (by not preallocating the size of the array, so that the initial creation is still O(1)). Your use of strings 'suffers' from this same 'problem', so I'm pretty sure Abigail would reject this solution.
Your solution would qualify if you could preallocate the length of the string without Perl initializing the contents of the string. This is pretty trivial to do in XS, hence the disqualification of that method.
I'm not allowed to amortize the initialization over the life of the structure; but what about over the life of the program? Is it valid (in the context of this challenge) to allocate one big array (vec, whatever) at time zero and then manage it using my own allocator?
Just as tye says, the problem is in the initialization. If you assign a value in a vec() string and it's "out of bounds", perl will zero all the in-between values. That means that if you were to store just one value, say U - 1, that insertion will cost you Θ(U) time. Which isn't O(1).
Abigail
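The zero-filling Abigail describes is easy to observe (a small demo of vec() extending a string):

```perl
use strict;
use warnings;

my $v = '';
vec( $v, 9, 8 ) = 1;    # store one 8-bit value at index 9

# perl had to grow the string to 10 bytes, zeroing bytes 0..8 first:
print length $v, "\n";                                    # 10
print $v eq "\0" x 9 . chr 1 ? "zero-filled\n" : "?\n";
```

Storing a single element at index U - 1 therefore writes U bytes, which is exactly the Θ(U) cost mentioned above.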
I thought it was too easy.
So the whole challenge comes down to: is there any way to allocate/extend a piece of memory without initialising it, using pure Perl?
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Re: Data structure challenge
by BrowserUk (Patriarch) on Mar 17, 2004 at 21:53 UTC
use Benchmark::Timer;
my $T = Benchmark::Timer->new;
my( @auto, @pre );

$T->start( 'autoviv' );
$auto[ $_ ] = 1 for map { 10**$_ } 0 .. 7;
$T->stop( 'autoviv' );

$#pre = 10**7;    # preallocate ten million slots
$T->start( 'prealloc' );
$pre[ $_ ] = 1 for map { 10**$_ } 0 .. 7;
$T->stop( 'prealloc' );

$T->report;
1 trial of autoviv (93.750ms total)
1 trial of prealloc (15.625ms total)
That still doesn't achieve O(1), as the number of elements perl has to set up is still N, though it is somewhat quicker than autovivification.
That said, when initialising 16 MB of scalar to spaces only takes 36 milliseconds, this is one of those cases where big-O notation hides reality, I think.
use Benchmark::Timer;
my $T = Benchmark::Timer->new;
for ( 1 .. 10 ) {
    $T->start( 'scalar' );
    my $x = ' ' x 2**24;    # initialise 16 MB of spaces
    $T->stop( 'scalar' );
}
$T->report;
10 trials of scalar (359.375ms total), 35.938ms/trial
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Re: Data structure challenge
by BrowserUk (Patriarch) on Mar 18, 2004 at 14:36 UTC
There is an off-by-one error in the logic somewhere that I cannot find right now, but otherwise I think this complies with the rules of the challenge.
#! perl -slw
use strict;
local $, = ' ';
my( $A, $C, @B ) = ( undef, 0 );
open SPARSE, '+> :raw', \$A;
sub readA{
seek SPARSE, 4 * $_[ 0 ], 0;
sysread SPARSE, my $n, 4;
unpack 'N', $n;
}
sub writeA{
seek SPARSE, 4 * $_[ 0 ], 0;
syswrite SPARSE, pack( 'N', $_[ 1 ] ), 4;
}
sub query{
my $i = shift;
my $a = readA( $i ) || -1;
return 0 unless 0 <= $a and $a < $C;
my $b = $B[ $a ];
return $b == $i ? 1 : 0;
}
sub insert{
my $i = shift;
if( !query( $i ) ) {
$B[ $C ] = $i;
writeA( $i, $C );
$C++
}
}
sub del{
my $i = shift;
if( query( $i ) ) {
my $a = readA( $i );
my $b = $B[ --$C ];
$B[ $a ] = $b;
writeA( $b, $a );
}
}
sub clear{
$C = 0;
}
my @numbers = map{ int rand( 2**16 ) } 1 .. 10;
print 'Query before insert';
printf "%5d : %s\n", $_, query( $_ ) for @numbers;
insert $_ for @numbers;
print "\nQuery after insert";
printf "%5d : %s\n", $_, query( $_ ) for @numbers;
del $_ for @numbers[ 0..4 ]; ## Delete half
print "\nQuery after deleting half";
printf "%5d : %s\n", $_, query( $_ ) for @numbers;
del $_ for @numbers; ## Delete them all, regardless of whether they still exist
print "\nQuery after deleting the rest";
printf "%5d : %s\n", $_, query( $_ ) for @numbers;
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Update: this is stupid; see BrowserUk's reply for why.
I would not think this is constant time. I know nothing about how sparse files are implemented, but I would not think they let you access an arbitrary byte in constant time.
(If the filesystem does not have sparse files, the situation is even worse, as the OS must fill the file with zero bytes when seeking past the end.)
Okay. Forget the bareword SPARSE; it is just a filehandle, opened (R/W) to an in-memory file, per perlfunc:open (5.8.x):
File handles can be opened to ``in memory'' files held in Perl scalars via:
open($fh, '>', \$variable) || ..
My inspiration was that if you seek on an in-memory file, the string is extended without being initialised. Of course it mustn't be initialised if it is going to emulate a file: if you seek to a position in a random-access file, you don't want everything preceding it to be overwritten.
So: a pure-Perl way of allocating a buffer without initialisation.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Re: Data structure challenge
by dragonchild (Archbishop) on Mar 17, 2004 at 21:22 UTC
What about using soft references? I used BrowserUk's harness and changed the engine around a little.
Now, I think doing this is cheating a little, because I'm using the symbol table, which is a hash. But I don't think this is possible in pure Perl otherwise, not with the constraints you've placed on it.
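dragonchild's engine isn't quoted here, but the symbol-table trick presumably looks something like this (my reconstruction; the `A::` namespace and the sub names are made up):

```perl
use strict;
use warnings;
no strict 'refs';    # soft references: the symbol table acts as our sparse array

sub insert_elem { ${ "A::elem_$_[0]" } = 1 }
sub query_elem  { defined ${ "A::elem_$_[0]" } ? 1 : 0 }
sub delete_elem { undef ${ "A::elem_$_[0]" } }

insert_elem( 12345678 );
print query_elem( 12345678 ), "\n";   # 1
print query_elem( 42 ), "\n";         # 0
delete_elem( 12345678 );
print query_elem( 12345678 ), "\n";   # 0
```

As the post itself concedes, the symbol table is a hash under the hood, so this sidesteps the challenge rather than solving it.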
------
We are the carpenters and bricklayers of the Information Age.
Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.
I came up with something similar using a hash to emulate a sparse array, but then realised that I might as well just use the hash to start with and save all the bother. Besides, the hash scales better: the overhead of allocating 4 GB of RAM in order to store 3 unique 32-bit integers (for example) is far higher than hashing those 3 integers.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Re: Data structure challenge
by toma (Vicar) on Mar 18, 2004 at 06:26 UTC
I am not offering this as a solution to your challenge, but I suggest looking at the thread Compressing a set of integers. I think that I had a similar problem, but I had a few more opportunities for optimization than you have presented.
With help from the others in the thread, I was able to use pack to make a really cool data structure that is similar to a sparse bit vector. My problem had the potential optimization that some bits were more likely than others, and I could reorder.
For the problem I was trying to solve, Table::ParentChild ended up getting written along the way. For the deployed application, I ended up using a relational database, which performs well enough today in production.
I also suggest, for practical solutions, that you consider algorithms which are O(log(n)), especially those with small constant factors.
It should work perfectly the first time! - toma
Re: Data structure challenge
by iburrell (Chaplain) on Mar 17, 2004 at 19:33 UTC
With Perl, you can make initializing and clearing cheap by relying on autovivification with arrays. Allocate @A as an empty array; setting any value will expand the array, filling the intervening values with undef. That expansion is O(N), but it is fast. Clearing is just emptying out the entire array. It also has the advantage that the size of the array is bounded by the largest index used. Your data structure uses autovivification for the A array, so it has the same performance but more overhead.
It is also important to realize how much the constant factors matter to the running time. Preallocating the array pays an O(U) cost in initialization but makes each subsequent operation more efficient. When inserting N items with N close to U, this can be significant.
Setting any value will expand the array and set the other values to undef.
Well, yes, but that's exactly what the challenge wants you to avoid. You don't want to spend time creating undefined values.
Your data structure uses autovivication for the A array so it has the same performance but more overhead.
No. It's important to realize I did not describe the A array in Perl terms. Think C, or even better, assembler. Just declare an array as a chunk of memory where integers are stored. No initialization of the elements allowed; just accept whatever values happen to be in memory.
Abigail
Re: Data structure challenge
by ambrus (Abbot) on Mar 18, 2004 at 07:59 UTC
This is an interesting question.
If I read you correctly, it is like this: you could get an only slightly slower solution by using a hash, or better, a binary tree, instead of the uninitialized array A (with a binary tree, you get O(log n) instead of O(1) for each operation). In fact, this is true no matter how you use an uninitialized array; that is, you can always use a hash or a binary tree instead of one.
Notice, however, that if you use such a data structure only once, and you use the efficient C implementation, the OS still has to initialize the array when the process gets the memory by calling brk or mmap.
with a binary tree, you get O(log n) instead of O(1) for each operation
Well, only if the binary tree is balanced (and ordered). I am very well aware of the performance of a binary tree; if I wanted to settle for O(log n) performance, I wouldn't have posted the challenge. The challenge is to do all operations in O(1); doing one or more in O(log n) is not a challenge.
Notice, however, that if you use such a data structure only once, and you use the efficient C implementation, the OS still has to initialize the array when the process gets the memory by calling brk or mmap.
While some C libraries might initialize the memory claimed with mmap or brk, there's no reason they have to. The OS and/or the C libraries can always be patched so that one can claim a segment of memory and take it "as is". For the scope of the challenge, you may assume that we run on a platform where memory is given out "as is".
Abigail
When I said binary tree, I actually thought of a trie. I know it's still O(log n) and so does not satisfy the conditions, but a binary trie's code is not as complicated as that of a balanced binary tree.
As for initializing memory: the C library certainly won't initialize the whole array when calling malloc. However, on a multi-user OS, the system has to initialize all the pages allocated to a process, for security reasons. If it did not do that, you could get at sensitive information by allocating a lot of memory and searching it for data that other processes had freed but not zeroed.
This does not take much time, because: 1. malloc tries to reuse the memory freed by the process, so it doesn't have to ask for a new chunk every time; 2. the OS can do the zeroing in idle time, so you don't have to wait for it when you allocate the memory; 3. zeroing a chunk of memory is fast.
The first reason may not count with very large arrays, as malloc gets those with mmap, and when they're freed they're munmapped.
The question you asked still holds, however, because such a structure is useful even on such OSes, as you may use it many times, and you don't have to zero it every time the same process clears and reuses it. I still don't know if it can be done in Perl.
No, the OS doesn't necessarily have to initialize the array. Many compilers will do that, but all malloc() should be doing is saying "I will be using this space for a while". What will be there is not guaranteed at all.
------
We are the carpenters and bricklayers of the Information Age.
Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.
Re: Data structure challenge
by Anonymous Monk on Mar 17, 2004 at 20:50 UTC
Is this not impossible using the data structures available to us within Perl? If we do not initialize before using a structure, then perl will autovivify, which is nearly as bad. So to make this work with any kind of array system, you would need to bypass perl's autovivification. Of course, you could skip any type of complicated data structure and stick with manipulating scalars. Fixed-length fields, perhaps? Or something silly like that ;)
Re: Data structure challenge
by dragonchild (Archbishop) on Mar 18, 2004 at 13:31 UTC
I've been thinking about this. Obviously, this is do-able in C (and similar languages), as you have C-like pseudo-code in your challenge. But, other than sparse arrays1, what would the practical benefit of this data structure be? If you add in a hashing function to map strings to your set of i, you still have the collision issue.
Are you considering this as a possible implementation of the arrays Perl's hashes use? It would seem kinda wasteful to add an indexing array just to avoid clearing and initializing in O(n)...
1 I say sparse arrays because in the 2-array method, U doesn't have to be the upper limit of all i. It simply has to be the upper limit of how many i you can have at one time. So, if you know you'll only ever have a max of 10 things, they could be drawn from the set of 2**32, but you would only need two 10-element arrays and a short as your C.
------
We are the carpenters and bricklayers of the Information Age.
Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.
But, other than sparse arrays1, what would be practical benefit of this data structure be?
Worst-case constant time operations, regardless of how many elements there are in the data structure.
If you add in a hashing function to map strings to your set of i, you still have the collision issue.
Well, yes. Don't do that then. ;-) As I said, it only 'works' if you know certain things about your set; in this case, you know you are only going to store non-negative integers not exceeding U.
Are you considering this as a possible implementation of the arrays Perl's hashes use?
No.
I say sparse arrays because in the 2-array method, U doesn't have to be the upper limit of all i. It simply has to be the upper limit of how many i you can have at one time. So, if you know you'll only ever have a max of 10 things, they could be in the set of 2**32, but you would only need 2 10-element arrays and a short as your C.
In the 2-array method I described, it's essential that U is the upper limit of all i, because i is used as an index into the array A.
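For reference, the two-array trick itself fits in a few lines. The sketch below uses plain Perl arrays, so the growth of @A is not O(1) here (that lazy initialisation is the whole sticking point of the thread); the interesting part is the invariant: i is a member iff 0 <= A[i] < C and B[A[i]] == i, which is what makes garbage in A harmless.

```perl
use strict;
use warnings;

my ( @A, @B, $C );
$C = 0;

sub member {
    my $i = shift;
    my $a = $A[ $i ];
    # A[i] may be garbage (here: undef); cross-check it against B.
    return defined $a && $a < $C && $B[ $a ] == $i ? 1 : 0;
}

sub insert {
    my $i = shift;
    return if member( $i );
    $A[ $i ]   = $C;
    $B[ $C++ ] = $i;
}

sub remove {
    my $i = shift;
    return unless member( $i );
    my $a    = $A[ $i ];
    my $last = $B[ --$C ];   # move the last member into the vacated slot
    $B[ $a ]    = $last;
    $A[ $last ] = $a;
}

sub clear { $C = 0 }         # O(1): stale entries in @A and @B become garbage

insert( 7 ); insert( 3 );
print member( 7 ), member( 3 ), member( 5 ), "\n";   # 110
remove( 7 );
print member( 7 ), member( 3 ), "\n";                # 01
clear();
print member( 3 ), "\n";                             # 0
```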
Abigail
Re: Data structure challenge
by bsb (Priest) on Mar 22, 2004 at 23:34 UTC
Does $#A = $#B = U initialize the arrays correctly?
Brad
Yes, Perl has to initialize all the array elements. Internally, it stores pointers (to SVs), and if Perl didn't initialize the array elements, it would end up with random memory addresses, with no way of finding out whether an element had been initialized or not.
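You can see the effect at the Perl level (a small demo; the array length is arbitrary):

```perl
use strict;
use warnings;

my @A;
$#A = 9;    # preallocate: perl must set up all 10 slots

print scalar @A, "\n";                           # 10
print defined $A[5] ? "defined\n" : "undef\n";   # undef: each slot was initialised
```

Every slot reads back as a clean undef rather than garbage, which is only possible because perl wrote a pointer for each of them during the `$#A = 9` assignment.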
Abigail
As I thought about it, I began to suspect that. XS of course doesn't count.
Does this cover virtually programming in C via syscall(&SYS_mmap,..) and pack-ing data structures?
Otherwise, I give up. In fact, I give up anyway; I'm not going near syscall.