in reply to Efficiency and Large Arrays

The bit about 'increment the serial number until it's unique' raises a red flag for me. Two solutions come to mind: either put things in a nice relational database (especially one with a feature like an auto-increment id column), or find some other sort of unique identifier.
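
For instance, here's a minimal sketch with DBI and DBD::SQLite (the driver choice and table layout are just assumptions for illustration; any database with an auto-increment column works the same way):

    use DBI;

    # Hypothetical sketch: let the database hand out the serial numbers.
    my $dbh = DBI->connect("dbi:SQLite:dbname=records.db", "", "",
                           { RaiseError => 1 });
    $dbh->do("CREATE TABLE IF NOT EXISTS records (
                  id    INTEGER PRIMARY KEY AUTOINCREMENT,
                  phone TEXT)");
    $dbh->do("INSERT INTO records (phone) VALUES (?)", undef, "555-1212");

    # The database guarantees this id is unique, even across runs.
    my $serial = $dbh->last_insert_id(undef, undef, "records", "id");
    print "assigned serial $serial\n";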

If you're already creating a new hash, you can use its reference (yes, you read that right) as a unique ID. (I've seen this used as keys for a Flyweight object, pretty cool!) References are guaranteed to be unique, as they have something or other :) to do with memory locations:

    $ perl
    my $h_ref = {};
    print "$h_ref\n";
    ^D
    HASH(0x80d761c)
You can get rid of everything except the hex digits with a simple tr/// statement: tr/a-f0-9//dc;. That's quicker than scanning for unique numbers.
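
Something like this, for instance (a minimal sketch):

    my $h_ref = {};
    my $id    = "$h_ref";      # e.g. "HASH(0x80d761c)"
    $id =~ tr/a-f0-9//dc;      # delete everything that isn't a hex digit
    print "$id\n";             # prints "080d761c" (the leading 0 is from "0x")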

Still, there's something I can't quite put my finger on here... perhaps you could show us your intended data structure?

RE: Re: Efficiency and Large Arrays
by fundflow (Chaplain) on Jul 23, 2000 at 03:22 UTC
    This is real overkill, don't you think?

    Also, what happens when you run the script the second time?
    Who guarantees that the number you got (a memory location) doesn't already appear somewhere else in subsequent records?

    If he renumbers anyway, then any number will do. Instead of using perl's heap pointer, it would be easier to pick one explicitly.
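
    For example, a plain counter would do (a sketch; the record structure is made up):

        my $serial = 0;
        for my $record (@records) {
            $record->{id} = ++$serial;   # any monotonically increasing number works
        }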

    Anyway, his problem seems to lie in the memory use more than in the numbering scheme.
      Also, what happens when you run the script the second time?

      Yes, that's a problem if the serial numbers need to be maintained across runs. If this is a one-time-per-dataset operation, and the serial numbers exist just while manipulating the data, it doesn't really matter.

      Anyway, his problem seems to lie on the memory use more than the numbering scheme.

      But the reason he's keeping all the old records around is to make sure he doesn't reuse a number. If he uses a unique identifier (the reference value is unique, automatically generated, and readily available), he doesn't have to keep all of the records around in memory.

      The thing that bothered me was using grep to look for already-used phone numbers. What if they were the primary key of the hash? Then, it's a simple lookup to see if one's already used.
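
      Something along these lines (a sketch; the field names are made up):

          my %by_phone;
          for my $rec (@records) {
              # O(1) duplicate check instead of a grep over the whole list
              next if exists $by_phone{ $rec->{phone} };
              $by_phone{ $rec->{phone} } = $rec;
          }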

        Having a hash instead of grepping is of course better (although it takes more memory).

        The idea of using the memory reference returned by perl's internal heap mechanism is interesting, but I'm not sure it buys much here.

        Anyway, the original post is "walking on the edge" of usability. If his files are much bigger than the computer's memory, the hash won't fit either, and there are better ways, such as using a database, doing multiple passes, etc. (or keeping the files clean in the first place...)
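
        For instance, a tied on-disk hash (a sketch using DB_File; any DBM module would do, and the filename is made up):

            use DB_File;
            use Fcntl;

            # The hash lives on disk, so it doesn't have to fit in RAM.
            my %seen;
            tie %seen, 'DB_File', 'seen.db', O_RDWR|O_CREAT, 0666, $DB_HASH
                or die "Cannot tie seen.db: $!";
            $seen{'555-1212'} = 1;
            untie %seen;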

        Cheers.
Re^2: Efficiency and Large Arrays
by diotalevi (Canon) on Dec 12, 2002 at 14:51 UTC

    Instead of using the string form of the hash reference, just take the numeric form to begin with. If you want the hex form, just pack or sprintf it.

    $ perl
    my $h_ref = {};
    print 0+$h_ref, "\n";
    # pack('L') is native byte order, so unpack('H*', pack('L', ...)) comes out
    # byte-reversed on little-endian machines; printf "%X" is portable:
    printf "%X\n", 0+$h_ref;
    ^D
    135099932
    80D761C

    __SIG__
    use B;
    printf "You are here %08x\n",
        unpack "L!", unpack "P4", pack "L!",
        B::svref_2object(sub{})->OUTSIDE;