Technically, if you want to see whether the hashref $hash has an element for key $var, you want exists. Even so, neither of those two examples should force a copy of %{$hash} (at least I don't believe so). However, if there were multiple levels such that you trigger autovivification (e.g. $hash->{'not there'}->{$var}), it would probably force a copy, since you're implicitly changing $hash's contents. And if you really want read-only, then Readonly will enforce it rather than leaving it to convention.
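To make the autovivification point concrete, here is a minimal sketch. The hash contents and key names are made up for illustration; the only thing it relies on is Perl's standard behaviour of creating intermediate hash levels when you dereference through them.

```perl
use strict;
use warnings;

# Hypothetical data: one existing top-level key.
my $hash = { present => 1 };
my $var  = 'wanted';

# Single-level check: reads only, creates nothing.
my $found_flat      = exists $hash->{$var} ? 1 : 0;
my $keys_after_flat = scalar keys %$hash;

# Two-level check: Perl must dereference $hash->{'not there'},
# so it silently creates that entry (an empty hashref) first.
my $found_nested      = exists $hash->{'not there'}->{$var} ? 1 : 0;
my $keys_after_nested = scalar keys %$hash;

# Guarded check: test the outer level first, so nothing is created.
my $found_guarded = ( exists $hash->{'also missing'}
        && exists $hash->{'also missing'}->{$var} ) ? 1 : 0;
my $keys_after_guarded = scalar keys %$hash;

print "top-level keys after flat check:    $keys_after_flat\n";     # 1
print "top-level keys after nested check:  $keys_after_nested\n";   # 2
print "top-level keys after guarded check: $keys_after_guarded\n";  # 2
```

The nested lookup is the one that writes to the hash, so that is the pattern to hunt for in code that runs after the fork.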
You might also want to consider using something like BerkeleyDB or DBD::SQLite to access your data from disk rather than pulling it all into RAM.
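As a rough sketch of the "leave it on disk" idea, the example below ties a hash to a file. Note that it uses the core SDBM_File module purely as a stand-in so that it runs without installing anything; it is not BerkeleyDB or SQLite, and SDBM limits the size of each key/value pair, so treat it as an illustration of the access pattern only. The file path and keys are invented.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT O_RDONLY);
use SDBM_File;                      # core module, stand-in for BerkeleyDB
use File::Temp qw(tempdir);

# Hypothetical location for the on-disk table.
my $dir  = tempdir( CLEANUP => 1 );
my $path = "$dir/lookup";

# Build the table once (this would happen before the workers start).
tie my %on_disk, 'SDBM_File', $path, O_RDWR | O_CREAT, 0666
    or die "Cannot create $path: $!";
$on_disk{alpha} = 'first record';
$on_disk{beta}  = 'second record';
untie %on_disk;

# Each worker opens it read-only and fetches single records on demand,
# so the data set never has to live in the process's own memory.
tie my %lookup, 'SDBM_File', $path, O_RDONLY, 0666
    or die "Cannot open $path: $!";
my $value   = $lookup{beta};
my $missing = exists $lookup{gamma} ? 1 : 0;
untie %lookup;

print "beta => $value\n";
print "gamma present? $missing\n";
```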
Update: Or Cache::Memcached might be worth looking into as well.
Update: Duur, Readonly not ReadOnly. Thanks to superfrink for the catch.
The cake is a lie.
Of those, BerkeleyDB is by far the fastest. SQLite does have the advantage of full SQL queries, though.
One thing you'll need to look for is places where a scalar's internal type might change. For example:
my $number = 10;
print "$number\n";
That second line is going to actually write to $number by changing it from a pure integer (IV) into a string/integer (PVIV). You can prove this by using Devel::Peek:
use Devel::Peek;
my $number = 10;
Dump($number);
print "Number is $number\n";
Dump($number);
Which outputs:
SV = IV(0x9a5cfe0) at 0x9a41774
REFCNT = 1
FLAGS = (PADBUSY,PADMY,IOK,pIOK)
IV = 10
Number is 10
SV = PVIV(0x9a42b10) at 0x9a41774
REFCNT = 1
FLAGS = (PADBUSY,PADMY,IOK,POK,pIOK,pPOK)
IV = 10
PV = 0x9a57758 "10"\0
CUR = 2
LEN = 4
If that variable were part of your shared pre-fork data, then that data page just got unshared, all 4k of it! It's possible to avoid this in some cases, for example by using printf() with a numeric format such as %d instead of interpolating into print() as above, but it's quite difficult to find all the possible cases.
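If you want to check this programmatically rather than by reading Dump output, something along these lines should work. It is only a sketch: has_cached_string is a helper name I made up, and it uses the core B module to test the private string flag (the one Devel::Peek shows as pPOK).

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: true if the scalar has acquired a cached
# string value (the pPOK flag in Devel::Peek output).
sub has_cached_string {
    my ($ref) = @_;
    return ( B::svref_2object($ref)->FLAGS & B::SVp_POK() ) ? 1 : 0;
}

my $formatted    = 10;
my $interpolated = 10;

# %d reads the integer slot directly, so the scalar is left alone.
my $line_a = sprintf "Number is %d\n", $formatted;

# Interpolation asks for the string form, which Perl caches in the scalar.
my $line_b = "Number is $interpolated\n";

my $formatted_touched    = has_cached_string( \$formatted );
my $interpolated_touched = has_cached_string( \$interpolated );

print "sprintf %d left a string behind?      $formatted_touched\n";
print "interpolation left a string behind?   $interpolated_touched\n";
```

A check like this could be dropped into a test that walks your shared data before and after a request to spot scalars that were upgraded.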
To learn more about how Perl manages memory check out perlguts.
-sam
use Devel::Peek;
use Readonly;
Readonly::Scalar my $number => 10;
Dump($number);
printf "Number is %d\n", $number;
Dump($number);
print "Number is $number\n";
Dump($number);
Output:
SV = PVMG(0x68ba80) at 0x65eb40
REFCNT = 1
FLAGS = (PADBUSY,PADMY,GMG,SMG,RMG)
IV = 0
NV = 0
PV = 0
MAGIC = 0x662c90
MG_VIRTUAL = &PL_vtbl_packelem
MG_TYPE = PERL_MAGIC_tiedscalar(q)
MG_FLAGS = 0x02
REFCOUNTED
MG_OBJ = 0x63c410
SV = RV(0x6ba0d0) at 0x63c410
REFCNT = 1
FLAGS = (ROK)
RV = 0x6dc2e0
SV = PVMG(0x68ba48) at 0x6dc2e0
REFCNT = 1
FLAGS = (PADBUSY,PADMY,OBJECT,IOK,pIOK)
IV = 10
NV = 0
PV = 0
STASH = 0x6dc120 "Readonly::Scalar"
Number is 10
SV = PVMG(0x68ba80) at 0x65eb40
REFCNT = 1
FLAGS = (PADBUSY,PADMY,GMG,SMG,RMG,pIOK)
IV = 10
NV = 0
PV = 0
MAGIC = 0x662c90
MG_VIRTUAL = &PL_vtbl_packelem
MG_TYPE = PERL_MAGIC_tiedscalar(q)
MG_FLAGS = 0x02
REFCOUNTED
MG_OBJ = 0x63c410
SV = RV(0x6ba0d0) at 0x63c410
REFCNT = 1
FLAGS = (ROK)
RV = 0x6dc2e0
SV = PVMG(0x68ba48) at 0x6dc2e0
REFCNT = 1
FLAGS = (PADBUSY,PADMY,OBJECT,IOK,pIOK)
IV = 10
NV = 0
PV = 0
STASH = 0x6dc120 "Readonly::Scalar"
Number is 10
SV = PVMG(0x68ba80) at 0x65eb40
REFCNT = 1
FLAGS = (PADBUSY,PADMY,GMG,SMG,RMG,pIOK,pPOK)
IV = 10
NV = 0
PV = 0x6519b0 "10"\0
CUR = 2
LEN = 8
MAGIC = 0x662c90
MG_VIRTUAL = &PL_vtbl_packelem
MG_TYPE = PERL_MAGIC_tiedscalar(q)
MG_FLAGS = 0x02
REFCOUNTED
MG_OBJ = 0x63c410
SV = RV(0x6ba0d0) at 0x63c410
REFCNT = 1
FLAGS = (ROK)
RV = 0x6dc2e0
SV = PVMG(0x68ba48) at 0x6dc2e0
REFCNT = 1
FLAGS = (PADBUSY,PADMY,OBJECT,IOK,pIOK)
IV = 10
NV = 0
PV = 0
STASH = 0x6dc120 "Readonly::Scalar"
_____
Update: ...similarly with Scalar::Readonly, btw
(though here the internal structures are as lean as without using the module):
use Devel::Peek;
use Scalar::Readonly ':all';
my $number = 10;
readonly_on($number);
Dump($number);
print "Number is $number\n";
Dump($number);
SV = IV(0x661dc8) at 0x65eb50
REFCNT = 1
FLAGS = (PADBUSY,PADMY,IOK,READONLY,pIOK)
IV = 10
Number is 10
SV = PVIV(0x63e130) at 0x65eb50
REFCNT = 1
FLAGS = (PADBUSY,PADMY,IOK,POK,READONLY,pIOK,pPOK)
IV = 10
PV = 0x654fb0 "10"\0
CUR = 2
LEN = 8
Ouch, if even your example causes a copy, then there will be a lot to clean up ;)
If you're on Linux, the Linux::Smaps module can help you tell how much is shared. It's impossible to keep it all shared, but your hash access is certainly not helping. Avoiding conversions between strings and numbers is also good.
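If installing a module is not an option, the same numbers can be pulled out of the raw smaps file by hand. The sketch below is just that, a sketch: summarize_smaps is a made-up helper, and the sample text is invented to show the line format. It sums the Shared_* and Private_* figures that the kernel reports per mapping.

```perl
use strict;
use warnings;

# Hypothetical helper: total the shared and private kB figures
# from the text of a /proc/<pid>/smaps file.
sub summarize_smaps {
    my ($text) = @_;
    my %total_kb = ( shared => 0, private => 0 );
    for my $line ( split /\n/, $text ) {
        if ( $line =~ /^(Shared|Private)_(?:Clean|Dirty):\s+(\d+)\s+kB/ ) {
            $total_kb{ lc $1 } += $2;
        }
    }
    return \%total_kb;
}

# Invented sample in the smaps line format, so the sketch runs anywhere.
my $sample = <<'END';
08048000-0804c000 r-xp 00000000 03:02 13130      /bin/example
Size:                 16 kB
Rss:                  12 kB
Shared_Clean:          8 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         4 kB
b7d00000-b7d21000 rw-p 00000000 00:00 0          [heap]
Size:                132 kB
Rss:                 100 kB
Shared_Clean:          0 kB
Shared_Dirty:         60 kB
Private_Clean:         0 kB
Private_Dirty:        40 kB
END

my $report = summarize_smaps($sample);
print "sample: shared $report->{shared} kB, private $report->{private} kB\n";
# prints "sample: shared 68 kB, private 44 kB"

# On a kernel that provides smaps, report on the current process too.
if ( open my $fh, '<', "/proc/$$/smaps" ) {
    local $/;                          # slurp the whole file
    my $text = <$fh>;
    my $live = summarize_smaps( defined $text ? $text : '' );
    print "this process: shared $live->{shared} kB, private $live->{private} kB\n";
}
```

Running it in a child right after the fork and again after some requests would show how quickly the private figure grows.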
Thanks, this sounds really interesting.
Sadly, there are no smaps files on our server, which runs kernel 2.6.8.
Do I need a special kernel version or option to use this? On my desktop (kernel 2.6.22) these files are available...
Blink.. blink.. Did you say, “‘loads’ 1.5 GB of data”? For good heaven's sake... into where?!?!?!
When you're dealing with a volume of data like this, absolutely the worst approach in the known Universe is to “load it into ‘memory’ and then access it randomly.” As a matter of fact, your very first mistake is to even attempt to access such a volume of data “randomly” ... at all!
First of all, in any modern computer system, “memory” is virtual. In other words, memory is a cached disk file, and any attempt to consider it in anything other than those terms is preordained for failure.
Come back with me to the days of yesteryear ... to the earliest days of computers, in fact to the days before digital computers. Let me re-introduce you to a world where searching was not necessary, albeit for the very-obvious reason that it could not be done.
Take the Unix sort command and try to sort that 1.5 gigabyte file. It'll take a few minutes but you just might be quite surprised at how little time it actually takes. (Study the options of sort just a little more and you might slash that time in half.)
Once a file is known to be sorted, it does not have to be searched. At least, not for the sort-key. And if you needed to have different versions of that file sorted by different keys, well, that would probably take a lot less time than the approach you're banging your head against right now!
When you know that a file is sorted, then you always know two very-important things:
- All occurrences of any key-value occur together.
- Therefore, when the last occurrence of key-value “X” has been seen, there are no more.
- If any “gaps” occur, it can be conclusively stated that there do not exist any records which can fill that gap... anywhere. You know this, without searching.
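Those three properties can be sketched in a few lines. This is only an illustration under stated assumptions: lookup_sorted is a name I invented, the records are tiny in-memory arrays standing in for lines read sequentially from a sorted file, and both the wanted keys and the records are assumed to be sorted in ascending string order.

```perl
use strict;
use warnings;

# Hypothetical helper: one forward pass over sorted records,
# answering a sorted list of wanted keys without any searching.
sub lookup_sorted {
    my ( $wanted, $records ) = @_;    # $records: [ [key, value], ... ]
    my %result;
    my $i = 0;
    for my $key (@$wanted) {
        # Skip everything that sorts before the wanted key.
        $i++ while $i < @$records && $records->[$i][0] lt $key;

        # All occurrences of a key sit together, so collect the run.
        # If the next record already sorts after $key, that is a gap:
        # no such record exists anywhere, and the run is empty.
        my @values;
        my $j = $i;
        while ( $j < @$records && $records->[$j][0] eq $key ) {
            push @values, $records->[$j][1];
            $j++;
        }
        $result{$key} = \@values;
    }
    return \%result;
}

# Invented sample data, already sorted by key.
my @records = ( [ apple => 1 ], [ apple => 2 ], [ banana => 3 ], [ kiwi => 4 ] );
my @wanted  = qw(apple cherry kiwi);

my $found = lookup_sorted( \@wanted, \@records );
for my $key (@wanted) {
    my @values = @{ $found->{$key} };
    print "$key: ", ( @values ? "@values" : "(none)" ), "\n";
}
# prints:
# apple: 1 2
# cherry: (none)
# kiwi: 4
```

The index only ever moves forward, which is what makes the same pattern work on a file that is far too large to hold in memory.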
Forget all of “this threading nonsense.”
I'm very sorry to say, re-think your entire approach! (Believe me, you'll be very glad that you did.) Think about it: what possible good will it do to subdivide the CPU's time in new-and-creative ways, when it can be quite-conclusively shown that it is only the speed of the I/O subsystem that has any conceivable relevance upon the solution-speed of this problem?
I'm serious: come to a full stop and read my posting over and over again about ten times. Then take the rest of the day off, come back Monday and read it another ten times.
Wait patiently until you say... “ahhhh-h-h-h-h...”
I mean absolutely no offense when I say... bring your sunglasses.
When that proverbial “little light” comes on, it is quite bright.