Using tie to initialize large datastructures

htoug has asked for the wisdom of the Perl Monks concerning the following question:

Re: Using tie to initialize large datastructures by ariels (Curate) on Aug 08, 2001 at 10:56 UTC
You could `tie` a hash just as you described. This will give your structure a hash-like interface, which is either a pro or a con. Another way would be to use Dominus' excellent Memoize module (I wrote a review of the module). Using it could be as simple as writing a function `get_data_structure_from_database`, saying `memoize 'get_data_structure_from_database'`, and continuing to use that function!
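A minimal sketch of the Memoize suggestion above. The body of `get_data_structure_from_database` and the `$db_hits` counter are stand-ins invented for illustration; only `memoize` itself comes from the module.

```perl
use strict;
use warnings;
use Memoize;

my $db_hits = 0;    # illustration only: counts how often the "database" is hit

# Hypothetical stand-in for an expensive, read-only database lookup.
sub get_data_structure_from_database {
    my ($table) = @_;
    $db_hits++;
    return { table => $table, rows => [ 1 .. 3 ] };
}

# From here on, repeated calls with the same arguments are served from
# Memoize's cache instead of running the function body again.
memoize('get_data_structure_from_database');

my $first  = get_data_structure_from_database('zip_codes');
my $second = get_data_structure_from_database('zip_codes');

print "database hits: $db_hits\n";
```

Because the cached value is a reference, both calls hand back the same structure, so callers should treat it as read-only.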
Re: Re: Using tie to initialize large datastructures by da (Friar) on Aug 08, 2001 at 20:54 UTC
I would like to add a second vote for `use Memoize;`: your lookup functions are expensive and read-only, exactly what memoization is good for. In addition to the behavior ariels mentions above, I'd like to point out that it is easy to store the memoized data persistently using `tie`, no muss, no fuss.

    use DB_File;
    tie my %cache => 'DB_File', $filename, O_RDWR|O_CREAT, 0666;
    memoize 'function', SCALAR_CACHE => [HASH => \%cache];

In addition, Memoize makes a good profiler; details are available in Dominus's article.

`___ -DA > perl -MPOSIX -e'$ENV{TZ}="US/Eastern";print ctime(10**9)' Sat Sep 8 21:46:40 2001`
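To see what the `SCALAR_CACHE => [HASH => \%cache]` option does without needing DB_File, here is a sketch that hands Memoize an ordinary in-memory hash instead of a tied one. The function `city_for` and its data are invented for illustration; swapping `%cache` for a hash tied to a DBM file is what would make the cache persistent.

```perl
use strict;
use warnings;
use Memoize;

my $lookups = 0;    # illustration only

# Hypothetical expensive lookup; the data is made up.
sub city_for {
    my ($zip) = @_;
    $lookups++;
    my %city = ( '1000' => 'Copenhagen', '8000' => 'Aarhus' );
    return $city{$zip};
}

# An ordinary hash standing in for the tied DB_File hash.
my %cache;
memoize( 'city_for', SCALAR_CACHE => [ HASH => \%cache ] );

my $city  = city_for('8000');    # scalar context, so SCALAR_CACHE is used
my $again = city_for('8000');    # answered from %cache

print "lookups: $lookups\n";
print "cached keys: @{[ sort keys %cache ]}\n";
```

Memoize stores scalar-context results in the hash you supply, keyed by the normalized argument list, which is why any hash-like storage can be plugged in.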
Re: Using tie to initialize large datastructures by Zaxo (Archbishop) on Aug 08, 2001 at 11:20 UTC
It would be helpful if you posted some code and example data. This sounds as if you read the entire database into memory. If so, that will certainly slow things down. Are you using DBI.pm to read the data? A few suggestions:

1. Look into cached database connections.
2. Consider cached, prepared select statements with placeholders.
3. Look at what DBI::bind() can do for you.
4. Get ruthless with globals; replace them with lexicals which only hold data you need.
5. Benchmark. No rule of thumb can replace actual performance measurements.

A `tie` class may be useful, but a few `sub`s returning data, given a key, are likely to work as well. Of course, all this is speculative and not necessarily useful.

After Compline,
Zaxo

Update: Changed list to numbered format for reference. Thanks for the extra info on your requirements. I'm sorry to admit that I'm unfamiliar with any of the mechanisms you cite (Apache::ASP, CORBA through the COPE modules). You don't appear to use either CGI.pm or DBI.pm. (Update²: I'm informed that htoug uses DBI.) You might try #5 'Benchmark' right away, to see where the resource hogs are. Given your security requirements, #4 is all the more important. As a design issue, I'd suggest starting from the user interface and seeing how few SQL statements you need to support it.
Re: Re: Using tie to initialize large datastructures by htoug (Deacon) on Aug 08, 2001 at 12:44 UTC
I'm definitely not trying to read the entire database into memory. That would take about 40GB, and we don't have that much available for each apache process!

The system is a 3 tier system, with an apache frontend (written using Apache::ASP, handling the formatting of data, session handling etc), a set of application servers (communicating with apache using CORBA through the COPE modules), and the database (about 40 GB of data in ~800 tables, the largest containing >130 million rows, accessed using DBI, DBD::Ingres {which I wrote} etc), all on different machines. The database contains very sensitive data, so security is important.

We have some (about 50-100) tables that contain things like zip codes, department addresses, typecodes, and so on ad nauseam. Some are small, some are big, others huge; it varies. In the frontend code (on apache) we need, for example, to create selectboxes that let the user choose between different options, based on the content of the constant tables. A possibility would be to fetch the data every time it is needed:

    my $zip = $zip_server->get_zip_codes();
    print "selectbox-header";
    for (@$zip) {
        print "selectbox line";
    }
    print "selectbox-end";

or something like that. This will take quite a while, and soon you discover the need for caching the data. So you try something like:

    # ...in common initialisation code...
    our $zip;
    $zip = $zip_server->get_zip_codes();

    # ...where the zip-code is needed...
    print "selectbox-header";
    for (@$zip) {
        print "selectbox line";
    }
    print "selectbox-end";

This is fast, but it takes more and more memory as the number of constants rises. So the next version could be something like:

    # ...in the common initialisation code...
    our $zip;
    sub zip_init {
        $zip = $zip_server->get_zip_codes() unless $zip;
    }

    # ...at every use...
    zip_init();
    print "selectbox-header";
    for (@$zip) {
        print "selectbox line";
    }
    print "selectbox-end";

That is fast, easy and does not consume unnecessary amounts of memory. The downside is that you have to remember to call zip_init before you use $zip. Sometimes you forget, and spend an excessive amount of time scratching your head and trying to fathom what went wrong. So I would like something like:

    # ...in initialisation section...
    our $zip;
    tie $zip ....    # magic here
    sub ZIP::TIE::FETCH {
        # smoke and mirrors here
        $zip = $zip_server->get_zip_codes();
        untie $zip;    # and leave the data in $zip
    }

    # ...and where we use it...
    print "selectbox-header";
    for (@$zip) {
        print "selectbox line";
    }
    print "selectbox-end";

Note that there are no zip_init or fetch calls, just the plain ordinary access to a variable. At the first reference to the variable the tie magic kicks in, retrieves the data, and removes the magic, leaving the 'naked' variable. This gives:

- no need to remember the initialisation incantations (we all forget things too often)
- no performance overhead

Did that clarify what I need?
Re (tilly) 3: Using tie to initialize large datastructures by tilly (Archbishop) on Aug 08, 2001 at 16:40 UTC
I think you are trying to solve the wrong problem.

First of all, gratuitous globals are a sign of a poor design. I would use an access function, and (depending on what made sense) I would have it memoize results. Much cleaner design, and your issue never arises. Unless your program is truly performance sensitive (the odds are very low that it is), trying to optimize beforehand at the expense of maintainability is a losing game.

However, the second issue is technical. In the middle of calling an implementation of a tie, you don't have access to information about the tie. A tie just replaces a data structure with a wrapper around an object. But from the point of view of the object call, it is just an object call. You are not told what variable you are being called with, and said variable may not even be in any scope you can access. (Think about tying a lexical variable.)

Now the technical issue I could find a hack around. But the maintainability issue makes me really not want to...
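The access-function design can be sketched roughly as follows. The names `zip_codes` and `fetch_zip_codes_from_server` are hypothetical, and the fetch routine is a stand-in for the real call to the application server.

```perl
use strict;
use warnings;

my $fetches = 0;    # illustration only: counts trips to the "server"

# Hypothetical stand-in for $zip_server->get_zip_codes().
sub fetch_zip_codes_from_server {
    $fetches++;
    return [ '1000', '2000', '8000' ];
}

{
    my $zip;    # private cache, visible only to zip_codes()

    # Access function: fetches on first use, returns the cached data afterwards.
    sub zip_codes {
        $zip = fetch_zip_codes_from_server() unless defined $zip;
        return $zip;
    }
}

print "selectbox line: $_\n" for @{ zip_codes() };
print "selectbox line: $_\n" for @{ zip_codes() };
print "fetches: $fetches\n";
```

Callers cannot forget an init step because the only way to reach the data is through the function, and no global is exposed.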
Re: Using tie to initialize large datastructures by clemburg (Curate) on Aug 08, 2001 at 17:18 UTC
Would something like this suit your needs (demo just for scalar variables)?

Approach: tie() builds up a mapping between tied objects and fully qualified subroutine names. When the tied object is asked for its value, we call the subroutine with the name passed when tie()ing the object. The value is then cached in the tied object.

Benefits:

- Existing code can stay the same; you just need to tie() the global variables.
- Existing init routines can be reused.
- If you pass in a fully qualified name for a package global or a reference to a lexical variable as an additional argument to tie(), the tie to this variable will be undone after first use. (Thanks to tilly for pointing out this works with lexicals, too!)

Update: I see you called for something more: untie()ing the object after first use. For lexical ("my") variables, I don't currently see how to do this, since we have no access to them inside the FETCH function (Ah, we can have that - thanks tilly again - see above). For true package globals, it's easy: just set the (in this case scalar) entry of the glob in the package you call the FETCH from to the new value and remove the tie(). Hm ... let's see if this works ... yup it does!

Demo code:

    #!/usr/bin/perl -w

    use strict;

    # show how to tie scalars to existing init routines the lazy way

    $| = 1;

    # --------------------------------------------------

    package LegacyRoutines;

    use vars qw($AUTOLOAD);

    sub foo {
        print __PACKAGE__ . "::foo magically called\n";
        return 42;
    }

    sub baz {
        print __PACKAGE__ . "::baz magically called\n";
        return "hooray";
    }

    # no bar routine here - catch errors

    sub AUTOLOAD {
        "LegacyRoutines: undefined subroutine $AUTOLOAD called\n";
    }

    # --------------------------------------------------

    package MyGlobals;

    # global to map objects to associated init routine names
    my %mappings;

    # global to memorize package globals to initialize
    my %vars;

    sub TIESCALAR {
        my $class = shift;
        my ($name, $var) = @_;
        bless \ (my $self), $class;
        $mappings{\$self} = $name;
        $vars{\$self} = $var;
        return \$self;
    }

    sub FETCH {
        print __PACKAGE__ ."::FETCH called\n";
        # $_[0] - alias to original object ref we stored in %mappings
        my $value;
        if (not defined ${$_[0]}) {
            print "Initializing $_[0] ... \n";
            # check if we have an entry for that object
            if (not exists $mappings{$_[0]}) {
                print "No matching subroutine for ", $_[0], "\n";
                return $_[0];
            }
            # call to init routine associated with $self
            no strict 'refs';
            # set original value
            ${$_[0]} = &{ $mappings{$_[0]} }();
            # remember it
            $value = ${$_[0]};
            # untie the variable, but only if a name or reference was passed
            # to tie() (the entry always exists, so test for defined)
            if (defined $vars{$_[0]}) {
                untie ${$vars{$_[0]}};
            }
            return $value;
        }
        return ${$_[0]};
    }

    sub STORE {
        # whatever you want
    }

    sub DESTROY {
        # whatever you want
    }

    # --------------------------------------------------

    package main;

    use vars qw($foo);

    tie($foo, "MyGlobals", "LegacyRoutines::foo", "main::foo");
    tie(my $bar, "MyGlobals", "LegacyRoutines::bar");
    tie(my $baz1, "MyGlobals", "LegacyRoutines::baz");
    tie(my $baz2, "MyGlobals", "LegacyRoutines::baz"); # make $baz2 a de-facto alias to $baz1

    print $foo, "\n";
    print $foo, "\n";
    print $bar, "\n";
    print $baz1, "\n";
    print $baz1, "\n";
    print $baz2, "\n";
    print $baz2, "\n";

Christian Lemburg
Brainbench MVP for Perl
http://www.brainbench.com
You can hack anything... by tilly (Archbishop) on Aug 08, 2001 at 18:07 UTC
Write an initialization function that accepts two arguments: a reference to the variable you wish to tie, and the function you want to provide its initial value. The initialization function then ties the variable, passing a reference to that same variable as one of the arguments to tie. Now the FETCH routine has what it needs for the untie logic.

Note, though, that I would avoid this solution. To me, using so many globals that initializing them all takes too much memory is the real problem, and finding ways to enable that mistake to be extended is worse than fixing the mistake...
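One way the hack described above might look, as a sketch rather than a recommendation. The class name `LazyScalar`, the helper `lazy_init`, and the stand-in `get_zip_codes` are all invented for illustration. FETCH receives a reference to its own variable through the tie arguments, so it can untie that variable and leave the plain value behind.

```perl
use strict;
use warnings;

package LazyScalar;    # hypothetical class name

use Carp;

# Remember which variable we are tied to and how to initialise it.
sub TIESCALAR {
    my ($class, $var_ref, $init) = @_;
    return bless { var => $var_ref, init => $init }, $class;
}

# First read: compute the value, remove the tie, store the plain value.
sub FETCH {
    my ($self) = @_;
    my $value   = $self->{init}->();
    my $var_ref = $self->{var};
    {
        # $self still references the tie object here, which would
        # otherwise trigger an "inner references" warning from untie.
        no warnings 'untie';
        untie $$var_ref;
    }
    $$var_ref = $value;
    return $value;
}

# Assumption for this sketch: the data is constant, so writes are refused.
sub STORE { croak "lazily initialised constant is read-only until first use" }

package main;

my $calls = 0;    # illustration only

# Hypothetical stand-in for $zip_server->get_zip_codes().
sub get_zip_codes { $calls++; return [ '1000', '2000', '8000' ] }

# The two-argument initialisation function described above.
sub lazy_init {
    my ($var_ref, $init) = @_;
    tie $$var_ref, 'LazyScalar', $var_ref, $init;
    return;
}

our $zip;
lazy_init( \$zip, \&get_zip_codes );

print "calls before first use: $calls\n";
print "selectbox line: $_\n" for @$zip;    # first access triggers FETCH
print "selectbox line: $_\n" for @$zip;    # plain variable by now
print "still tied: ", ( tied($zip) ? "yes" : "no" ), "\n";
print "calls after use: $calls\n";
```

While the tie is in place the object and the variable reference each other, so a lexical that is tied but never read would not be freed until it is untied.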
Re: You can hack anything... by clemburg (Curate) on Aug 08, 2001 at 18:51 UTC
I am definitely with you on the point of using so many globals. I would never consider doing something like this on many global variables just because of efficiency concerns. A design that needs such hacks is probably flawed. OTOH, I found this to be an interesting problem with respect to tie() usage.

As for your suggestion - thanks! This really works. Funny. You can even use the code as it stands. Just pass in a ref to the lexical, and you're done. Like this:

    my $baz3;
    tie($baz3, "MyGlobals", "LegacyRoutines::baz", \$baz3);
    print $baz3, "\n";
    print $baz3, "\n";

Christian Lemburg
Brainbench MVP for Perl
http://www.brainbench.com
Re (tilly) 2: You can hack anything... by tilly (Archbishop) on Aug 08, 2001 at 20:40 UTC
Why are you even bothering to do it that way?!? by dragonchild (Archbishop) on Aug 08, 2001 at 18:30 UTC
I'm failing to see the compelling reason for designing your system in this fashion. I would look at doing something that has the following characteristics:

- Is encapsulated. You (the requesting script) do not know how the thing does what it does. All you care about is that it does what it promises to do, which is retrieve your data.
- Is fast. You want it to give you the data you request in a minimum amount of time.
- Is fast. You want it to load in a minimum amount of time.
- Is small. You want it to use the least amount of memory.

Sounds pretty tough, huh? Well, it's not. What you are looking for is not a datastructure, but an object. YAY-US! You, too, can be a part of the O-O revolution, my friend! You can be HEE-ULLED of your pro-see-ju-rull ways!

What you're looking for is not something that loads all your data at once. That is waaay too slow to load, as I'm sure you've noticed already. You're looking for something that will cache data. Now, others have suggested using DBI's caching, and that's good, or some sort of memoization, and that's good, too. I'm suggesting a third method, and that is to write an object that will hide your data-retrieval methods from yourself.

The basic concept is this: you instantiate this object. Then, when you need some data, you ask it for that data, and only that data. It will then check to see if it has it. If it doesn't, then it will go out to the database, get the data, store it within itself, then give it to you. Now, if you ask for that data again (for whatever reason), you will get the data immediately. You don't store the data ... this object does.

This method immediately allows for three things:

- You get rid of all those nasty globals. Now, all you have is a file-scoped lexical (the object) that will handle all your data needs.
- You can request the same data over and over and not incur a performance penalty. This means that your logic flow is cleaner and clearer. Your routines are more loosely coupled. (This is a good thing, in case you're wondering.)
- If you have more than one script that uses these data structures and, because you will, you end up changing them, you only change stuff in one place! Think about that - maintenance is made 10x easier. I know I always like that.

Now, you're gonna say "Well, I wrote the object, so I'm storing the data. You're just making my life more complicated." My answer is simple - "No. You are the script that needs the data, or the general. The object is someone else, a quartermaster if you like. Even though the general puts the quartermaster in his position, he still has to requisition supplies through a known and agreed-upon method."

------
/me wants to be the brightest bulb in the chandelier!
Vote paco for President!
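A rough sketch of the quartermaster object described above. The class name `ConstantCache`, its `get` method, and the loader callbacks are hypothetical; in real code each loader would wrap a database or application-server call.

```perl
use strict;
use warnings;

package ConstantCache;    # hypothetical class name

# Takes a list of name => loader pairs; nothing is loaded yet.
sub new {
    my ($class, %loaders) = @_;
    return bless { loaders => {%loaders}, data => {} }, $class;
}

# Returns the named data set, loading and storing it on first request.
sub get {
    my ($self, $name) = @_;
    unless ( exists $self->{data}{$name} ) {
        my $loader = $self->{loaders}{$name}
            or die "No loader registered for '$name'\n";
        $self->{data}{$name} = $loader->();
    }
    return $self->{data}{$name};
}

package main;

my %hits;    # illustration only: counts loader invocations

# Stand-in loaders with made-up data.
my $constants = ConstantCache->new(
    zip_codes   => sub { $hits{zip_codes}++;   return [ '1000', '2000' ] },
    departments => sub { $hits{departments}++; return [ 'Sales', 'IT' ] },
);

my $zips = $constants->get('zip_codes');    # loaded now
$constants->get('zip_codes');               # served from the object

print "zip_codes loaded $hits{zip_codes} time(s)\n";
print "departments loaded ", ( $hits{departments} // 0 ), " time(s)\n";
```

Only the data sets that are actually requested get loaded, and the requesting script holds a single lexical (`$constants`) instead of one global per table.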
Re (tilly) 1: Why are you even bothering to do it that way?!? by tilly (Archbishop) on Aug 09, 2001 at 09:22 UTC
Someone who is aware of how to write tie implementations had better be aware of how to write an object. And someone who wanted to avoid tie for performance reasons is going to be unlikely to want to use an object in the same place since the majority of the slowness in tie is in the method lookup. Note that Perl 5.8 is supposed to do a lot to fix the issue, but current versions of Perl have a performance headache while running OO code. (Not that that is normally an important thing to factor into a decision about whether or not to use an OO design...)
The proof is in the pudding, my friends by dragonchild (Archbishop) on Aug 09, 2001 at 17:19 UTC
Someone who is afraid of the performance penalties for using the "best" algorithms is someone who, in my humble opinion, is suffering from premature optimization. Until you have the system fully up and running and have run benchmarks and heard user complaints, you cannot know that method A is too slow! All you have is theory and, you know what? The best theory and $3.29+tx will get you a grande cafe mocha.

------
/me wants to be the brightest bulb in the chandelier!
Vote paco for President!
Re (tilly) 1: The proof is in the pudding, my friends by tilly (Archbishop) on Aug 09, 2001 at 19:01 UTC
Re: Using tie to initialize large datastructures by mattr (Curate) on Aug 08, 2001 at 17:06 UTC
I don't understand the part about forgetting to call zip_init. But I tend to agree with tilly about using a separate function. It seems that Memoize's cache over a tied hash would work, or maybe you'd like to periodically dump your db into Tie-MmapArray files, which seems to resemble your request for an auto init function. You'd pick which pre-prepared file to tie in after getting user input, and voila, your memory is usable.