Re: Using tie to initialize large datastructures
by ariels (Curate) on Aug 08, 2001 at 10:56 UTC
|
You could tie a hash just as you described. This will give your structure a hash-like interface, which is either a pro or a con.
Another way would be to use Dominus' excellent Memoize module; I wrote a review of the module.
Using it could be as simple as writing a function get_data_structure_from_database, saying memoize 'get_data_structure_from_database', and continuing to use that function!
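A minimal sketch of that idea, assuming the core Memoize module is installed. The function body and the 'zip_codes' argument are hypothetical stand-ins for a real database lookup; only the memoize call is the point.

```perl
use strict;
use warnings;
use Memoize;

my $calls = 0;

# Hypothetical stand-in for an expensive, read-only database lookup.
sub get_data_structure_from_database {
    my ($table) = @_;
    $calls++;
    return "rows of $table";
}

# From here on, repeated calls with the same argument hit the cache.
memoize('get_data_structure_from_database');

for (1 .. 3) {
    my $data = get_data_structure_from_database('zip_codes');
}
print "lookups actually performed: $calls\n";
```

The callers do not change at all, which is the attraction: the cache lives behind the function name.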
|
|
I would like to add a second vote for use Memoize; your lookup functions are expensive and read-only, exactly what memoization is good for.
In addition to the behavior ariels mentions above, I'd like to point out that it is easy to store the memoized data persistently using tie, no muss, no fuss.
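use Memoize;
use DB_File;    # also makes the O_RDWR and O_CREAT flags available
# $filename holds the path of the on-disk cache file
tie my %cache => 'DB_File', $filename, O_RDWR|O_CREAT, 0666;
memoize 'function', SCALAR_CACHE => [HASH => \%cache];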
In addition, Memoize makes a good profiler; details are available in Dominus's article.
___
-DA
> perl -MPOSIX -e'$ENV{TZ}="US/Eastern";print ctime(10**9)'
Sat Sep 8 21:46:40 2001
Re: Using tie to initialize large datastructures
by Zaxo (Archbishop) on Aug 08, 2001 at 11:20 UTC
|
It would be helpful if you posted some code and example data. This sounds as if you read the entire database into memory. If so, that will certainly slow things down.
Are you using DBI.pm to read the data? A few suggestions:
1. Look into cached database connections.
2. Consider cached, prepared select statements with placeholders.
3. Look at what DBI's bind methods (bind_param, bind_columns) can do for you.
4. Get ruthless with globals; replace them with lexicals which only hold data you need.
5. Benchmark. No rule of thumb can replace actual performance measurements.
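As a rough illustration of the benchmarking suggestion, here is a sketch using the core Benchmark module. The two subs are hypothetical stand-ins (no database is involved); the idea is only to show how a cached and an uncached lookup can be compared on real numbers.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical stand-in for a slow lookup of a constant table.
sub fetch_zip_codes {
    my @codes = map { sprintf '%05d', $_ } 1 .. 2_000;
    return \@codes;
}

# Same data, but built only on the first call.
my $cached;
sub cached_zip_codes {
    $cached = fetch_zip_codes() unless $cached;
    return $cached;
}

# Run each variant for at least one CPU second and print a comparison.
cmpthese(-1, {
    uncached => sub { my $zip = fetch_zip_codes() },
    cached   => sub { my $zip = cached_zip_codes() },
});
```

The absolute rates will differ from machine to machine; what matters is measuring your own lookups rather than guessing.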
A tie class may be useful, but a few subs that return data for a given key are likely to work as well. Of course, all this is speculative and not necessarily useful.
After Compline, Zaxo
Update: Changed list to numbered format for reference.
Thanks for the extra info on your requirements. I'm sorry to admit that I'm unfamiliar with any of the mechanisms you cite (Apache::ASP, CORBA through the COPE modules). You don't appear to use either CGI.pm or DBI.pm. (Update2: I'm informed that htoug uses DBI.) You might try #5 'Benchmark' right away, to see where the resource hogs are. Given your security requirements, #4 is all the more important. As a design issue, I'd suggest starting from the user interface and seeing how few SQL statements you need to support it.
|
|
I'm definitely not trying to read the entire database into memory. That would take about 40GB, and we don't have that much available for each apache process!
The system is a 3-tier system, with an apache frontend (written using Apache::ASP, handling the formatting of data, session handling etc), a set of application servers (communicating with apache using CORBA through the COPE modules), and the database (about 40 GB of data in ~800 tables, the largest containing >130 million rows, accessed using DBI, DBD::Ingres {which I wrote} etc) - all on different machines. The database contains very sensitive data, so security is important.
We have some (about 50-100) tables that contain things like zip codes, department addresses, type codes, and so on ad nauseam. Some are small, some are big, others huge - it varies.
In the frontend code (on apache) we need, for example, to create selectboxes that let the user choose between different options, based on the content of the constant tables.
A possibility would be to fetch the data every time it is needed:
my $zip = $zip_server->get_zip_codes();
print "selectbox-header";
for (@$zip) {
    print "selectbox line";
}
print "selectbox-end";
or something like that.
This will take quite a while and soon you discover the need for caching the data. So you try something like:
# ...in common initialisation code...
our $zip;
$zip = $zip_server->get_zip_codes();
# ...where the zip-code is needed...
print "selectbox-header";
for (@$zip) {
    print "selectbox line";
}
print "selectbox-end";
This is fast, but it takes more and more memory as the number of constants rises. So the next version could be something like:
# ...in the common initialisation code...
our $zip;
sub zip_init {
    $zip = $zip_server->get_zip_codes() unless $zip;
}
# ...at every use...
zip_init();
print "selectbox-header";
for (@$zip) {
    print "selectbox line";
}
print "selectbox-end";
That is fast, easy and does not consume unnecessary amounts of memory. The downside is that you have to remember to call zip_init before you use $zip.
Sometimes you forget, and spend an excessive amount of time scratching your head and trying to fathom what went wrong.
So I would like something like:
# ...in initialisation section...
our $zip;
tie $zip .... # magic here
sub ZIP::TIE::FETCH {
    # smoke and mirrors here
    $zip = $zip_server->get_zip_codes();
    untie $zip; # and leave the data in $zip
}
# ...and where we use it...
print "selectbox-header";
for (@$zip) {
    print "selectbox line";
}
print "selectbox-end";
Note: no zip_init or fetch calls, just plain ordinary access to a variable.
At the first reference to the variable the tie magic kicks in, retrieves the data, and removes the magic, leaving the 'naked' variable.
This gives:
- no need to remember the initialisation incantations (we all forget things too often)
- no performance overhead
Did that clarify what I need?
|
|
I think you are trying to solve the wrong problem.
First of all, gratuitous globals are a sign of poor
design. I would use an access function, and (depending
on what made sense) I would have it memoize results.
Much cleaner design, and your issue never arises. Unless
your program is truly performance sensitive (the odds are
very low that it is), trying to optimize beforehand at
the expense of maintainability is a losing game.
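A minimal sketch of such an access function, assuming a hypothetical fetch_zip_codes_from_server() in place of the real $zip_server->get_zip_codes() call. The cached data lives in a lexical that only the accessor can see, so there is no global to forget to initialise.

```perl
use strict;
use warnings;

my $fetches = 0;

# Hypothetical stand-in for $zip_server->get_zip_codes().
sub fetch_zip_codes_from_server {
    $fetches++;
    return [ '10001', '60601', '94105' ];
}

{
    my $zip;    # private to the accessor; not a package global

    # Fetch on first use, then keep returning the same data.
    sub zip_codes {
        $zip = fetch_zip_codes_from_server() unless $zip;
        return $zip;
    }
}

print "selectbox-header\n";
print "selectbox line: $_\n" for @{ zip_codes() };
print "selectbox-end\n";

# A second use costs nothing extra.
my $count = @{ zip_codes() };
print "server was asked $fetches time(s) for $count codes\n";
```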
However the second issue is technical. In the middle of
calling an implementation of a tie, you don't have access
to information about the tie. A tie just replaces a
data structure with a wrapper around an object. But from
the point of view of the object call, it is just an
object call. You are not told what variable you are being
called with, and said variable may not even be in any
scope you can access. (Think about tying a lexical
variable.)
Now the technical issue I could find a hack around.
But the maintainability issue makes me really not want to...
Re: Using tie to initialize large datastructures
by clemburg (Curate) on Aug 08, 2001 at 17:18 UTC
|
Would something like this suit your needs (demo just for scalar variables)?
Approach: tie() builds up a mapping between tied objects
and fully qualified subroutine names. When the tied object
is asked for its value, we call the subroutine with the
name passed when tie()ing the object. This subroutine
caches the object's value.
Benefits:
- Existing code can stay the same, you just need to
tie() the global variables.
- Existing init routines can be reused.
- If you pass in a fully qualified name for a package global
or a reference to a lexical variable as an additional
argument to tie(), the tie to this
variable will be undone after first use. (Thanks to tilly
for pointing out this works with lexicals, too!)
Update: I see you called for something more -
untie()ing the object after first use. For lexical ("my")
variables, I don't currently see how to do this, since
we have no access to them inside the FETCH function
(Ah, we *can* have that - thanks tilly again - see above).
For true package globals, it's easy: just set the (in this
case scalar) entry of the glob in the package you call
the FETCH from to the new value and remove the tie().
Hm ... let's see if this works ... yup it does!
Demo code:
#!/usr/bin/perl -w
use strict;
# show how to tie scalars to existing init routines the lazy way
$| = 1;
# --------------------------------------------------
package LegacyRoutines;
use vars qw($AUTOLOAD);
sub foo {
    print __PACKAGE__ . "::foo magically called\n";
    return 42;
}
sub baz {
    print __PACKAGE__ . "::baz magically called\n";
    return "hooray";
}
# no bar routine here - catch errors
sub AUTOLOAD {
    "LegacyRoutines: undefined subroutine $AUTOLOAD called\n";
}
# --------------------------------------------------
package MyGlobals;
# global to map objects to associated init routine names
my %mappings;
# global to memorize package globals to initialize
my %vars;
sub TIESCALAR {
    my $class = shift;
    my ($name, $var) = @_;
    bless \ (my $self), $class;
    $mappings{\$self} = $name;
    $vars{\$self} = $var;
    return \$self;
}
sub FETCH {
    print __PACKAGE__ ."::FETCH called\n";
    # $_[0] - alias to original object ref we stored in %mappings
    my $value;
    if (not defined ${$_[0]}) {
        print "Initializing $_[0] ... \n";
        # check if we have an entry for that object
        if (not exists $mappings{$_[0]}) {
            print "No matching subroutine for ", $_[0], "\n";
            return $_[0];
        }
        # call to init routine associated with $self
        no strict 'refs';
        # set original value
        ${$_[0]} = &{ $mappings{$_[0]} }();
        # remember it
        $value = ${$_[0]};
        # untie the variable (package global name or reference), if one was given
        if (exists $vars{$_[0]}) {
            untie ${$vars{$_[0]}};
        }
        return $value;
    }
    return ${$_[0]};
}
sub STORE {
    # whatever you want
}
sub DESTROY {
    # whatever you want
}
# --------------------------------------------------
package main;
use vars qw($foo);
tie($foo, "MyGlobals", "LegacyRoutines::foo", "main::foo");
tie(my $bar, "MyGlobals", "LegacyRoutines::bar");
tie(my $baz1, "MyGlobals", "LegacyRoutines::baz");
tie(my $baz2, "MyGlobals", "LegacyRoutines::baz");
# $baz1 and $baz2 are separate ties that share the same init routine
print $foo, "\n";
print $foo, "\n";
print $bar, "\n";
print $baz1, "\n";
print $baz1, "\n";
print $baz2, "\n";
print $baz2, "\n";
Christian Lemburg
Brainbench MVP for Perl
http://www.brainbench.com
|
|
Write an initialization function that accepts two
arguments: references to the variable you wish to tie and
to the function you want to provide its initial value. The
initialization function then does a tie of the variable,
passing as one of the arguments a reference to the
variable you are tying. Now you have access inside the
FETCH routine to what the untie logic needs.
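A rough sketch of that recipe, under stated assumptions: LazyScalar and lazy_init are hypothetical names invented for this illustration, the init routine is a stand-in for the real data fetch, and no STORE method is provided, so it only suits read-only constants.

```perl
use strict;
use warnings;

package LazyScalar;    # hypothetical class name, not a CPAN module

sub TIESCALAR {
    my ($class, $var_ref, $init) = @_;
    return bless { var => $var_ref, init => $init }, $class;
}

sub FETCH {
    my ($self) = @_;
    my $value   = $self->{init}->();
    my $var_ref = $self->{var};
    {
        # $self still holds the tie object here, so silence the
        # "inner references" warning that untie would otherwise raise.
        no warnings 'untie';
        untie $$var_ref;
    }
    $$var_ref = $value;    # ordinary assignment now that the tie is gone
    return $value;
}

package main;

# The two-argument initialization function: variable ref, init code ref.
sub lazy_init {
    my ($var_ref, $init) = @_;
    tie $$var_ref, 'LazyScalar', $var_ref, $init;
    return;
}

my $fetches = 0;
my $zip;
lazy_init(\$zip, sub { $fetches++; return [ '10001', '60601' ] });

print "tied before first use? ", (tied($zip) ? "yes" : "no"), "\n";
print "zip codes: ", join(', ', @$zip), "\n";
print "tied after first use? ", (tied($zip) ? "yes" : "no"), "\n";
print "init routine ran $fetches time(s)\n";
```

Because the reference is passed in explicitly, the same code serves lexicals and package globals alike.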
Note, though, that I would avoid this solution. To me, using so many globals that initializing them all takes too much
memory is the real problem, and finding ways to
enable that mistake to be extended is worse than
fixing the mistake...
|
|
I am definitely with you on the point of using so many globals. I would never consider doing something like this
on many global variables just because of efficiency concerns. A design that needs such hacks is probably flawed.
OTOH, I found this to be an interesting problem with respect to tie() usage.
As for your suggestion - thanks! This really works. Funny.
You can even use the code like it stands. Just pass in
a ref to the lexical, and you're done. Like this:
my $baz3;
tie($baz3, "MyGlobals", "LegacyRoutines::baz", \$baz3);
print $baz3, "\n";
print $baz3, "\n";
Christian Lemburg
Brainbench MVP for Perl
http://www.brainbench.com
|
|
Why are you even bothering to do it that way?!?
by dragonchild (Archbishop) on Aug 08, 2001 at 18:30 UTC
|
I'm failing to see the compelling reason for designing your system in this fashion. I would look at doing something that has the following characteristics:
- Is encapsulated. You (the requesting script) do not know how the thing does what it does. All you care about is that it does what it promises to do, which is retrieve your data.
- Is fast. You want it to give you the data you request in a minimum amount of time.
- Is fast. You want it to load in a minimum amount of time.
- Is small. You want it to use the least amount of memory.
Sounds pretty tough, huh? Well, it's not. What you are looking for is not a datastructure, but an object.
YAY-US! You, too, can be a part of the O-O revolution, my friend! You can be HEE-ULLED of your pro-see-ju-rull ways!
What you're looking for is not something that loads all your data at once. That is waaay too slow to load, as I'm sure you've noticed already. You're looking for something that will cache data.
Now, others have suggested using DBI's caching, and that's good, or some sort of Memoize, and that's good, too. I'm suggesting a third method, and that is to write an object that will hide your data-retrieval methods from yourself.
The basic concept is this - you instantiate this object. Then, when you need some data, you ask it for that data, and only that data. It will then check to see if it has it. If it doesn't, then it will go out to the database, get the data, store it within itself, then give it to you. Now, if you ask for that data again (for whatever reason), you will get the data immediately. You don't store the data ... this object does.
This method immediately allows for three things:
- You get rid of all those nasty globals. Now, all you have is a file-scoped lexical (the object) that will handle all your data needs.
- You can request the same data over and over and not incur a performance penalty. This means that your logic flow is cleaner and clearer. Your routines are more loosely coupled. (This is a good thing, in case you're wondering.)
- If you have more than one script that uses these data structures and you end up changing them (because you will), you only change stuff in one place! Think about that - maintenance is made 10x easier. I know I always like that.
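The caching object described above might be sketched as follows. ConstantStore, its get method and the loader callback are all hypothetical names chosen for this illustration; the loader stands in for whatever actually talks to the application server or database.

```perl
use strict;
use warnings;

package ConstantStore;    # hypothetical class name

sub new {
    my ($class, %args) = @_;
    # 'loader' is an assumed callback: table name in, data out.
    return bless { loader => $args{loader}, cache => {} }, $class;
}

# Return the data for one table, loading it only on the first request.
sub get {
    my ($self, $table) = @_;
    $self->{cache}{$table} = $self->{loader}->($table)
        unless exists $self->{cache}{$table};
    return $self->{cache}{$table};
}

package main;

my %loads;    # counts how often each table was really loaded
my $store = ConstantStore->new(
    loader => sub {
        my ($table) = @_;
        $loads{$table}++;
        return [ "$table row 1", "$table row 2" ];
    },
);

my $zip = $store->get('zip_codes');
$zip = $store->get('zip_codes');    # served from the object's cache
my $types = $store->get('type_codes');

print "$_ loaded $loads{$_} time(s)\n" for sort keys %loads;
```

Only the tables a request actually asks for are ever loaded, and the calling script holds nothing but the one file-scoped lexical.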
Now, you're gonna say "Well, I wrote the object, so I'm storing the data. You're just making my life more complicated."
My answer is simple - "No. You are the script that needs the data, or the general. The object is someone else, a quartermaster if you like. Even though the general puts the quartermaster in his position, he still has to requisition supplies through a known and agreed-upon method."
------ /me wants to be the brightest bulb in the chandelier!
Vote paco for President!
Re: Using tie to initialize large datastructures
by mattr (Curate) on Aug 08, 2001 at 17:06 UTC
|
I don't understand the part about forgetting to call
zip_init. But I tend to agree with tilly about using a
separate function.
It seems that Memoize's cache over a tied hash would work,
or maybe you'd like to periodically dump your db
into Tie-MmapArray files,
which seems to resemble your request for an auto init function.
You'd pick which pre-prepared file to tie in after getting
user input, and voila, your memory is usable.