How many results is 'a lot'? 600k? 80 million?
What does your table look like? How many columns, and what types?
I find it really strange that you bring up memory as the issue; with the kind of problem you mention, I have always hit a CPU bottleneck, and memory has been a joke. My 'high' row counts are around 12,400,855 - that may be nothing compared to yours. Dunno.
What I do with high-intensity queries is kind of what you mention: I abstract it away. I only fetch x results at a time, let the API hand them out one at a time, and when there are no more, serve undef.
I want to show you one example. The following keeps a query's results cached in an object instance.
# keeps a queue of pending rows cached in the object
sub get_next_indexpending {
   my $self = shift;

   unless( defined $self->{pending_queue} and scalar @{$self->{pending_queue}} ){
      # this operation can be expensive. SO, i get 50 at a time, and cache it in the object
      # as the API, it seems like you just keep asking for the next one
      # we do not actually query the db for the next one, because that would be EXCRUCIATINGLY SLOW
      # even asking for many more could be slow
      # i've fiddled around with maybe 3 or 4 ways of doing this operation, this works well
      # it's been debugged a lot, there have been MANY bugs doing this, so DONT FUCK WITH IT :-)
      # multiple indexers *can* have the same list - that's ok, because only one will lock
      # the funny thing is if you select 50 at a time, and you have 51 indexers.. then what????
      # I THINK THERE IS A RACE CONDITION HERE
      # I think there should be a formula for:
      #    ( how many indexers are running * NUMBER ) = LIMIT
      my $LIMIT = 50;
      # we could be querying a LOT? It seems that would be wasteful to all hell.
      # maybe the select can be random or ordered in different ways, alternating????

      debug("pending queue is empty.. ");

      # make sure it's defined
      $self->{pending_queue} = [];

      if (defined $self->{gpd_stopflag} and $self->{gpd_stopflag} ){
         debug("stopflag was raised, no more in pending.");
         return; # so we return undef.
      }

      debug("will refeed next $LIMIT");

      # this is a hack replacement since i can't prepare with a passed offset
      my $gpd = $self->dbh_sth( # can not figure out how to pass offset to a prepped query
         'SELECT abs_path, md5sum FROM files WHERE NOT EXISTS '.
         '(SELECT id FROM md5sum WHERE md5sum.md5sum = files.md5sum LIMIT 1) '.
         "GROUP BY md5sum LIMIT $LIMIT"
      );
      # i realized getting the first 50, or first whatever, IS ALWAYS VALID
      # because if a file has already been indexed by another indexer, that operation is committed,
      # and subsequent selects for the next pending list will no longer return that file as a result
      # SO, don't use an incrementing offset. it seemed like it made sense.. but NO.
      $gpd->execute;

      debug("ok.\nWill iterate through results..");
      while (my @row = $gpd->fetchrow_array){ # WAS USING for!!!
         # debug("into queue [@row]");
         push @{$self->{pending_queue}}, \@row;
      }
      debug(sprintf "got [%s]\n", scalar @{$self->{pending_queue}});

      # if count is less than $LIMIT, turn on a stop flag so we don't keep requesting pending
      if (scalar @{$self->{pending_queue}} < $LIMIT ){
         $self->{gpd_stopflag} = 1;
         debug( sprintf "got less than $LIMIT (got %s), turning on stop flag\n",
            scalar @{$self->{pending_queue}} );
      }

      scalar @{$self->{pending_queue}} or warn("no more pending files found");
   }

   my $next = shift @{$self->{pending_queue}};
   defined $next or return;
   my ($abs_path, $md5sum) = @$next;
   debug("returning $abs_path, $md5sum\n");
   $abs_path or die("missing abs path");
   $md5sum or die("missing md5sum");
   return ($abs_path, $md5sum);
}
=head3 get_next_indexpending()

No argument.

Returns abs_path and md5sum string for the next file in the queue.

You should attempt to lock afterwards. Because of the nature of indexing, it can take a long time, and we may be running multiple indexers, so attempting to lock is needed.

If none are pending, returns undef.

Every time you call get_next_indexpending, it returns a different file.

   while( my ($abs_path,$md5sum) = $self->get_next_indexpending ){
      # lock or next
   }

The md5sum string is the md5 hex sum for the file data at the time the files table was updated. You should check it again on disk so you know it has not changed in the meantime, and also, if you are remote indexing, to make sure the data was not corrupted in transit.

This sub DOES return either those two values OR undef.

=cut
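About that 'attempt to lock' step: one way to make it atomic in MySQL, so that when two indexers pull the same file from their cached batches only one wins, is a conditional UPDATE plus a check of the affected row count. This is only a sketch under made-up assumptions - the nullable files.indexer_pid column is hypothetical, not part of the schema shown above:

   use strict;
   use warnings;

   # Hypothetical atomic claim. Assumes a nullable files.indexer_pid
   # column (NOT in the real schema - invented for illustration).
   # The UPDATE either claims the row for this process, or matches
   # zero rows because another indexer already claimed it.
   sub try_lock {
      my ($dbh, $abs_path) = @_;
      my $rows = $dbh->do(
         'UPDATE files SET indexer_pid = ? '.
         'WHERE abs_path = ? AND indexer_pid IS NULL',
         undef, $$, $abs_path,
      );
      # $dbh->do returns '0E0' (true, but numerically zero) when no row matched
      return $rows && $rows > 0;
   }

With something like that, the loop in the POD becomes try_lock($dbh, $abs_path) or next; and the race condition flagged in the comments stops mattering - 51 indexers can share the same 50-row batch, and each row still gets indexed exactly once.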
That get_next_indexpending sub is straight from the source, so there's some seemingly irrelevant code.
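Also, since the POD says to re-check the md5 on disk before trusting the stored one, that check is short with the core Digest::MD5 module. A minimal sketch - the sub name is my own:

   use strict;
   use warnings;
   use Digest::MD5;

   # Re-hash the file on disk and compare against the md5sum the
   # database handed back, per the POD's advice above.
   sub md5_still_matches {
      my ($abs_path, $db_md5sum) = @_;
      open my $fh, '<', $abs_path
         or do { warn "can't open $abs_path: $!"; return 0 };
      binmode $fh;
      my $on_disk = Digest::MD5->new->addfile($fh)->hexdigest;
      close $fh;
      return lc($on_disk) eq lc($db_md5sum);
   }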
My point is that yes, you are insane. Thinking you will solve this problem across the board for all kinds of situations is.. nuts. If you could do that, you would put all database people out of work. There's a reason why 'you should possibly not be asking this on perlmonks', and on the MySQL site instead: what you are trying to do with mysql is a world unto itself!!! I used to think I would code perl, then just.. you know.. do a teeny bit of sql lookup on the side.. NO! What database you use, your column types, whether you have an index, whether you normalize.. this makes or breaks your sh1+ entirely.
I think if you try to 'abstract the problem away' in some API, you'll waste a ton of time and come up with a wonderful set of solutions for less general situations than you would like. But.. shucks, you might solve the whole problem too.