in reply to Of large database tables and high memory usage.

How many results is 'a lot'? 600k? 80 million?

What does your table look like? How many cols and what types?

I find it really strange that you bring up memory as the issue; with the kind of problem you mention, I've always run into CPU as the bottleneck, and memory has been a joke. My 'high' row counts are around 12,400,855, which may be nothing compared to yours. Dunno.

What I do with high-intensity queries is kind of what you mention: I abstract it away. I only fetch x results at a time, let the API hand them out one by one, and when none are left I fetch the next batch.

I want to show you one example.

The following example caches a query's results in an object instance and hands them out one at a time.
# keeps a list fed in object
sub get_next_indexpending {
   my $self = shift;

   unless ( defined $self->{pending_queue} and scalar @{$self->{pending_queue}} ){

      # this operation can be expensive. SO, i get 50 at a time, and cache it in the object
      # as the API, it seems like you just keep asking for the next one
      # we do not actually query the db for the next one, because that would be EXCRUCIATINGLY SLOW
      # even asking for many more, could be slow
      # i've fiddled around with maybe 3 or 4 ways of doing this operation, this works well
      # it's been debugged a lot, there have been MANY bugs doing this, so DONT FUCK WITH IT :-)

      # multiple indexers *can* have the same list- that's ok, because only one will lock
      # the funny thing is if you select 50 at a time, and you have 51 indexers.. then what????
      # I THINK THERE IS A RACE CONDITION HERE
      # I think there should be a formula for :
      # ( how many indexers are running * NUMBER ) = LIMIT
      my $LIMIT = 50;

      # we could be querying a LOT ? It seems that would be wasteful to all hell.
      # maybe the select can be random or ordered in different ways, alternating ????

      debug("pending queue is empty.. ");

      # make sure it's defined
      $self->{pending_queue} = [];

      if ( defined $self->{gpd_stopflag} and $self->{gpd_stopflag} ){
         debug("stopflag was raised, no more in pending. Will prepare and execute..");
         return; # so we return undef.
      }

      debug("will refeed next $LIMIT");

      # this is a hack replacement since i cant prepare with passing offset
      my $gpd = $self->dbh_sth( # can not figure out how to pass offset to a prepped query
         'SELECT abs_path, md5sum FROM files WHERE NOT EXISTS'.
         '(SELECT id FROM md5sum WHERE md5sum.md5sum = files.md5sum LIMIT 1)'.
         "GROUP BY md5sum LIMIT $LIMIT"
      );

      # i realized getting first 50 or first whatever.. IS ALWAYS VALID
      # Because if it is already been indexed by another indexer.. that operation is committed
      # and subsequent selects to next pending list.. will no longer return that file as a result
      # SO, dont use an incrementing offset. seems like it made sense.. but NO.

      $gpd->execute;
      debug("ok.\nWill iterate through results..");

      while (my @row = $gpd->fetchrow_array){ # WAS USING for!!!
         # debug("into queue [@row])");
         push @{$self->{pending_queue}}, \@row;
      }

      debug(sprintf "got [%s]\n", scalar @{$self->{pending_queue}});

      # how about.. if count is less then 50, turn on a stop flag so we dont keep requesting pending.. ???
      if ( scalar @{$self->{pending_queue}} < 50 ){
         $self->{gpd_stopflag} = 1;
         debug( sprintf "got less then 50 (got %s), turning on stop flag\n", scalar @{$self->{pending_queue}});
      }

      scalar @{$self->{pending_queue}} or warn("no more pending files found");
   }

   my $a = shift @{$self->{pending_queue}};
   defined $a or return;

   my ($abs_path,$md5sum) = @$a;
   debug("returning abs path, $md5sum\n");

   $abs_path or die("missing abs path");
   $md5sum or die("missing md5sum");

   return ($abs_path,$md5sum);
}

=head3 get_next_indexpending()

no argument
returns abs_path, md5sum string for next file in queue
you should attempt to lock afterwards

beacuse of the nature of indexing, it can take a long time, and we may be running
multiple indexers, so attempting to lock is needed

if none in pending, returns undef

everytime you call get_next_indexpending, it returns a different file

   while( my ($abs_path,$md5sum) = $self->get_next_indexpending ){
      # lock or next
   }

The md5sum string is the md5 hex sum for the file data at the time the files table was updated
you should check it again on disk so you know it has not changed in the meantime,
and also, if you are remote indexing to make sure the data was not corrupted in transit

This sub DOES return either those two values OR undef.

=cut

This is straight from the source, so there's some seemingly irrelevant code.
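For the core idea without the application-specific plumbing, here is a stripped-down sketch of the same batching pattern. The package name and the queue/done fields are made up for the example, and plain DBI calls stand in for the dbh_sth()/debug() helpers in the real code; the SQL is the same shape as above.

package PendingQueue;
use strict;
use warnings;

sub new {
    my ($class, $dbh) = @_;
    return bless { dbh => $dbh, queue => [], done => 0 }, $class;
}

sub next_pending {
    my $self = shift;

    unless ( @{ $self->{queue} } ) {
        return if $self->{done};    # last batch came up short, nothing left

        my $limit = 50;             # batch size; tune for how many workers run
        my $sth   = $self->{dbh}->prepare(
            'SELECT abs_path, md5sum FROM files WHERE NOT EXISTS '
          . '(SELECT id FROM md5sum WHERE md5sum.md5sum = files.md5sum LIMIT 1) '
          . "GROUP BY md5sum LIMIT $limit"
        );
        $sth->execute;

        # cache one batch in the object; callers just keep asking for the next row
        while ( my @row = $sth->fetchrow_array ) {
            push @{ $self->{queue} }, \@row;
        }

        # a short batch means the pending set is exhausted; stop re-querying
        $self->{done} = 1 if @{ $self->{queue} } < $limit;
    }

    my $row = shift @{ $self->{queue} } or return;
    return @$row;    # (abs_path, md5sum)
}

1;

Usage is the same as in the POD above: keep calling next_pending() until it returns nothing, and attempt to lock each file before indexing it.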

My point is that yes, you are insane. Thinking you will solve this problem across the board, for all kinds of situations, is nuts. If you could do that, you would put all the database people out of work. There's a reason why 'you should possibly not be asking this on perlmonks' and should be asking on the MySQL site instead: what you are trying to do with MySQL is a world unto itself! I used to think I would code Perl and then just do a teeny bit of SQL lookup on the side. No! What database you use, your column types, whether you have an index, whether you normalize: this makes or breaks your sh1+ entirely.

I think if you try to 'abstract the problem away' in some API, you'll waste a ton of time and come up with a wonderful set of solutions for less general situations than you would like. But.. shucks, you might solve the whole problem too.

Re^2: Of large database tables and high memory usage.
by mseabrook (Beadle) on Jul 30, 2007 at 23:01 UTC

    Given what I'm seeing, the tables don't have to be that large, in terms of record count, to have a large impact on memory usage.

    To give you an example, I have one table with 2,048,812 records. A decent amount, but not other-worldly.

    I don't have a wonderful grasp on how Linux uses or reports memory, so I'll show you what top is showing me when I select from that table.

    This is with a fresh perl instance:

      PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
    14069 mseabroo 17  0  9336 5540 1816 S  0.0  2.1 0:05.68 iperl
    And after selecting all records from the table with a simple "prepare", "execute":
      PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
    14069 mseabroo 16  0  250m  159m  844 S  0.0 61.1 0:11.22 iperl

    This is immediately after calling execute().

    It should be noted that I'm selecting all columns here, but collectively they don't amount to much: a couple of fixed-width chars, a few integers, a few doubles, and a datetime.
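    Some back-of-the-envelope arithmetic on those numbers: resident memory grew from roughly 5.5 MB to 159 MB, so about 153 MB for 2,048,812 rows, or on the order of 75-80 bytes per row. For that handful of small columns plus per-row bookkeeping in a fully buffered result set, that seems roughly plausible, which points at the client-side buffering rather than at any one column.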

    In any case, it appears MySQL prefers to put the burden of processing on the client, which is why it hands the entire result set over to my script. This behavior is the default, but it is optional; that said, I don't yet know whether it will be feasible to turn it off.
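    If the option in question is what I think it is, DBD::mysql exposes it as mysql_use_result: with it turned on, the driver streams rows from the server as you fetch them instead of buffering the entire result set in the client when execute() is called. A minimal sketch, assuming a plain DBI connection (the connection details and table name are placeholders):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=test;host=localhost',
                           'user', 'password', { RaiseError => 1 });

    # mysql_use_result makes DBD::mysql call mysql_use_result() instead of
    # mysql_store_result(), so rows are pulled from the server one at a time
    # as fetchrow_array() is called, rather than all at execute() time.
    my $sth = $dbh->prepare('SELECT * FROM big_table',
                            { mysql_use_result => 1 });
    $sth->execute;

    while ( my @row = $sth->fetchrow_array ) {
        # process one row at a time; client memory stays roughly flat
    }
    $sth->finish;
    $dbh->disconnect;

    The trade-off, as I understand it, is that the server keeps the result set open until every row has been fetched (or finish() is called), so the connection is tied up and locks can be held longer than with the default store-everything behavior.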