comment on

I suspect that the $dbh->fetchrow_array() statements may be causing you issue, also you had a very significant syntax error that would have led to $h_title_counter ALWAYS being set to 1. Vis:

my @array = ( 'single result value' );
my $wrong = @array;     # SCALAR CONTEXT
my ($right) = @array;   # ARRAY CONTEXT
print "Wrong: $wrong\nRight: $right\n";
__DATA__
Wrong: 1
Right: single result value
[download]

I have cleaned up the code a little, moved stuff out of the loop that does not need to be there and added a bind case and dropped the $line declaration and just use $_ as you only use it once, and Perl has lazy GC. $jump will always equal $. (line count - so that is one less var)

  # we don't need this in the loop as we know when we increment (unles
+s forking)
  # also syntax IS WRONG in original - need array context on LHS to ge
+t val
  # otherwise you are always going to get the val 1
  ($h_title_counter) = $dbh->selectrow_array("SELECT MAX(id) FROM head
+_index");

  # this should be a sth with bind not $dbh->selectrow_array every tim
+e
  my $sthX = $dbh->prepare("SELECT 1 FROM e_match_word WHERE word = ?"
+);

  LONG:
    while (<READ>) {
          next if $. <= $rows2jump;
            $counter2++; #number of lines processed           
            #Lets deal with the sequences head here, marked by > so th
+is will
            # only exec if we have that marker
            if ( s/^>// ) {
                $sth1->execute($_) or print "$DBI::errstr";
                $h_title_counter++;
              next LONG;
            }

            #Here we will deal with the actal sequence
            chop; #To remove the > at the end of the line
            #Input stuff in mysql
             for my $x (0 ... (length($line) - ($word_length)-1)) {
                my $word = substr($_, $x, $word_length);
                #Now we have it with a big cache, hope to speed things
+ up
                if ($most_used_words{$word}) {
                    #UPDATE e_match_word set counter = counter+1 WHERE
+ word = ?
                    $sth3->execute($word) or print "$DBI::errstr";
                    #print "cached result\n";
                    
                } elsif ( $sthX->execute($word)) {
                    #UPDATE e_match_word set counter = counter+1 WHERE
+ word = ?
                    $sth3->execute($word) or print "$DBI::errstr";
                    #print "not cached result\n";
                    
                } else {                    
                    #INSERT INTO e_match_word VALUES (?,?)
                    $sth2->execute($word, 1) or print "$DBI::errstr";
                    $wordcounter++;
                    #print "new word\n";                    
                }           
                #INSERT INTO e_match_info values (?,?,?)
                $sth4->execute($word, $h_title_counter, $x) or print "
+$DBI::errstr";
            }            
            #$percent_done = int(($counter2/$counter) * 100);
            # you are setting status but never use it......
            #$status = "Searching for exact patterns.. processing line
+ $counter2 of $counter"; 
            $mw->update();
            # $mw->update;      <--- duplicate
            # exit if ($stop);  <--- never set in loop
    }
    close (READ);
[download]

Now the other issue you have is runtime. If you are say using MySQL with default settings you may error out with a MySQL server has gone away error unless you are using all you prepared STHs regularly ie within timeout (8 hours default). Probably not an issue but it would be a bit annoying if this died 6.9 days into a run.

Finally I can save you another database query. You can do this (pseudo code)

# currently
if ( DB query word exists in table ) {
    DB do update counter
}
else
    DB insert word with counter val of 1
}

# this will always use 2 DB queries per iter, whereas this code.....

unless ( execute(UPDATE table SET counter = counter + 1 WHERE word = ?
+) ) {
    # this should only fail if the bound word does not exist so 
    # know we know we really needed an INSERT with default value (1)
    # but we have saved 1 DB query - IN A LOOP
    execute(INSERT INTO table VALUES ( word, 1 ) )
}

# will only use one query if it is an update (probably the most common
+ case)
# and still only uses 2 for an insert.
# this can yield savings -> 50% depending on Update:Insert ratio
[download]

The logic is that if we need the insert the update will fail. If not the update succeeds. Thus we save directly verifying if 'word' is in the table or not.

cheers

tachyon

In reply to Re: Memory Overflow by tachyon
in thread Memory Overflow by cav

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.