I feel like I may be asking for the impossible, but seeing as there are many programmers here of superior ability, I am hoping there are significant improvements that I would never have thought of. (And speed is a rare need for me, as most of my scripts are one-time-use or quick enough that efficiency is of minor concern.)

I'm searching what amounts to a table of 30,000+ rows with one to five "columns." In Perl, each of these "columns" is an array (list), and every array has the same number of rows/items. The search applies a regular expression to each column in each row, and the expression may differ from column to column.

For example, suppose we have a table like this:

| Main | Comp. 1 | Comp. 2 | Comp. 3 | Comp. 4 |
|------|---------|---------|---------|---------|
| The red fox jumped over the hollow log. | The vixen jumped over the brown log. | The red fox leaped over the hollow log. | The gray fox jumped over the big log. | The lazy fox did not jump over the log. |
| The tall tree towered over the animals. | The tree swayed smartly in the field. | The animals looked up to the tall tree. | The tall tree grew in the field. | The towering tree shaded the animals. |
| The tawny deer jumped over the fence. | The doe jumped over the barrier. | The brown doe leaped over the picket fence. | The red deer cleared the tall fence. | The doe-eyed doe sprang over wall. |

And queries like this:

Main:     Match: (hollow log)|(fence)
Comp. 1:  Match: (jumped)
Comp. 2:  NOT match: (vixen)|(animals)
Comp. 3:  NOT match: (red fox)
Comp. 4:  Match: (jump)|(leap)|(sprang)

This is just an English example to help understand the situation. The actual "columns" may represent different languages, i.e. translations, of the same thing, and each language/column will be searched with its own regular expression.

The data coming in to the subroutine includes, for each of the five columns, the regex to use (if a regex is empty, that column needs no handling), the array holding that column (skipped/empty if no regex was provided for it), and a boolean value saying whether the regex should match OR NOT match.

The expected results from the query would be the values of the first column for rows 1 and 3. The central row would not match. Only the first column's values get returned, the other columns are merely for comparison purposes--looking for similarities or contrasts to the "original" (main) column.
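To make the expected result concrete, here is a minimal, self-contained sketch that runs the example queries against the example table. The variable names (`@columns`, `@queries`, `@hits`) are made up for illustration and are not taken from the real code; the data and patterns are the ones from the tables above.

```perl
use strict;
use warnings;

# The example table, one array per column (Main, Comp. 1 .. Comp. 4).
my @columns = (
    [ 'The red fox jumped over the hollow log.',
      'The tall tree towered over the animals.',
      'The tawny deer jumped over the fence.' ],
    [ 'The vixen jumped over the brown log.',
      'The tree swayed smartly in the field.',
      'The doe jumped over the barrier.' ],
    [ 'The red fox leaped over the hollow log.',
      'The animals looked up to the tall tree.',
      'The brown doe leaped over the picket fence.' ],
    [ 'The gray fox jumped over the big log.',
      'The tall tree grew in the field.',
      'The red deer cleared the tall fence.' ],
    [ 'The lazy fox did not jump over the log.',
      'The towering tree shaded the animals.',
      'The doe-eyed doe sprang over wall.' ],
);

# One [regex, want_match] pair per column, in column order.
my @queries = (
    [ qr/(hollow log)|(fence)/,   1 ],    # Main:    Match
    [ qr/(jumped)/,               1 ],    # Comp. 1: Match
    [ qr/(vixen)|(animals)/,      0 ],    # Comp. 2: NOT match
    [ qr/(red fox)/,              0 ],    # Comp. 3: NOT match
    [ qr/(jump)|(leap)|(sprang)/, 1 ],    # Comp. 4: Match
);

my @hits;
ROW: for my $row ( 0 .. $#{ $columns[0] } ) {
    for my $col ( 0 .. $#queries ) {
        my ( $re, $want ) = @{ $queries[$col] };
        my $matched = $columns[$col][$row] =~ $re ? 1 : 0;
        next ROW if $matched != $want;    # one failed column rejects the row
    }
    push @hits, $columns[0][$row];        # only the main column is returned
}

print "$_\n" for @hits;
```

This prints the main-column sentences of rows 1 and 3 (the red fox and the tawny deer); the middle row is rejected as soon as its main column fails to match.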

Here are the important bits of my code (abridged for better focus and readability):

sub processComparison {

    # INCOMING COMPARISON-COLUMN DETAILS
    my (
        $table,              # MAIN COLUMN NAME
        $ACCP_searchver1,    # COMPARISON COLUMN NAMES/SELECTIONS
        $ACCP_searchver2,
        $ACCP_searchver3,
        $ACCP_searchver4,
        $ACCP_comp1,         # ORIGINAL USER-ENTERED QUERY
        $ACCP_comp2,
        $ACCP_comp3,
        $ACCP_comp4,
        $ACCP1_regex,        # USER SELECTION FOR REGEX vs. SIMPLE MATCH
        $ACCP2_regex,
        $ACCP3_regex,
        $ACCP4_regex,
        $accpyn1,            # USER SELECTION FOR MATCH/NO MATCH
        $accpyn2,
        $accpyn3,
        $accpyn4
    ) = @_;    # INCOMING ARRAY

    my $regex1 = &composeRegex($ACCP1_regex,$ACCP_comp1);
    my $regex2 = &composeRegex($ACCP2_regex,$ACCP_comp2);
    my $regex3 = &composeRegex($ACCP3_regex,$ACCP_comp3);
    my $regex4 = &composeRegex($ACCP4_regex,$ACCP_comp4);

    my @main = &getTableFC($table);

    # "my ... if COND" HAS UNDEFINED BEHAVIOR IN PERL, SO USE A TERNARY
    my @crosscheck1 = $regex1 ? &getTableFC($ACCP_searchver1) : ();
    my @crosscheck2 = $regex2 ? &getTableFC($ACCP_searchver2) : ();
    my @crosscheck3 = $regex3 ? &getTableFC($ACCP_searchver3) : ();
    my @crosscheck4 = $regex4 ? &getTableFC($ACCP_searchver4) : ();

    my $linecount=0;
    my ($line, $line1, $line2, $line3, $line4) = ('','','','','');

    foreach $line ( @main ) {
        $line1 = $crosscheck1[$linecount];
        $line2 = $crosscheck2[$linecount];
        $line3 = $crosscheck3[$linecount];
        $line4 = $crosscheck4[$linecount];

        # TRIM BOTH ENDS (WITHOUT /g ONLY ONE END IS TRIMMED)
        $line  =~ s/^\s+|\s+$//g; chomp $line;
        $line1 =~ s/^\s+|\s+$//g; chomp $line1;
        $line2 =~ s/^\s+|\s+$//g; chomp $line2;
        $line3 =~ s/^\s+|\s+$//g; chomp $line3;
        $line4 =~ s/^\s+|\s+$//g; chomp $line4;

        # USING THESE TO TALLY MATCHES
        # (0 = COLUMN NOT TESTED, 1 = TESTED AND FAILED, 2 = TESTED AND PASSED)
        my ($r1,$r2,$r3,$r4) = (0,0,0,0);

        # CHECK REGEX MATCHES FOR COMPARISON COLUMNS
        if ($regex1) {
            $r1++;
            if ($accpyn1) { if ($line1 =~ m/$regex1/) { $r1++ } }
            else          { if ($line1 !~ m/$regex1/) { $r1++ } }
        }
        if (($regex2) && ($r1!=1)) {
            $r2++;
            if ($accpyn2) { if ($line2 =~ m/$regex2/) { $r2++ } }
            else          { if ($line2 !~ m/$regex2/) { $r2++ } }
        }
        if (($regex3)&&($r1!=1)&&($r2!=1)) {
            $r3++;
            if ($accpyn3) { if ($line3 =~ m/$regex3/) { $r3++ } }
            else          { if ($line3 !~ m/$regex3/) { $r3++ } }
        }
        if (($regex4)&&($r1!=1)&&($r2!=1)&&($r3!=1)) {
            $r4++;
            if ($accpyn4) { if ($line4 =~ m/$regex4/) { $r4++ } }
            else          { if ($line4 !~ m/$regex4/) { $r4++ } }
        }

        # LINE UP MATCH RESULTS TALLIES
        if ( ($r1!=1) && ($r2!=1) && ($r3!=1) && ($r4!=1) ) {
            #PASSED COMPARISON FILTER
            #DO CODE TO FORMAT & RETURN $line FOR MAIN COLUMN
        }

        $linecount++;    # KEEP THE COMPARISON COLUMNS ON THE SAME ROW AS @main
    } # end foreach $line

} #END SUB processComparison

As you may notice, I have tried to skip checks that are no longer necessary. If any single column fails its test, the entire row fails, so there is no need to test the remaining columns. I have also pre-established my regular expressions, though I am not sure whether this can be improved. The expression used for each column remains the same for all rows in the table.
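One way the "pre-established regex plus early exit" idea could be expressed is sketched below, purely as an illustration. It compiles each query once with `qr//`, keeps only the columns that actually have a query, and abandons a row at the first failing column. The helpers `compile_query` and `filter_rows`, the hash keys, and the sample data are all hypothetical; `compile_query` is a stand-in and makes no claim about what the real `composeRegex` does.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a query compiler: returns a compiled regex,
# or undef when the user left the query empty.
sub compile_query {
    my ( $text, $is_regex ) = @_;
    return unless defined $text && length $text;
    # Simple-match mode escapes metacharacters with \Q...\E.
    return $is_regex ? qr/$text/ : qr/\Q$text\E/;
}

# $main is the main column; $checks holds only the active columns.
sub filter_rows {
    my ( $main, $checks ) = @_;
    my @passed;
    ROW: for my $i ( 0 .. $#$main ) {
        for my $check (@$checks) {
            my $text = $check->{rows}[$i];
            $text = '' unless defined $text;
            my $matched = $text =~ $check->{re} ? 1 : 0;
            next ROW if $matched != $check->{want};    # first failure ends the row
        }
        push @passed, $main->[$i];
    }
    return @passed;
}

# Made-up sample data: three rows, three comparison columns.
my @main  = ( 'row one', 'row two', 'row three' );
my @comp1 = ( 'a.b',     'axb',     'a.b again' );
my @comp2 = ( 'x',       'y',       'z' );
my @comp3 = ( 'unused',  'unused',  'unused' );

my @user_queries = (
    { text => 'a.b',  is_regex => 0, want => 1, rows => \@comp1 },
    { text => '^z$',  is_regex => 1, want => 0, rows => \@comp2 },
    { text => '',     is_regex => 0, want => 1, rows => \@comp3 },    # empty: skipped
);

# Compile once, before the row loop, and drop columns without a query.
my @checks;
for my $q (@user_queries) {
    my $re = compile_query( $q->{text}, $q->{is_regex} );
    next unless defined $re;
    push @checks, { re => $re, want => $q->{want}, rows => $q->{rows} };
}

my @passed = filter_rows( \@main, \@checks );
print "$_\n" for @passed;
```

With this sample data only "row one" survives: row two fails the literal `a.b` test, and row three is rejected by the NOT-match on `^z$`. Because unused columns never enter `@checks`, the inner loop does no per-row work for them.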

At present, the average response time with a regular expression of moderate complexity is between 45 and 65 seconds. Timing the various segments shows that this is mostly (98%) concentrated in the regex-matching section. For a strict "match"/"no match" on a simple word, without any regex alternations or other complexities, I have seen as little as 11 or so seconds. Even that is longer than I would like, since the results are returned to the client's browser.
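Since the time is concentrated in the matching itself, it may be worth measuring how much the form of the pattern matters. The sketch below uses the core `Benchmark` module to compare a capturing alternation, the same alternation without capture groups, and a plain `index` lookup for the simple-word case. The 30,000 rows are synthetic, the helper names are made up, and the relative timings will vary with the Perl version and the real data, so treat this as a way to test the idea rather than a predicted result.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic stand-in for one 30,000-row column.
my @rows = map { "The red fox number $_ jumped over the hollow log." } 1 .. 30_000;

my $capturing     = qr/(hollow log)|(fence)/;    # as in the example query
my $non_capturing = qr/hollow log|fence/;        # same alternatives, no captures

# Count rows matching a precompiled regex.
sub count_regex {
    my ($re) = @_;
    my $n = 0;
    for my $row (@rows) { $n++ if $row =~ $re }
    return $n;
}

# Count rows containing a literal substring, with no regex engine involved.
sub count_index {
    my ($word) = @_;
    my $n = 0;
    for my $row (@rows) { $n++ if index( $row, $word ) >= 0 }
    return $n;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, {
    capturing     => sub { count_regex($capturing) },
    non_capturing => sub { count_regex($non_capturing) },
    index_literal => sub { count_index('hollow log') },
} );
```

If the captured groups are never used afterwards, dropping the parentheses changes nothing about which rows match, so any speed difference the benchmark shows comes for free.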

Can this be streamlined any more?



In reply to Efficient regex search on array table by Polyglot
