Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things

Re: Efficient regex search on array table

by kcott (Archbishop)
on Dec 16, 2022 at 09:10 UTC ( [id://11148917] : note . print w/replies, xml ) Need Help??

in reply to Efficient regex search on array table

"Can this be streamlined any more?"

Short answer: certainly. Longer answer: read on ...

Firstly, two things you should stop doing:

  • The syntax for an if statement has no terminal semicolon; an if statement modifier may have a terminal semicolon. So, write "if (CONDITION) BLOCK" or "... if CONDITION;". The statement may be extended with else and elsif blocks (which don't have terminal semicolons either). See perlsyn for full details.
  • Do not call subroutines with a leading '&' unless you know exactly what it does and why you are doing it. A leading '&' was the normal way to call subroutines in Perl4 (and earlier); it is not normal for Perl5. See perlsub for more on this.

When you find yourself writing $var1, $var2, ..., $varN, you're probably doing something wrong, or at least doing something suboptimal which can be greatly improved: often with arrays or hashes. What happens when you want to add a new language, or discontinue supporting an existing one: as it stands, you'll have to engage in a major rewriting exercise.

Instead of returning arrays from subroutines, consider returning arrayrefs. With 30,000+ rows, each "my @array = &getTableFC(...)" statement is potentially passing a megabyte or more of data; and arrayref is a single, scalar value.

I assume that you're trying to trim leading and trailing whitespace with "s/^\s+|\s+$//". That won't work. There are a variety of ways you could do this; one would be to add a /g modifier:

$ perl -e 'my $x = " X "; $x =~ s/^\s+|\s+$//; print "|$x|";' |X | $ perl -e 'my $x = " X "; $x =~ s/^\s+|\s+$//g; print "|$x|";' |X|

Perl's string handling functions and operators are almost always measurably faster than a regex achieving the same functionality. I see no reason to use any regexes in your code. I'd recommend using index().

I've rewritten your code, taking all of the above into account. First some notes:

  • I've no idea where your base data comes from. Presumably a disk file, a database, or whatever. I've just written a GET_DATA routine to simulate this.
  • I don't know how your users supply queries. I used the one from your OP, and added a second, in @user_tests. I loop through these to simulate multiple queries.
  • I've only used some of your data: "Main", "Comp. 1", and "Comp. 3". I've given them more meaningful names: "ref_lang", "Lang A", and "Lang Q", respectively. This was mainly to show that the code is not bound to any specific number of languages; not wanting to copy-paste all five of your languages was also a factor.
  • I've pared down processComparison() to a bare minimum; even with a couple of helper routines, it's still substantially smaller than the original.
  • The "# For demo" code equates to your "#DO CODE TO FORMAT ...". I didn't know what you wanted here, so I've just output the data.

Here's the code:

#!/usr/bin/env perl use strict; use warnings; sub GET_DATA (); my @user_tests = ( { ref_lang => { match => 1, strings => ['hollow log', 'fence'], +}, 'Lang A' => { match => 1, strings => ['jumped'], }, 'Lang Q' => { match => 0, strings => ['red fox'], }, }, { ref_lang => { match => 1, strings => ['tree', 'fence'], }, 'Lang A' => { match => 1, strings => ['tree', 'doe'], }, 'Lang Q' => { match => 0, strings => ['gray fox'], }, }, ); for my $i (0 .. $#user_tests) { processComparison(GET_DATA, $user_tests[$i], $i); } sub processComparison { my ($data, $test, $index) = @_; my %results; for my $lang (keys %$test) { $results{$lang} = check_match($data->{$lang}, $test->{$lang}); } # For demo use Data::Dump; print "User Test Index $index:\n"; dd $test; print "Results:\n"; dd \%results; print '-' x 60, "\n"; return; } sub check_match { my ($data, $test) = @_; my @results; DATUM: for my $i (0 .. $#$data) { STRING: for my $string (@{$test->{strings}}) { next STRING unless match_ok( $test->{match}, index $data->[$i], $string ); push @results, $i + 1; next DATUM; } } return \@results; } sub match_ok { my ($match, $index) = @_; return $match ? $index > -1 : $index == -1; } sub GET_DATA () { return { ref_lang => [ 'The red fox jumped over the hollow log.', 'The tall tree towered over the animals.', 'The tawny deer jumped over the fence.', ], 'Lang A' => [ 'The vixen jumped over the brown log.', 'The tree swayed smartly in the field.', 'The doe jumped over the barrier.', ], 'Lang Q' => [ 'The gray fox jumped over the big log.', 'The tall tree grew in the field.', 'The red deer cleared the tall fence.', ], }; }

That outputs:

User Test Index 0: { "Lang A" => { match => 1, strings => ["jumped"] }, "Lang Q" => { match => 0, strings => ["red fox"] }, "ref_lang" => { match => 1, strings => ["hollow log", "fence"] }, } Results: { "Lang A" => [1, 3], "Lang Q" => [1, 2, 3], "ref_lang" => [1, 3] } ------------------------------------------------------------ User Test Index 1: { "Lang A" => { match => 1, strings => ["tree", "doe"] }, "Lang Q" => { match => 0, strings => ["gray fox"] }, "ref_lang" => { match => 1, strings => ["tree", "fence"] }, } Results: { "Lang A" => [2, 3], "Lang Q" => [2, 3], "ref_lang" => [2, 3] } ------------------------------------------------------------

I didn't know how "the other columns are merely for comparison purposes" was supposed to work; i.e. in terms of what activities you had in mind. So, you have the data; adapt to your needs.

— Ken

Replies are listed 'Best First'.
Re^2: Efficient regex search on array table
by Polyglot (Chaplain) on Dec 16, 2022 at 10:23 UTC

    They need to add a trophy or star function for superior posts like yours. Upvoting seems poor recompense for the amount of effort you put into that. Thank you--and you correctly discerned some of my failings.

    Yes, I was intending to remove leading and trailing whitespace. The reason for this is that a space on either side will throw off the matching for searches in which the user specified that the match must occur at the beginning or at the end. So the space removal is for the benefit of the regex later.

    And, yes, each array processed line-by-line will actually have between 4 MB (at the lowest end) and around 250+ MB for one particular annotated version (with full HTML mouseovers, etc.); but the average being closer to 10 MB each. So you were correct that each one is over a MB. These are all coming from a database, each file represented in a separate table in the DB. The routine which feeds the array pulls every row of the table at once, and this is done to speed up the database portion, by not having to use 30,000+ calls to the DB, one per row, and it was also my understanding that it was less expensive, time-wise, to use some RAM than to make repeated I/O calls. I may be mistaken--you seem to have a good grasp of these things, so feel free to clarify.

    The clients have two options for forming their query--and these options are individually available on a per-column basis: 1) they can use a simple, standard search, entering a keyword or phrase of their choice, then ticking checkboxes for case-sensitivity, whole-word (\bwhole-word\b) searching, must match at beginning or end, etc.; and 2) they can tick the "Use PERL regex" option which then disables all the other options and they are on their own with specifying what they want to match via formulation of their own regular expression. The subroutine I call for returning the regex handles both alternatives, returning in qr// form.

    I will try out your code when I have a chance--probably won't be for another couple of days until my next window of opportunity. I very much appreciate your effort.

    By the way, I didn't see much, if any, improvement with the addition of the "o" (m//o) for matching. I think this might be because the $regex is already in qr// form--but perhaps I'm simply not aware of how that affects things.

    P.S. Oh, and by the way, I'm developing on Perl 5.12.4.



      Thankyou for your kind words. By the way, instead of "failings" (negative); think "opportunities for improvement" (positive).

      Take a look at "perlperf - Perl Performance and Optimization Techniques". There's a lot of information on benchmarking and profiling tools. Use these to determine what's fast and what's slow, where bottlenecks occur, and so on. This is a much better approach than going on gut-feeling, anecdotal evidence, and the like.

      My $work often involves dealing with biological data (tends to be measured in GB, rather than MB). Functions which return large datasets are a red-flag to me; references to such data are nearly always a better choice.

      I had thought that queries like "(hollow log)|(fence)" would result in regexes like "/(?:hollow log|fence)/". There was no indication that anything more complex was involved. Your new information indicates that's not the case. For your keyword searches, I'd still recommend index(); when using anchors (^, \b, etc.), and such like, regexes are probably the correct approach.

      I recommend you change "Use PERL regex" to "Use Perl regex": Perl is the language; perl is the program; PERL is not a thing. :-)

      Good luck with your continued optimisation efforts; and, of course, do ask if you need further help.

      — Ken