
Re^2: Efficient regex search on array table

by Polyglot (Chaplain)
on Dec 16, 2022 at 10:23 UTC

in reply to Re: Efficient regex search on array table
in thread Efficient regex search on array table


They need to add a trophy or star function for superior posts like yours. Upvoting seems poor recompense for the amount of effort you put into that. Thank you--and you correctly discerned some of my failings.

Yes, I was intending to remove leading and trailing whitespace. The reason for this is that a space on either side will throw off the matching for searches in which the user specified that the match must occur at the beginning or at the end. So the space removal is for the benefit of the regex later.
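
A minimal sketch of that trimming step, assuming the rows are already in an array. The helper name trim and the sample rows are hypothetical. It avoids the /r modifier, which is only available from Perl 5.14, so it should also run on 5.12.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of the string with leading and
# trailing whitespace removed. The caller's original is left untouched.
sub trim {
    my ($text) = @_;
    $text =~ s/^\s+//;    # strip leading whitespace
    $text =~ s/\s+$//;    # strip trailing whitespace, including a newline
    return $text;
}

# Illustrative rows only; the real data comes from the database.
my @rows  = ("  In the beginning ", "\tfence post\n", "hollow log");
my @clean = map { trim($_) } @rows;

print "[$_]\n" for @clean;
```

Two separate substitutions are used here rather than a single alternation with /g; which form is faster on a given dataset is something to measure rather than assume.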

And, yes, each array that is processed line by line holds between 4 MB (at the lowest end) and 250+ MB (for one particular annotated version with full HTML mouseovers, etc.), with the average closer to 10 MB. So you were correct that each one is over a MB. These all come from a database, with each file stored in a separate table. The routine which feeds the array pulls every row of the table at once. This is done to speed up the database portion by avoiding 30,000+ calls to the DB, one per row; it was also my understanding that it is less expensive, time-wise, to use some RAM than to make repeated I/O calls. I may be mistaken--you seem to have a good grasp of these things, so feel free to clarify.

The clients have two options for forming their query--and these options are individually available on a per-column basis: 1) they can use a simple, standard search, entering a keyword or phrase of their choice, then ticking checkboxes for case-sensitivity, whole-word (\bwhole-word\b) searching, must match at beginning or end, etc.; and 2) they can tick the "Use PERL regex" option which then disables all the other options and they are on their own with specifying what they want to match via formulation of their own regular expression. The subroutine I call for returning the regex handles both alternatives, returning in qr// form.
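
To make the two modes concrete, here is a rough sketch of what such a subroutine might look like. It is not the actual routine: the name build_regex and the option names (case_sensitive, whole_word, at_start, at_end, raw_regex) are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical builder; every option name here is an assumption.
sub build_regex {
    my ($term, %opt) = @_;

    # "Use Perl regex" mode: compile the user's pattern as given.
    # The eval traps syntax errors in user-supplied patterns.
    if ($opt{raw_regex}) {
        my $re = eval { qr/$term/ };
        die "Invalid regex: $@" unless defined $re;
        return $re;
    }

    # Standard mode: treat the term as literal text, then add the
    # constructs selected by the checkboxes.
    my $pattern = quotemeta $term;
    $pattern = "\\b$pattern\\b" if $opt{whole_word};
    $pattern = "^$pattern"      if $opt{at_start};
    $pattern = "$pattern\$"     if $opt{at_end};

    return $opt{case_sensitive} ? qr/$pattern/ : qr/$pattern/i;
}

my $re = build_regex('hollow log', whole_word => 1, at_start => 1);
print "match\n" if 'Hollow log by the fence' =~ $re;
```

The quotemeta call is what keeps a keyword such as "a.b" from being read as a pattern in standard mode. The ^ and $ anchors are also the reason stray leading or trailing spaces matter.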

I will try out your code when I have a chance--probably won't be for another couple of days until my next window of opportunity. I very much appreciate your effort.

By the way, I didn't see much, if any, improvement with the addition of the "o" (m//o) for matching. I think this might be because the $regex is already in qr// form--but perhaps I'm simply not aware of how that affects things.
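
One way to read that result, offered as a sketch rather than a diagnosis: qr// compiles the pattern at the point where it is written, and perlfaq6 notes that Perl 5.6 and later do not recompile an interpolated pattern when the variable is unchanged, so /o has little left to save. The sample lines below are made up.

```perl
use strict;
use warnings;

my $regex = qr/\bfence\b/i;    # compiled once, at this point
print ref($regex), "\n";       # prints "Regexp": an object, not a plain string

# Illustrative data only.
my @lines = ('A fence post', 'hollow log', 'Fenced in', 'the FENCE');

# Using the object on its own as the pattern reuses the compiled form,
# so no /o modifier is needed.
my @hits = grep { $_ =~ $regex } @lines;
print "$_\n" for @hits;
```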

P.S. Oh, and by the way, I'm developing on Perl 5.12.4.



Re^3: Efficient regex search on array table
by kcott (Archbishop) on Dec 16, 2022 at 13:58 UTC

    Thank you for your kind words. By the way, instead of "failings" (negative), think "opportunities for improvement" (positive).

    Take a look at "perlperf - Perl Performance and Optimization Techniques". There's a lot of information on benchmarking and profiling tools. Use these to determine what's fast and what's slow, where bottlenecks occur, and so on. This is a much better approach than going on gut-feeling, anecdotal evidence, and the like.
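
    As a starting point, a comparison of this kind can be set up with the core Benchmark module. This is only a sketch: the 20,000 filler lines and the keyword are invented, and the figures it prints will differ from machine to machine and from the real data.

    ```perl
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Made-up data standing in for one table's rows.
    my @lines = map { "line $_ of filler text with a fence somewhere" } 1 .. 20_000;
    my $regex = qr/fence/;

    # A negative count means "run each candidate for at least that many
    # CPU seconds"; cmpthese then prints a rate comparison table.
    my $results = cmpthese(-1, {
        index => sub { my $n = grep { index($_, 'fence') >= 0 } @lines; },
        regex => sub { my $n = grep { $_ =~ $regex } @lines; },
    });
    ```

    For whole-program hot spots, a profiler such as those covered in perlperf gives a fuller picture than a micro-benchmark like this one.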

    My $work often involves dealing with biological data (which tends to be measured in GB, rather than MB). Functions which return large datasets are a red flag to me; references to such data are nearly always a better choice.
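
    A small sketch of the difference, with hypothetical names throughout (load_rows stands in for whatever routine fills the array from the DB, count_matches for the search):

    ```perl
    use strict;
    use warnings;

    # Hypothetical loader: builds the rows, then hands back a reference
    # so the list itself is not copied on return.
    sub load_rows {
        my @rows = map { "row $_" } 1 .. 5;
        return \@rows;
    }

    # Hypothetical search: receives the reference and iterates in place.
    sub count_matches {
        my ($rows, $regex) = @_;
        my $count = 0;
        for my $line (@$rows) {
            $count++ if $line =~ $regex;
        }
        return $count;
    }

    my $rows = load_rows();
    print count_matches($rows, qr/[135]$/), "\n";
    ```

    With five short rows the saving is invisible; with a 250 MB table, returning and passing a single scalar instead of the whole list avoids duplicating that data.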

    I had thought that queries like "(hollow log)|(fence)" would result in regexes like "/(?:hollow log|fence)/". There was no indication that anything more complex was involved. Your new information indicates that's not the case. For your plain keyword searches, I'd still recommend index(); when anchors, word boundaries (^, \b, etc.) and the like are involved, regexes are probably the correct approach.
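
    To illustrate the split, here is a short sketch with invented sample lines: index() handles the plain keyword cases, and a regex is kept for the anchored one.

    ```perl
    use strict;
    use warnings;

    # Illustrative data only.
    my @lines = ('The hollow log', 'A fence post', 'hollow words', 'Log cabin');

    # Plain, case-sensitive keyword: index() returns -1 when absent.
    my @plain = grep { index($_, 'hollow') >= 0 } @lines;

    # Case-insensitive keyword: lowercase both sides before index().
    my $needle = lc 'LOG';
    my @nocase = grep { index(lc($_), $needle) >= 0 } @lines;

    # Match at the beginning as a whole word: this needs a regex.
    my @anchored = grep { /^hollow\b/ } @lines;

    print scalar(@plain), " plain, ", scalar(@nocase), " nocase, ",
          scalar(@anchored), " anchored\n";
    ```

    Whether index() actually wins on your data is a question for a benchmark, not something this sketch demonstrates.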

    I recommend you change "Use PERL regex" to "Use Perl regex": Perl is the language; perl is the program; PERL is not a thing. :-)

    Good luck with your continued optimisation efforts; and, of course, do ask if you need further help.

    — Ken