Re^2: Efficient regex search on array table
by Polyglot (Chaplain) on Dec 16, 2022 at 10:23 UTC
They need to add a trophy or star function for superior posts like yours. Upvoting seems poor recompense for the amount of effort you put into that. Thank you--and you correctly discerned some of my failings.
Yes, I was intending to remove leading and trailing whitespace. The reason for this is that a space on either side will throw off the matching for searches in which the user specified that the match must occur at the beginning or at the end. So the space removal is for the benefit of the regex later.
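To illustrate why the trimming matters, here is a minimal sketch. The helper name `trim` and the sample string are made up for the example, and it avoids the s///r flag, since that needs Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the text with leading and
# trailing whitespace removed (no s///r, so it runs on older Perls).
sub trim {
    my ($text) = @_;
    $text =~ s/^\s+//;    # strip leading whitespace
    $text =~ s/\s+$//;    # strip trailing whitespace (including a newline)
    return $text;
}

my $padded  = "  In the beginning  ";
my $trimmed = trim($padded);

# An anchored search, as produced by a "must match at beginning" option.
my $starts_with = qr/^In/;

print "padded:  ", ($padded  =~ $starts_with ? "match" : "no match"), "\n";
print "trimmed: ", ($trimmed =~ $starts_with ? "match" : "no match"), "\n";
```

The padded string fails the anchored match and the trimmed copy passes, which is the whole reason for cleaning the rows before they reach the regex.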
And yes, each array, which is processed line by line, will hold between 4 MB at the low end and 250+ MB for one particular annotated version (with full HTML mouseovers, etc.), with the average closer to 10 MB. So you were correct that each one is over a MB. These all come from a database, with each file stored in a separate table. The routine that feeds the array pulls every row of the table at once. This speeds up the database portion by avoiding 30,000+ calls to the DB, one per row; it was also my understanding that spending some RAM is cheaper, time-wise, than making repeated I/O calls. I may be mistaken--you seem to have a good grasp of these things, so feel free to clarify.
The clients have two options for forming their query--and these options are individually available on a per-column basis: 1) they can use a simple, standard search, entering a keyword or phrase of their choice, then ticking checkboxes for case-sensitivity, whole-word (\bwhole-word\b) searching, must match at beginning or end, etc.; and 2) they can tick the "Use PERL regex" option which then disables all the other options and they are on their own with specifying what they want to match via formulation of their own regular expression. The subroutine I call for returning the regex handles both alternatives, returning in qr// form.
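In rough outline, a subroutine of that kind might look like the sketch below. This is not my actual code: the name `build_regex` and the option keys (`raw_regex`, `whole_word`, `at_start`, `at_end`, `case_sensitive`) are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper; the option names are illustrative only.
sub build_regex {
    my ($term, %opt) = @_;

    if ($opt{raw_regex}) {
        # User-supplied regex: every other option is ignored.
        # eval traps a syntax error in the user's pattern.
        my $re = eval { qr/$term/ };
        die "Invalid regex: $@" unless defined $re;
        return $re;
    }

    # Simple search: treat the term as literal text.
    my $pattern = quotemeta $term;
    $pattern = "\\b$pattern\\b" if $opt{whole_word};
    $pattern = "^$pattern"      if $opt{at_start};
    $pattern = "$pattern\$"     if $opt{at_end};

    # Case-insensitive unless the checkbox asks otherwise.
    return $opt{case_sensitive} ? qr/$pattern/ : qr/$pattern/i;
}

my $re   = build_regex('light', whole_word => 1);
my @rows = ('Let there be light', 'The lightning flashed');
my @hits = grep { $_ =~ $re } @rows;
print "$_\n" for @hits;    # prints "Let there be light"
```

The quotemeta call is what keeps a plain keyword such as "a.b" from being read as a pattern; only the raw-regex branch lets metacharacters through.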
I will try out your code when I have a chance--probably won't be for another couple of days until my next window of opportunity. I very much appreciate your effort.
By the way, I didn't see much, if any, improvement with the addition of the "o" (m//o) for matching. I think this might be because the $regex is already in qr// form--but perhaps I'm simply not aware of how that affects things.
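As far as I understand it, a qr// object is compiled at the point where qr// runs, and matching against it directly reuses that compiled form, which would explain why /o made no visible difference. A small sketch with made-up sample rows, showing that the two forms at least agree on results:

```perl
use strict;
use warnings;

my $regex = qr/\bword\b/i;    # compiled once, here
my @rows  = ('A word here', 'Wordless', 'last WORD');

my ($plain, $with_o) = (0, 0);
for my $row (@rows) {
    $plain++  if $row =~ $regex;       # qr// object used directly
    $with_o++ if $row =~ /$regex/o;    # same object, with the /o flag
}

print "plain: $plain, with /o: $with_o\n";    # prints "plain: 2, with /o: 2"
```

This only demonstrates identical results, not timing; the core Benchmark module would be the way to measure any speed difference.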
P.S. Oh, and by the way, I'm developing on Perl 5.12.4.