Re: Efficient regex search on array table

by dorko (Prior)
on Dec 14, 2022 at 21:14 UTC

in reply to Efficient regex search on array table

Hello Polyglot,

I'm not sure if this will buy you any improvements, but you might try the qr operator. You might even already be using qr depending on what &composeRegex returns.

Quoting from the documentation:


This operator quotes (and possibly compiles) its STRING as a regular expression. STRING is interpolated the same way as PATTERN in m/PATTERN/.

I think this example (also from the documentation) gives a good idea of what is possible.

    my $sentence_rx = qr{
        (?: (?<= ^ ) | (?<= \s ) )  # after start-of-string or
                                    # whitespace
        \p{Lu}                      # capital letter
        .*?                         # a bunch of anything
        (?<= \S )                   # that ends in non-
                                    # whitespace
        (?<! \b [DMS]r )            # but isn't a common abbr.
        (?<! \b Mrs )
        (?<! \b Sra )
        (?<! \b St )
        [.?!]                       # followed by a sentence
                                    # ender
        (?= $ | \s )                # in front of end-of-string
                                    # or whitespace
    }sx;
    local $/ = "";
    while (my $paragraph = <>) {
        say "NEW PARAGRAPH";
        my $count = 0;
        while ($paragraph =~ /($sentence_rx)/g) {
            printf "\tgot sentence %d: <%s>\n", ++$count, $1;
        }
    }
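
Closer to the original question, here is a rough sketch of compiling a pattern once with qr// and reusing it across an array table. The table layout (rows of [id, text]) and the pattern are made up for illustration, since the real code and data were not shown.

```perl
use strict;
use warnings;

# Hypothetical data: each row is [id, text].
my @table = (
    [1, 'The quick brown fox'],
    [2, 'jumps over the lazy dog'],
    [3, 'Foxes are quick'],
);

# Compile the pattern once, outside the loop.
my $rx = qr/\bquick\b/i;

# Keep only the rows whose text column matches.
my @hits = grep { $_->[1] =~ $rx } @table;

printf "row %d: %s\n", @$_ for @hits;
```

Whether this is any faster than interpolating a plain string into m// depends on the data and the Perl version, so it is worth timing rather than assuming.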

I hope this helps, even if only a little.



-- Yeah, I'm a Delt.

Replies are listed 'Best First'.
Re^2: Efficient regex search on array table
by Polyglot (Chaplain) on Dec 15, 2022 at 05:35 UTC
    Well, here are the almost-humorous results.

    I was able to use the qr// on the output of the composeRegex subroutine--so the implementation was a cinch. After doing so, I found that my return times hovered around 11 or 12 seconds. I increased the regex complexity--almost no difference to the time. I began to have hope that something had really improved. To see how much difference it had made, I again removed the qr// from the subroutine's output and tried the identical query again....11 or 12 seconds--the same as before.

    It was at this moment that I recalled having discovered a "wine" process running rampant and using cpu resources at full throttle earlier this morning, which I had killed, of course. Hah! Sluggish response just might have something to do with who else is hogging the cpu.

    And as for MacOSX having a penchant for dependency on "wine," I say it's better to stay away from the inebriating stuff.

    Unfortunately, in terms of breaking the 11-second barrier, that has not yet happened.



      G'day Polyglot,

      Take a look at the benchmark section (at end, in spoiler) of my "Regex /o modifier: what bugs?" post from earlier today. It may give you some insight into why qr// is not providing any efficiency improvement.

      I suggest writing your own benchmark with $str and $re more closely reflecting your real code and data. use_m() & as_qr() would likely be useful for you; use_o() may be helpful but that depends on parts of your code that you haven't shown; I don't think raw_m() or raw_o() are relevant for you.
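
      A minimal sketch of such a benchmark, using the core Benchmark module. The names use_m() and as_qr() are borrowed from the post mentioned above, but the bodies here are guesses at the idea, not copies; @rows and $re are made-up stand-ins for the real data and the composeRegex output.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up stand-ins: replace with real rows and a real pattern string.
my @rows = map { "row $_: the quick brown fox jumps over the lazy dog" } 1 .. 5_000;
my $re   = '\bfox\b.*\bdog\b';
my $qr   = qr/$re/;

# Interpolate the pattern string into m// on every match.
sub use_m {
    my $n = 0;
    for my $str (@rows) { $n++ if $str =~ m/$re/ }
    return $n;
}

# Match against the precompiled qr// object.
sub as_qr {
    my $n = 0;
    for my $str (@rows) { $n++ if $str =~ $qr }
    return $n;
}

# Run each sub for at least 1 CPU second and print a comparison table.
cmpthese(-1, { use_m => \&use_m, as_qr => \&as_qr });
```

      If the two come out close, one likely reason is that Perl does not recompile an interpolated pattern unless the interpolated variable has changed, so m/$re/ with a constant $re already avoids most of the recompilation cost.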

      By the way, I had mentioned that I was using Perl v5.36 which could be providing optimisations; what version of Perl are you using?

      — Ken

Re^2: Efficient regex search on array table
by Polyglot (Chaplain) on Dec 15, 2022 at 03:00 UTC
    Thank you for this advice. I was almost sure I was using qr// somewhere in the code...I even remember making some recent use of it. But I went back and looked at that subroutine, and then searched through all parts of the script, and found that I had made very little use of qr//, and none in portions I had recently worked with. What the composeRegex subroutine is doing is basically my own version of quotemeta. As it was some time back that I labored through that portion, I know I had tried quotemeta and had some issues with it, which is why I ended up doing my own thing--but I don't remember now what the issues were. It probably had something to do with characters that I did not want to have escaped that quotemeta would have escaped.
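
    For illustration only, a selective escape along those lines might look like the sketch below. This is not the actual composeRegex; escape_except() and its "keep" list are hypothetical names invented here.

```perl
use strict;
use warnings;

# Hypothetical helper: escape every character the way quotemeta would,
# except the characters listed in $keep, which are passed through as-is.
sub escape_except {
    my ($text, $keep) = @_;
    return join '',
        map { index($keep, $_) >= 0 ? $_ : quotemeta($_) }
        split //, $text;
}

# Leave '|' alone so it still means alternation; escape the dots.
my $pattern = escape_except('a.b|c.d', '|');
my $rx      = qr/^(?:$pattern)$/;

print "$pattern\n";
print "matches 'a.b'\n"        if 'a.b' =~ $rx;
print "does not match 'axb'\n" if 'axb' !~ $rx;
```

    The result can be handed straight to qr//, which keeps the escaping step and the compile step together in one place.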

    I will consider rewriting that portion using qr//. However, I do wonder if that will gain much on the time, as it only executes once per regex, maximum of five times for the entire script run. The time savings would have to be in how the regex was applied during the matches.

    I will give it a try--but that part is a bit complex, so I'm not sure yet how it will work out.