Jabber_tango has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

Has anybody got some sample code that could assist me with a project, please? I have the following:

1. An Excel file with anything up to 100,000 rows and multiple columns.
2. A text file containing a list of regular expressions (some 100+).

I need to run all the regular expressions. If an entire expression matches, send it to a file "Matched"; if an expression partially matches, send it to a file "Partial".

Does anybody have a nice clean way to do this, or any suggestions?

I have used Perl extensively in the past, but not for a few years.

Many thanks,
Mark

Re: Regex on excel
by AnomalousMonk (Archbishop) on Aug 06, 2020 at 17:24 UTC
    ... list of regular expressions ... (Some 100+) ...

    Another possible approach to the problem you've described (in very general terms!) is discussed in haukex's article Building Regex Alternations Dynamically. You could conceivably process 100K+ rows from your spreadsheet in a single pass. This technique should be able to handle a couple of hundred regexes of reasonable size, but the boundary conditions of each regex might cause problems. Depending on the precise definition of a "partial" versus an "entire" match, this distinction should also be fairly easily manageable.
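
As a rough illustration of the alternation idea, here is a minimal sketch. It is not taken from haukex's article: the patterns and cell values are made up, and it assumes the patterns are already valid Perl regexes that are safe to combine.

```perl
use strict;
use warnings;

# Hypothetical patterns; in practice, one per line from the text file.
my @patterns = ( 'foo\d+', 'ba[rz]', 'qu+x' );

# Wrap each pattern in its own non-capturing group so that an alternation
# or anchor inside one pattern cannot leak into its neighbours.
my $alt = join '|', map { "(?:$_)" } @patterns;
my $any = qr/$alt/;

# Hypothetical cell values standing in for one spreadsheet column.
my @cells = ( 'foo123', 'xx bar yy', 'nothing here', 'quuux' );

# One regex test per cell instead of one per cell per pattern.
my @hits = grep { $_ =~ $any } @cells;

print "$_\n" for @hits;
```

The combined regex only tells you that some pattern matched. If you also need to know which one, the individual patterns still have to be tracked separately.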

    Please see also Short, Self-Contained, Correct Example and How to ask better questions using Test::More and sample data.


    Give a man a fish:  <%-{-{-{-<

Re: Regex on excel
by Corion (Patriarch) on Aug 06, 2020 at 14:04 UTC
Re: Regex on excel
by jcb (Parson) on Aug 07, 2020 at 02:23 UTC

    I see a logical problem here: what does a "partial" match mean?

    Perl regexes either match or they do not match; Perl does not recognize "partial" matching. Do you want to sort by whether a regex matches an entire field value or only part of a field value?
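
If the second reading is the intended one, the distinction could be sketched with the `\A` and `\z` anchors. This assumes that "Matched" means the regex covers the whole field value and "Partial" means it matches only somewhere inside it; the `classify` helper and the sample pattern are hypothetical.

```perl
use strict;
use warnings;

# Returns 'Matched' if $re matches the whole value, 'Partial' if it
# matches only part of it, and undef if it does not match at all.
sub classify {
    my ($value, $re) = @_;
    return 'Matched' if $value =~ /\A(?:$re)\z/;
    return 'Partial' if $value =~ /$re/;
    return;
}

my $re = qr/\d{3}-\d{4}/;    # hypothetical pattern
print classify('555-1234', $re), "\n";            # whole field
print classify('call 555-1234 now', $re), "\n";   # part of the field
```

The extra `(?:...)` group keeps the anchors applied to the whole pattern even when the pattern contains a top-level alternation.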

Re: Regex on excel
by perlfan (Parson) on Aug 11, 2020 at 02:16 UTC
    Export that to CSV. There's no way to avoid the 100,000 x 100 matches unless you can reduce the number of regexes based on mutual exclusion or overlap. I find it hard to believe you need all 100 regular expressions. Similarly, you may be able to build some sort of trie that lets you skip whole subtrees of regexes you know will not match.
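
One simple way to skip regexes that cannot match is a literal prefilter, which is a much cruder device than the trie suggested above. The sketch below assumes each regex can be paired by hand with a substring that any matching value must contain; the rules, the `matching_rules` helper, and the counter are all hypothetical. Perl's regex engine already applies some literal-substring optimizations itself, so whether this helps would need benchmarking on real data.

```perl
use strict;
use warnings;

# Hypothetical rules: each regex is paired with a literal substring that
# any matching value must contain. The literals are supplied by hand here.
my @rules = (
    { literal => 'ERROR', re => qr/ERROR\s+\d+/ },
    { literal => 'WARN',  re => qr/WARN(?:ING)?:/ },
    { literal => 'ID-',   re => qr/ID-\d{4}/ },
);

my $regex_runs = 0;    # counts how many regexes were actually executed

# Returns the rules whose regex matches $value, skipping any rule whose
# required literal is absent from the value.
sub matching_rules {
    my ($value) = @_;
    my @found;
    for my $rule (@rules) {
        next if index($value, $rule->{literal}) < 0;
        $regex_runs++;
        push @found, $rule if $value =~ $rule->{re};
    }
    return @found;
}

my @found = matching_rules('ERROR 42 in module');
printf "%d match(es), %d regex run(s)\n", scalar @found, $regex_runs;
```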