in reply to Generate a regex from text

> Is there some intrinsic logical flaw in trying to generate a regex like this?

Yes and no. Without better explanation and example data it's a Schrödinger question.

With this description the answer is easy

 $re = join "|", map quotemeta, @column

gives you a regex which is matching any element from a column with m/^$re$/

Could you be more precise, this sounds very much like a XY Problem

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!

Replies are listed 'Best First'.
Re^2: Generate a regex from text
by haukex (Archbishop) on Feb 01, 2017 at 15:50 UTC

    Hi Rolf & porg,

    $re = join "|", map quotemeta, @column

    Shameless plug - I recently wrote the node Building Regex Alternations Dynamically on exactly this topic (keep in mind it is still a draft). Also, if the number of rows in the DB is large, this may result in a giant regex.

    Hope this helps,
    -- Hauke D

      > Also, if the number of rows in the DB is large, this may result in a giant regex.

      Well one hits the limit for the trie optimization, splitting into multiple regexes should be done.

      Since   $x =~ /a|b|c|d/ is like  $x =~ /a|b/ or $x =~ /c|d/

      i.e. it'll scale well.

      But that's lost energy as long as the OP doesn't clarify his energy.

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

      Thanks alot Hauke! This looks similar to what I'm looking for.

        Hi LanX,

        all reports that this technique was already in use before you joined the monastery are alternative facts

        I never claimed I invented the technique :-P Relevant?

        Regards,
        -- Hauke D

Re^2: Generate a regex from text
by porg (Novice) on Feb 01, 2017 at 16:23 UTC

    Sorry for the intentional vagueness, I was hoping to make the concept transferable to different types of textual data. There is a variety of data i want to apply this to - see the end for examples.

    Further information: I'm interfacing data between 2 systems. The 2 systems have similar, but not precisely the same data formatting requirements between them. In reality, one system didn't have any enforced formatting requirements, but the target system has some formatting requirements that changed over time (without changing the old data, of course). The owners of the target don't want to change the old data.
    It's a lot of data... as in too big for the excel-monkeys to do their vlookups :| .
    They want to keep the data the same between systems as much as possible, and identify the outliers separately for further processing.

    I've developed the rest of the interface logic in perl, so the "E and L" from "ETL". Right now I'm just working on data transformations and am curious if I can programmatically create a regex that will rigidly match the data.

    DB 1:

    PRESENTATIONM | BEN_CODE| PHONE | --------------|---------|---------------| John DoE | ABC123 |1-233-123-4562 | Jo M. Doeson | abd123 |(222)222-2222 | Mc'Doe, Jim | abd123 |222-222-2222 | MCDOE, JAN E. | abd1243 |(222)222-2ab2 |

    Name | BEN_CODE| PHONE | --------------|---------|---------------| Doe, John | Z-AB123 |+12331234562 | Doeson, Jo M. | X-AB1 |+12341234562 | Mc'Doe, Jim | G-123 |+12331255562 | MCDOE, JAN E. | ABC123 |(222)222-2222 |
    Note: There is other data that is not so obvious for formatting (i.e. logs, country identifier requirements, etc)

    Now, you might be wondering, "Why don't you just take the target system, get the requirements for the field formatting/info and apply that regex to the source system's data"? Well, that's a great question. The reason why is that "they" want to develop formatting requirements based on the largest amount of data formatting present, and between multiple future databases that i don't have requirements information for and so will need to develop those requirements on the fly per system. So a sort of "data democratic vote" for formatting. Needless to say, I realize the problems with this.... please let me know if that's enough information to remove this from Schrodinger's moronic QM interpretation to a more Standard Model understanding of the universe.

      Not sure if this is of any help but I offer it anyway !. Basic idea is to build patterns of characters/numbers and count them to show the most common. I've applied it to all your data but you would probably want to do each column separately

      #!perl use strict; my %pattern = (); my @data = <DATA>; # indentify patterns for (@data){ chomp; tr/A-Za-z/A/; tr/0-9/9/; $_ =~ s/A+/[A-Za-z]+/g; $_ =~ s/9+/[0-9]+/g; ++$pattern{$_}; # count duplicates } # results for (sort keys %pattern){ print "$pattern{$_} $_\n"; } # build regex my $re = join '|',map { quotemeta } sort keys %pattern; # check data against regex for (@data){ unless (/^($re)$/){ print "No match $_\n"; } } __DATA__ John DoE ABC123 1-233-123-4562 Jo M. Doeson abd123 (222)222-2222 Mc'Doe, Jim abd123 222-222-2222 MCDOE, JAN E. abd1243 (222)222-2ab2
      poj