porg has asked for the wisdom of the Perl Monks concerning the following question:

I have 2 databases with a column in each where the data is somewhat similar. I want to pull all the data from these columns and generate a regular expression that matches the first DB's data, and a separate regex that matches the second DB's data.

Note- I want to generate this FROM analyzing the text of each DB's data.

(Not relevant): I then want to find the difference between these 2 regexes (so the cases that do not match both -- so an intersection). I would assume i do this via some kind of negation after combining the regexes for both sets of data.

I tried the usual resources (searching monks/google, SO, etc), but I couldn't find anything perl-specific for generating a regex FROM the actual data. Is there some intrinsic logical flaw in trying to generate a regex like this?

Please note, I DO have a bunch of resources I found on PerlMonks that will enable me to develop this myself, I'm mainly curious if this use case's solution is already encompassed in some module.


Thanks!

Replies are listed 'Best First'.
Re: Generate a regex from text
by LanX (Saint) on Feb 01, 2017 at 15:42 UTC
    > Is there some intrinsic logical flaw in trying to generate a regex like this?

    Yes and no. Without better explanation and example data it's a Schrödinger question.

    With this description the answer is easy

     $re = join "|", map quotemeta, @column

    gives you a regex which is matching any element from a column with m/^$re$/

    Could you be more precise, this sounds very much like a XY Problem

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      Hi Rolf & porg,

      $re = join "|", map quotemeta, @column

      Shameless plug - I recently wrote the node Building Regex Alternations Dynamically on exactly this topic (keep in mind it is still a draft). Also, if the number of rows in the DB is large, this may result in a giant regex.

      Hope this helps,
      -- Hauke D

        > Also, if the number of rows in the DB is large, this may result in a giant regex.

        Well one hits the limit for the trie optimization, splitting into multiple regexes should be done.

        Since   $x =~ /a|b|c|d/ is like  $x =~ /a|b/ or $x =~ /c|d/

        i.e. it'll scale well.

        But that's lost energy as long as the OP doesn't clarify his energy.

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

        Thanks alot Hauke! This looks similar to what I'm looking for.

      Sorry for the intentional vagueness, I was hoping to make the concept transferable to different types of textual data. There is a variety of data i want to apply this to - see the end for examples.

      Further information: I'm interfacing data between 2 systems. The 2 systems have similar, but not precisely the same data formatting requirements between them. In reality, one system didn't have any enforced formatting requirements, but the target system has some formatting requirements that changed over time (without changing the old data, of course). The owners of the target don't want to change the old data.
      It's a lot of data... as in too big for the excel-monkeys to do their vlookups :| .
      They want to keep the data the same between systems as much as possible, and identify the outliers separately for further processing.

      I've developed the rest of the interface logic in perl, so the "E and L" from "ETL". Right now I'm just working on data transformations and am curious if I can programmatically create a regex that will rigidly match the data.

      DB 1:

      PRESENTATIONM | BEN_CODE| PHONE | --------------|---------|---------------| John DoE | ABC123 |1-233-123-4562 | Jo M. Doeson | abd123 |(222)222-2222 | Mc'Doe, Jim | abd123 |222-222-2222 | MCDOE, JAN E. | abd1243 |(222)222-2ab2 |

      Name | BEN_CODE| PHONE | --------------|---------|---------------| Doe, John | Z-AB123 |+12331234562 | Doeson, Jo M. | X-AB1 |+12341234562 | Mc'Doe, Jim | G-123 |+12331255562 | MCDOE, JAN E. | ABC123 |(222)222-2222 |
      Note: There is other data that is not so obvious for formatting (i.e. logs, country identifier requirements, etc)

      Now, you might be wondering, "Why don't you just take the target system, get the requirements for the field formatting/info and apply that regex to the source system's data"? Well, that's a great question. The reason why is that "they" want to develop formatting requirements based on the largest amount of data formatting present, and between multiple future databases that i don't have requirements information for and so will need to develop those requirements on the fly per system. So a sort of "data democratic vote" for formatting. Needless to say, I realize the problems with this.... please let me know if that's enough information to remove this from Schrodinger's moronic QM interpretation to a more Standard Model understanding of the universe.

        Not sure if this is of any help but I offer it anyway !. Basic idea is to build patterns of characters/numbers and count them to show the most common. I've applied it to all your data but you would probably want to do each column separately

        #!perl use strict; my %pattern = (); my @data = <DATA>; # indentify patterns for (@data){ chomp; tr/A-Za-z/A/; tr/0-9/9/; $_ =~ s/A+/[A-Za-z]+/g; $_ =~ s/9+/[0-9]+/g; ++$pattern{$_}; # count duplicates } # results for (sort keys %pattern){ print "$pattern{$_} $_\n"; } # build regex my $re = join '|',map { quotemeta } sort keys %pattern; # check data against regex for (@data){ unless (/^($re)$/){ print "No match $_\n"; } } __DATA__ John DoE ABC123 1-233-123-4562 Jo M. Doeson abd123 (222)222-2222 Mc'Doe, Jim abd123 222-222-2222 MCDOE, JAN E. abd1243 (222)222-2ab2
        poj
Re: Generate a regex from text
by Anonymous Monk on Feb 01, 2017 at 19:06 UTC
    Instead of trying to find regexes to match values, why not instead take the values to some canonical form? If Value A and Value B both transform to some canonical form C1, then they match. CPAN turns up: String::Canonical and String::Normal.
      this is the sane method
      This is an intriguing concept. However, I don't understand how this would be better than using regex? isn't regex also 'philosophically' speaking just finding the canonical representation of the data using patterns?