Generate a regex from text

porg has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Generate a regex from text by LanX (Saint) on Feb 01, 2017 at 15:42 UTC
> Is there some intrinsic logical flaw in trying to generate a regex like this? Yes and no. Without better explanation and example data it's a Schrödinger question. With this description the answer is easy `$re = join "\|", map quotemeta, @column` gives you a regex which is matching any element from a column with `m/^$re$/` Could you be more precise, this sounds very much like a XY Problem Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!}	[reply] [d/l] [select]
Re^2: Generate a regex from text by haukex (Archbishop) on Feb 01, 2017 at 15:50 UTC
Hi Rolf & porg, `$re = join "\|", map quotemeta, @column` [download] Shameless plug - I recently wrote the node Building Regex Alternations Dynamically on exactly this topic (keep in mind it is still a draft). Also, if the number of rows in the DB is large, this may result in a giant regex. Hope this helps, -- Hauke D	[reply] [d/l]
Re^3: Generate a regex from text by LanX (Saint) on Feb 01, 2017 at 16:18 UTC
> Also, if the number of rows in the DB is large, this may result in a giant regex. Well one hits the limit for the trie optimization, splitting into multiple regexes should be done. Since `$x =~ /a\|b\|c\|d/` is like `$x =~ /a\|b/ or $x =~ /c\|d/` i.e. it'll scale well. But that's lost energy as long as the OP doesn't clarify his energy. Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!}	[reply] [d/l] [select]
Re^3: Generate a regex from text by porg (Novice) on Feb 01, 2017 at 16:25 UTC
Thanks alot Hauke! This looks similar to what I'm looking for.	[reply]
Re^4: Generate a regex from text by haukex (Archbishop) on Feb 01, 2017 at 16:33 UTC
Re^3: Generate a regex from text by LanX (Saint) on Feb 01, 2017 at 15:55 UTC
> Shameless plug Yes ... and all reports that this technique was already in use before you joined the monastery are alternative facts! ;) Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!}	[reply]
Re^4: Generate a regex from text by haukex (Archbishop) on Feb 01, 2017 at 16:18 UTC
Re^5: Generate a regex from text by LanX (Saint) on Feb 01, 2017 at 16:32 UTC
Some notes below your chosen depth have not been shown here
Re^2: Generate a regex from text by porg (Novice) on Feb 01, 2017 at 16:23 UTC
Sorry for the intentional vagueness, I was hoping to make the concept transferable to different types of textual data. There is a variety of data i want to apply this to - see the end for examples. Further information: I'm interfacing data between 2 systems. The 2 systems have similar, but not precisely the same data formatting requirements between them. In reality, one system didn't have any enforced formatting requirements, but the target system has some formatting requirements that changed over time (without changing the old data, of course). The owners of the target don't want to change the old data. It's a lot of data... as in too big for the excel-monkeys to do their vlookups :\| . They want to keep the data the same between systems as much as possible, and identify the outliers separately for further processing. I've developed the rest of the interface logic in perl, so the "E and L" from "ETL". Right now I'm just working on data transformations and am curious if I can programmatically create a regex that will rigidly match the data. DB 1: `PRESENTATIONM \| BEN_CODE\| PHONE \| --------------\|---------\|---------------\| John DoE \| ABC123 \|1-233-123-4562 \| Jo M. Doeson \| abd123 \|(222)222-2222 \| Mc'Doe, Jim \| abd123 \|222-222-2222 \| MCDOE, JAN E. \| abd1243 \|(222)222-2ab2 \|` [download] `Name \| BEN_CODE\| PHONE \| --------------\|---------\|---------------\| Doe, John \| Z-AB123 \|+12331234562 \| Doeson, Jo M. \| X-AB1 \|+12341234562 \| Mc'Doe, Jim \| G-123 \|+12331255562 \| MCDOE, JAN E. \| ABC123 \|(222)222-2222 \|` [download] Note: There is other data that is not so obvious for formatting (i.e. logs, country identifier requirements, etc) Now, you might be wondering, "Why don't you just take the target system, get the requirements for the field formatting/info and apply that regex to the source system's data"? Well, that's a great question. The reason why is that "they" want to develop formatting requirements based on the largest amount of data formatting present, and between multiple future databases that i don't have requirements information for and so will need to develop those requirements on the fly per system. So a sort of "data democratic vote" for formatting. Needless to say, I realize the problems with this.... please let me know if that's enough information to remove this from Schrodinger's moronic QM interpretation to a more Standard Model understanding of the universe.	[reply] [d/l] [select]
Re^3: Generate a regex from text by poj (Abbot) on Feb 01, 2017 at 17:43 UTC
Not sure if this is of any help but I offer it anyway !. Basic idea is to build patterns of characters/numbers and count them to show the most common. I've applied it to all your data but you would probably want to do each column separately #!perl use strict; my %pattern = (); my @data = <DATA>; # indentify patterns for (@data){ chomp; tr/A-Za-z/A/; tr/0-9/9/; $_ =~ s/A+/[A-Za-z]+/g; $_ =~ s/9+/[0-9]+/g; ++$pattern{$_}; # count duplicates } # results for (sort keys %pattern){ print "$pattern{$_} $_\n"; } # build regex my $re = join '\|',map { quotemeta } sort keys %pattern; # check data against regex for (@data){ unless (/^($re)$/){ print "No match $_\n"; } } __DATA__ John DoE ABC123 1-233-123-4562 Jo M. Doeson abd123 (222)222-2222 Mc'Doe, Jim abd123 222-222-2222 MCDOE, JAN E. abd1243 (222)222-2ab2 [download] poj	[reply] [d/l]
Re: Generate a regex from text by Anonymous Monk on Feb 01, 2017 at 19:06 UTC
Instead of trying to find regexes to match values, why not instead take the values to some canonical form? If Value A and Value B both transform to some canonical form C1, then they match. CPAN turns up: String::Canonical and String::Normal.	[reply]
Re^2: Generate a regex from text by Anonymous Monk on Feb 01, 2017 at 20:36 UTC
this is the sane method	[reply]
Re^2: Generate a regex from text by porg (Novice) on Feb 04, 2017 at 03:30 UTC
This is an intriguing concept. However, I don't understand how this would be better than using regex? isn't regex also 'philosophically' speaking just finding the canonical representation of the data using patterns?	[reply]