in reply to Re: Generate a regex from text
in thread Generate a regex from text
Sorry for the intentional vagueness; I was hoping to keep the concept transferable to different types of textual data. There is a variety of data I want to apply this to (see the end for examples).
Further information: I'm interfacing data between two systems whose data formatting requirements are similar but not identical. In reality, one system never enforced any formatting requirements, while the target system has formatting requirements that changed over time (without the old data being updated, of course). The owners of the target don't want to change the old data.
It's a lot of data... as in too big for the Excel-monkeys to do their VLOOKUPs :|
They want to keep the data the same between systems as much as possible, and identify the outliers separately for further processing.
I've developed the rest of the interface logic in Perl, i.e. the "E" and "L" of "ETL". Right now I'm just working on the data transformations, and I'm curious whether I can programmatically create a regex that will rigidly match the data.
DB 1:
PRESENTATIONM | BEN_CODE| PHONE         |
--------------|---------|---------------|
John DoE      | ABC123  |1-233-123-4562 |
Jo M. Doeson  | abd123  |(222)222-2222  |
Mc'Doe, Jim   | abd123  |222-222-2222   |
MCDOE, JAN E. | abd1243 |(222)222-2ab2  |
Note: There is other data whose formatting is not so obvious (e.g. logs, country identifier requirements, etc.)

Name          | BEN_CODE| PHONE         |
--------------|---------|---------------|
Doe, John     | Z-AB123 |+12331234562   |
Doeson, Jo M. | X-AB1   |+12341234562   |
Mc'Doe, Jim   | G-123   |+12331255562   |
MCDOE, JAN E. | ABC123  |(222)222-2222  |
Now, you might be wondering, "Why don't you just take the target system, get the requirements for the field formatting, and apply that regex to the source system's data?" Well, that's a great question. The reason is that "they" want to derive the formatting requirements from whichever format is most common in the data, and to do so across multiple future databases for which I have no requirements information, so I will need to develop those requirements on the fly per system. A sort of "data democratic vote" for formatting. Needless to say, I realize the problems with this...
Please let me know if that's enough information to move this from Schrodinger's moronic QM interpretation to a more Standard Model understanding of the universe.
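To make the "data democratic vote" concrete, here is a minimal sketch of one way it could work, written in Python for brevity (the same logic ports directly to Perl). The names `shape_regex` and `vote` are hypothetical helpers, not from any library. It assumes a non-empty column of ASCII values and that exact run lengths count as part of the format, which is deliberately rigid.

```python
import re
from collections import Counter

# Split a value into runs of digits, uppercase letters, lowercase letters,
# or any other single character (punctuation, spaces, non-ASCII).
TOKEN = re.compile(r"[0-9]+|[A-Z]+|[a-z]+|.", re.DOTALL)

def shape_regex(value):
    """Generalize one value into a rigid regex describing its shape."""
    parts = []
    for run in TOKEN.findall(value):
        first = run[0]
        if "0" <= first <= "9":
            parts.append("[0-9]{%d}" % len(run))
        elif "A" <= first <= "Z":
            parts.append("[A-Z]{%d}" % len(run))
        elif "a" <= first <= "z":
            parts.append("[a-z]{%d}" % len(run))
        else:
            # Anything else is matched literally.
            parts.append(re.escape(run))
    return "".join(parts)

def vote(values):
    """Return (most common shape regex, values that do not match it)."""
    counts = Counter(shape_regex(v) for v in values)
    winner, _ = counts.most_common(1)[0]
    pattern = re.compile(winner)
    outliers = [v for v in values if not pattern.fullmatch(v)]
    return winner, outliers

# PHONE column of the second sample table above.
phones = ["+12331234562", "+12341234562", "+12331255562", "(222)222-2222"]
winner, outliers = vote(phones)
print(winner)
print(outliers)
```

On these four values the "plus sign followed by eleven digits" shape wins three to one, and `(222)222-2222` is reported as the outlier. Fixed run lengths would be too strict for free-text columns such as names, where nearly every value has its own shape; loosening the quantifiers (for example `+` instead of `{n}`) would be one way to trade rigidity for coverage.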
Re^3: Generate a regex from text
by poj (Abbot) on Feb 01, 2017 at 17:43 UTC