Sorry for the intentional vagueness; I was hoping to make the concept transferable to different types of textual data. There is a variety of data I want to apply this to; see the end for examples.

Further information: I'm interfacing data between two systems whose data formatting requirements are similar but not identical. In reality, one system never enforced any formatting requirements, while the target system has formatting requirements that changed over time (without the old data being changed, of course). The owners of the target don't want to change the old data.
It's a lot of data... as in too much for the Excel monkeys to do their VLOOKUPs :| .
They want to keep the data the same between systems as much as possible and identify the outliers separately for further processing.

I've developed the rest of the interface logic in Perl (the "E" and "L" of "ETL"). Right now I'm just working on the data transformations, and I'm curious whether I can programmatically create a regex that will rigidly match the data.

DB 1:

PRESENTATIONM | BEN_CODE| PHONE         |
--------------|---------|---------------|
John DoE      | ABC123  |1-233-123-4562 |
Jo M. Doeson  | abd123  |(222)222-2222  | 
Mc'Doe, Jim   | abd123  |222-222-2222   |
MCDOE, JAN E. | abd1243 |(222)222-2ab2  |

DB 2:

Name          | BEN_CODE| PHONE         |
--------------|---------|---------------|
Doe, John     | Z-AB123 |+12331234562   |
Doeson, Jo M. | X-AB1   |+12341234562   | 
Mc'Doe, Jim   | G-123   |+12331255562   |
MCDOE, JAN E. | ABC123  |(222)222-2222  |

Note: There is other data whose formatting is not so obvious (e.g. logs, country identifier requirements, etc.).
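
To make "a regex that rigidly matches the data" concrete, here is a minimal sketch, not a finished solution. It walks one sample value, replaces each run of uppercase letters, lowercase letters, or digits with a fixed-length character class, and escapes everything else literally. The helper name `shape_regex` is hypothetical, and the sketch assumes plain ASCII data; names with accents or other scripts would need Unicode-aware classes.

```perl
use strict;
use warnings;

# Hypothetical helper: build an anchored regex source string that accepts
# only values with the same "shape" as the sample value.
sub shape_regex {
    my ($value) = @_;
    my $pattern = '';

    # Consume the value one run of same-class characters at a time.
    while ($value =~ /([A-Z]+|[a-z]+|[0-9]+|\s+|.)/gs) {
        my $run = $1;
        my $n   = length $run;
        if    ($run =~ /^[A-Z]/) { $pattern .= '[A-Z]{' . $n . '}' }
        elsif ($run =~ /^[a-z]/) { $pattern .= '[a-z]{' . $n . '}' }
        elsif ($run =~ /^[0-9]/) { $pattern .= '[0-9]{' . $n . '}' }
        elsif ($run =~ /^\s/)    { $pattern .= '\s+' }            # any amount of whitespace
        else                     { $pattern .= quotemeta $run }   # punctuation stays literal
    }
    return '\A' . $pattern . '\z';
}

# Derive a pattern from one DB 2 phone number and test two others against it.
my $re = shape_regex('+12331234562');
print "$re\n";
for my $phone ('+12341234562', '(222)222-2222') {
    printf "%-15s %s\n", $phone, ($phone =~ /$re/ ? 'ok' : 'outlier');
}
```

Exact run lengths are kept on purpose to stay rigid. Loosening them (for instance `[0-9]+` instead of `[0-9]{11}`) is a design choice that trades strictness for fewer distinct patterns.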

Now, you might be wondering, "Why don't you just take the target system, get its field formatting requirements, and apply that regex to the source system's data?" Well, that's a great question. The reason is that "they" want to derive the formatting requirements from whichever format is most common in the data. The same will apply to multiple future databases for which I have no requirements information, so I will need to develop those requirements on the fly, per system. Call it a sort of "data democratic vote" on formatting. Needless to say, I realize the problems with this... Please let me know if that's enough information to move this from Schrödinger's moronic QM interpretation to a more Standard Model understanding of the universe.
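
The "democratic vote" could be sketched roughly as follows: reduce every value in a column to a coarse signature, count the signatures, treat the most common one as the de facto format, and set everything else aside as outliers. The names `signature` and `vote` are hypothetical, the data is the PHONE column from the second table above, and ASCII input is assumed.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse a value to a coarse shape signature,
# so that "(222)222-2222" becomes "(999)999-9999".
sub signature {
    my ($value) = @_;
    $value =~ s/^\s+|\s+$//g;   # ignore padding around the value
    $value =~ tr/A-Z/A/;        # every uppercase letter becomes "A"
    $value =~ tr/a-z/a/;        # every lowercase letter becomes "a"
    $value =~ tr/0-9/9/;        # every digit becomes "9"
    return $value;
}

# Return the most common signature and a reference to the list of values
# that do not share it.
sub vote {
    my (@values) = @_;
    my %count;
    $count{ signature($_) }++ for @values;

    # Highest count wins; ties are broken alphabetically to stay deterministic.
    my ($winner) = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    my @outliers = grep { signature($_) ne $winner } @values;
    return ($winner, \@outliers);
}

my @phones = ('+12331234562', '+12341234562', '+12331255562', '(222)222-2222');
my ($winner, $outliers) = vote(@phones);
print "majority shape: $winner\n";
print "outlier: $_\n" for @$outliers;
```

A plain majority is only one possible rule. With several competing formats, a threshold (say, keep every signature above some share of the rows) may be closer to what is wanted.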

In reply to Re^2: Generate a regex from text by porg
in thread Generate a regex from text by porg
