in reply to Re^3: Matching alphabetic diacritics with Perl and Postgresql
in thread Matching alphabetic diacritics with Perl and Postgresql
Country of Origin: Channel Islands
RegAddressCountry: GY1 4PW
RegAddressCounty: St. Peterport
PostTown:
Address Line 1: MR FRED BLOGGS
Address Line 2: The Company Name
CompanyName: A variation of the company name
Care of:
Postcode:
So the algorithm to clean the data is broadly: delete or translate some countries. Channel Islands is deleted because it isn't a country, but postcode GY1 implies the country is Guernsey; for my model, the county is also Guernsey (as opposed to Alderney). If more than one level is on the same line, they have to be split. If data is at too high a level, e.g. the registered country was mistakenly put in Country of Origin and everything else then moved up one from where it should be, defaults have to be pushed onto the stack, moving the location lines down (the way I am doing it, larger location items are higher in the list).

So I have hashes of common translations, e.g. Curaçao has a cedilla whereas my country table has no cedillas, and BVI gets expanded to British Virgin Islands. The number of ways Ireland gets spelled is particularly astounding (ROI, Rep of. etc.). Just about every "Republic of" something needs different translations, and there are even several ways of writing China. Misspellings like United Kinmod, Scotnadl, Isalnds and Isles -> Islands all have to be corrected.

There's a method that splits location lines where pattern matching is needed; a list of deletable countries, including Channel Islands and British West Indies, to force the next level to be promoted to a country; a list of pushable items for cases where the country is missing, like Leicester being entered as a country in the register; and the facility to assume the registered country is the same as the country of origin in some cases. I could go on forever, but it's about 200 lines of OO Perl and would be about 100 pages in SQL if I load it in before processing.
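To give a flavour of the approach, here is a minimal sketch, not the real 200-line module. The names (%TRANSLATE, %WORD_FIX, %DELETABLE, %PUSHABLE, %POSTCODE, normalise, clean_levels) and every table entry are invented for illustration. It also assumes a postcode can be recognised with a simple UK-style pattern and that a pushable town simply gets a default country pushed above it.

```perl
use strict;
use warnings;

# Illustrative stand-ins for the real lookup tables (all hypothetical).
my %TRANSLATE = (
    "cura\x{e7}ao"   => 'Curacao',                 # country table has no cedillas
    'bvi'            => 'British Virgin Islands',
    'roi'            => 'Ireland',
    'rep of ireland' => 'Ireland',
    'united kinmod'  => 'United Kingdom',
    'scotnadl'       => 'Scotland',
);
my %WORD_FIX  = ( isalnds => 'Islands', isles => 'Islands' );
my %DELETABLE = map { $_ => 1 } ('channel islands', 'british west indies');
my %PUSHABLE  = ( leicester => 'United Kingdom' ); # default country to push above
my %POSTCODE  = ( GY1 => [ 'Guernsey', 'Guernsey' ] ); # outward code => [country, county]

# Normalise one location line: trim, fix known word-level typos,
# then apply a whole-line translation if one exists.
sub normalise {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;
    $line =~ s/(\w+)/$WORD_FIX{ lc $1 } \/\/ $1/ge;
    return $TRANSLATE{ lc $line } // $line;
}

# Takes location lines ordered largest level first; returns a hash ref
# with country, county, town and postcode (missing levels stay undef).
sub clean_levels {
    my @stack = grep { length } map { normalise($_ // '') } @_;
    my ( %out, @kept, @implied );

    # Pull a postcode out of whichever level it landed in; its outward
    # code may imply the country and county.
    for my $item (@stack) {
        if ( $item =~ /^([A-Z]{1,2}\d[A-Z\d]?)\s*\d[A-Z]{2}$/i ) {
            $out{postcode} = uc $item;
            @implied = @{ $POSTCODE{ uc $1 } // [] };
        }
        else {
            push @kept, $item;
        }
    }

    # Drop deletable pseudo-countries so the next level is promoted.
    shift @kept while @kept && $DELETABLE{ lc $kept[0] };

    # A town entered as a country gets its missing country pushed above it.
    unshift @kept, $PUSHABLE{ lc $kept[0] }
        if @kept && $PUSHABLE{ lc $kept[0] };

    # Levels implied by the postcode sit above whatever remains.
    unshift @kept, @implied;

    @out{qw(country county town)} = @kept;
    return \%out;
}

my $r = clean_levels( 'Channel Islands', 'GY1 4PW', 'St. Peterport' );
print join( ' | ', map { $r->{$_} // '' } qw(country county town postcode) ), "\n";
# prints "Guernsey | Guernsey | St. Peterport | GY1 4PW"
```

Keeping the rules in plain hashes means a newly discovered misspelling is a one-line addition rather than a code change, which is most of the reason I'd rather not do this in SQL.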
Update: Yes, I could call the Perl from Postgres after staging. At least, I can do that in my dev environment, but I can't expect that to hold true once it gets ported to the hosting environment.
One world, one people