in reply to Re: Matching alphabetic diacritics with Perl and Postgresql
in thread Matching alphabetic diacritics with Perl and Postgresql

Chicken and egg: I first have to ensure that what I insert can be matched before I can allow other users in; at that point, yes, I am in a position to use a WHERE NOT EXISTS clause. Until then, WHERE NOT EXISTS will not help me.

Regarding COPY: I have constraints from relations other than the one with the issue:

Location -> lelo <- legal_entity

lelo_type -> lelo

and some others that have to be dealt with afterwards. So I have to generate unique location ids and unique le_ids, and join them in lelo, which has a unique id per type, location and entity. (At this point the only lelo type is registered office, although other types will come as the database develops.) COPY cannot be used when building related data simultaneously and a single transaction is needed for all the related data across different tables, especially where FK constraints are involved.
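For illustration only, one way to keep the related inserts in a single transaction is a data-modifying CTE (the table names are from the diagram above; the column names and values are placeholders, not my real schema):

    begin;

    with new_loc as (
        insert into location (address_line_1, post_town, postcode, country)
        values ('1 Some Street', 'St Peter Port', 'GY1 4PW', 'Guernsey')
        returning location_id
    ),
    new_le as (
        insert into legal_entity (le_name)
        values ('Example Company Ltd')
        returning le_id
    )
    insert into lelo (location_id, le_id, lelo_type_id)
    select new_loc.location_id, new_le.le_id, 1  -- 1 = registered office (placeholder)
    from new_loc, new_le;

    commit;

The point is that the three inserts succeed or fail together; COPY has no way to express that relationship.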

One world, one people


Replies are listed 'Best First'.
Re^3: Matching alphabetic diacritics with Perl and Postgresql
by chacham (Prior) on Jun 03, 2017 at 20:03 UTC

    I see you have thought this one out. I would still like to suggest that you add a staging table and copy the data into it before processing, and then do all your processing in SQL. This should avoid the issues, as the database itself will be doing the comparisons, and with WHERE NOT EXISTS it does them at the same time as the insertion.
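    To make that concrete, a minimal sketch (the column names and file path are invented for the example):

        copy staging_location (line_1, post_town, county, country, postcode)
            from '/path/to/companies_house.csv' with (format csv, header);

        insert into location (address_line_1, post_town, county, country, postcode)
        select s.line_1, s.post_town, s.county, s.country, s.postcode
        from staging_location s
        where not exists (
            select 1
            from location l
            where l.address_line_1 = s.line_1
              and l.postcode = s.postcode
        );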

    If there are multiple steps, a stored procedure can be used, or a series of stored procedures called from another one, should you want separate entry points.
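    For instance (every name below is hypothetical), a wrapper function can call the individual steps, so each step also remains callable on its own:

        create or replace function clean_and_load() returns void
        language plpgsql as $$
        begin
            perform fix_country_spellings();   -- hypothetical step 1
            perform split_address_lines();     -- hypothetical step 2
            perform load_locations();          -- hypothetical step 3
        end;
        $$;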

      Staging table means all my efforts to clean the data have to be migrated from Perl to SQL. This is particularly awkward for location. To take an example, the registered office at Companies House might be as daft as:

      Country of Origin: Channel Islands

      RegAddressCountry: GY1 4PW

      RegAddressCounty: St. Peterport

      PostTown:

      Address Line1: MR FRED BLOGGS

      Address Line 2: The Company Name

      CompanyName: A variation of the company name

      Care of:

      Postcode:

      So the algorithm to clean the data is broadly: delete or translate some countries (Channel Islands is deleted because it isn't a country, but postcode GY1 implies the country is Guernsey; for my model the county is also Guernsey, as opposed to Alderney). If more than one level is on the same line, the line has to be split. If data is at too high a level, e.g. the registered country was mistakenly put in Country of Origin and everything else moved up one from where it should be, defaults have to be pushed onto the stack, moving the location lines down (the largest location items being higher in the list, the way I am doing it).

      So I have hashes of common translations, e.g. Curacao arrives with a cedilla whereas my country table has no cedillas, and BVI gets unabbreviated to British Virgin Islands. The number of ways Ireland gets spelled is particularly astounding: ROI, Rep of., etc. Just about every "Republic of" something needs different translations; there are even several ways for China. Mis-spellings like United Kinmod, Scotnadl and Isalnds all have to be corrected, and Isles -> Islands. There's a method that splits location lines where pattern matching is needed, a list of deletable countries (including Channel Islands and British West Indies) to force the next level to be promoted to a country, a list of pushable items for when the country is missing altogether (like Leicester being entered as a country in the register), and the facility to assume the registered country is the same as the country of origin in some cases. I could go on forever, but it's about 200 lines of OO Perl and would be about 100 pages of SQL if I moved it into the database before processing.

      Update: Yes, I could call the Perl from Postgres after staging, at least I can do that in my dev. env., but I can't expect that to hold true when it gets ported to the hosting env.
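      If plperlu does turn out to be available on the host, wrapping the existing cleaner would look something like this (the module and function names are invented, and the body here is only a stub):

          create extension if not exists plperlu;

          create or replace function clean_address(raw text) returns text
          language plperlu as $$
              my ($raw) = @_;
              # hypothetical: hand the raw line to the existing OO Perl cleaner
              # use My::AddressCleaner;
              # return My::AddressCleaner->new->clean($raw);
              return $raw;
          $$;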

      One world, one people

        I see. I'll just add my 2 cents for your perusal.

        Batch processing with loaded data is almost always better in the database, simply because the database is made for it. It ends up being faster because you are not switching between systems, the database can work in sets (as opposed to line by line), and character issues are unlikely to come up. So, before you even start, you're already ahead.

        I'm relatively sure you can do it in a few statements too, if done correctly. I have done things like this before; on one job, I created a batch table to record (and time) the different steps, allowing multiple batches to be run at the same time (because the steps do not have anything to do with the batches, per se).
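        Something along these lines (the column names are just an example) is enough to track and time the runs:

            create table batch_step (
                batch_id    integer     not null,
                step_name   text        not null,
                started_at  timestamptz not null default now(),
                finished_at timestamptz,
                primary key (batch_id, step_name)
            );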

        If there are common misspellings, I would likely use a translation table to change misspellings to the accepted spelling, or to move a country to the next step. A small query would do that (perhaps as part of a larger query). Anything not found would be flagged for human perusal, which would happen less and less with each run. Having the translation table in the database is likely better for managing it anyway, and you can even keep fun numbers like how often each misspelling shows up.
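        As a sketch (the table and column names are invented), the translation and the flagging are one statement each:

            update staging_location s
            set country = t.canonical
            from country_translation t
            where s.country = t.misspelling;

            update staging_location s
            set needs_review = true
            where not exists (
                select 1 from country c where c.name = s.country
            );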

        To remove spaces and newlines, Postgres has replace functions, including regex versions. That's one very simple query.
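        For example, collapsing runs of whitespace (including newlines) into single spaces on the invented staging column from above:

            update staging_location
            set country = trim(regexp_replace(country, '\s+', ' ', 'g'));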

        In my experience, the SQL code is smaller, not larger, if written correctly. However, if you already have it written in Perl, there's no reason to reinvent the wheel, unless a large problem comes up, or you want/need processing to take (what might be) a fraction of the time.