I need some help thinking through what type of algorithm I need to build and some suggestions on the best way to do it in Perl.
I've got two files, and the objective is to match each row in file 1 with an ID from a row in file 2. File 1 has millions of rows and about 55 fields, but no single field can be matched directly to anything in file 2, even though in the real world there is a genuine 1:1 mapping for each row. Manual matching by content experts has given me matches for about 15% of the rows, but that method cannot be used any further, so I would like to use this 15% as a training set to build a machine learning algorithm that could help match the rest. I am thinking the right algorithm type is a multi-class classifier.
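To make the setup concrete, here is a rough Perl sketch of how I picture checking any candidate model against the manually matched rows: hold back part of the matched set, train on the rest, and measure how often the predicted ID agrees with the experts. Everything specific in it is made up for illustration: the row layout (`fields`, `id`), the `region` field, the `F2-...` IDs, and the lookup "model", which is only a placeholder for a real classifier. It also assumes the same file 2 ID can turn up on more than one file 1 row.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Split the expert-matched rows into a training part and a holdout part.
# $holdout_fraction is the share of rows kept back for evaluation.
sub split_holdout {
    my ($rows, $holdout_fraction) = @_;
    my @shuffled  = shuffle @$rows;
    my $n_holdout = int(@shuffled * $holdout_fraction);
    my @holdout   = splice(@shuffled, 0, $n_holdout);
    return (\@shuffled, \@holdout);
}

# Fraction of rows whose predicted ID equals the expert-assigned ID.
# $predict is a code ref taking a row's field hash and returning an ID.
sub accuracy {
    my ($predict, $rows) = @_;
    return 0 unless @$rows;
    my $correct = grep {
        my $guess = $predict->($_->{fields});
        defined $guess && $guess eq $_->{id};
    } @$rows;
    return $correct / scalar @$rows;
}

# Hypothetical toy rows: 'fields' stands in for the ~55 real columns,
# 'id' is the file 2 ID the experts assigned.
my @matched = map {
    +{
        fields => { region => ($_ % 2 ? 'north'  : 'south') },
        id     => ($_ % 2 ? 'F2-001' : 'F2-002'),
    }
} 1 .. 20;

my ($train, $holdout) = split_holdout(\@matched, 0.25);

# Placeholder model (an assumption, not a real classifier):
# remember the most frequent ID seen for each region in the training part.
my %votes;
$votes{ $_->{fields}{region} }{ $_->{id} }++ for @$train;

my $predict = sub {
    my ($fields) = @_;
    my $seen = $votes{ $fields->{region} } or return undef;
    my ($best) = sort { $seen->{$b} <=> $seen->{$a} } keys %$seen;
    return $best;
};

printf "holdout accuracy: %.2f\n", accuracy($predict, $holdout);
```

The point of passing the model in as a code ref is that a decision forest or any other classifier could be swapped in for `$predict` without touching the evaluation code.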
My hypothesis is that throughout the 15% matched there are patterns that are not easy to see, which an algorithm such as a decision forest could help tease out, raising the match rate from 15% to 30% or more. Any comments or suggestions would be tremendously helpful. Cheers!