There are definitely keywords that should match but that I can't match with a regex. A made-up example: file 1 has HCBS_max, file 2 has National healthcare basketball society. Here there is an acronym, and intuitively I can google both, decide that these are the same, and do the match manually. I could write a regex rule that searches for acronyms, sure, but there is no consistent pattern like this. It's all human-entered data, so it's all over the place with no standardization.
What I am thinking is that I can use the other 50 columns in file 1 to search for patterns and associations that are not intuitive but nonetheless help me classify some of the matches. Is this something a random forest could potentially do, using the 15% of "ground truth" data I have as a training set?
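To make the question concrete, here is a rough sketch (Python, standard library only) of how one candidate pair of names could be turned into numeric features. The function names `tokens`, `initials` and `pair_features`, and the three features themselves, are my own illustrative assumptions, not something taken from the actual files. The acronym feature is a similarity score rather than an exact test, because in my made-up example the initials (NHBS) don't even line up exactly with HCBS.

```python
import re
from difflib import SequenceMatcher


def tokens(name):
    # Lowercase and split on anything that is not a letter or digit.
    return [t for t in re.split(r"[^A-Za-z0-9]+", name.lower()) if t]


def initials(name):
    # First character of every token, e.g. "Foo bar baz" -> "fbb".
    return "".join(t[0] for t in tokens(name))


def pair_features(a, b):
    # Hypothetical feature set for one (file 1 name, file 2 name) pair.
    ta, tb = tokens(a), tokens(b)
    union = set(ta) | set(tb)
    # Best similarity between any single token of one name and the
    # initials of the other name, checked in both directions.
    acronym_sim = max(
        [SequenceMatcher(None, t, initials(b)).ratio() for t in ta]
        + [SequenceMatcher(None, t, initials(a)).ratio() for t in tb]
        + [0.0]
    )
    return {
        "char_sim": SequenceMatcher(None, a.lower(), b.lower()).ratio(),
        "token_jaccard": len(set(ta) & set(tb)) / len(union) if union else 0.0,
        "acronym_sim": acronym_sim,
    }


if __name__ == "__main__":
    print(pair_features("HCBS_max", "National healthcare basketball society"))
```

If this is the right idea, each pair in the 15% I have already matched by hand would give one such row labelled "match", pairs I know are wrong would give rows labelled "no match", and values from the other 50 columns could be appended as extra features before handing the rows to a classifier such as a random forest.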
In reply to Re^2: Supervised machine learning algo for text matching across two files
by Anonymous Monk
in thread Supervised machine learning algo for text matching across two files
by Anonymous Monk