in reply to Re^2: Supervised machine learning algo for text matching across two files
in thread Supervised machine learning algo for text matching across two files

Hello again Anonymous Monk,

Well, to be honest, I cannot think of any way to match HCBS_max with National healthcare basketball society, so I cannot really say that this could be done just by comparing the data.

Are you able to modify these files while the data is being populated into them?

If so, you could add the abbreviations yourself, based on conditions.

Seeking for Perl wisdom...on the process of learning...not there...yet!

Re^4: Supervised machine learning algo for text matching across two files
by AnomalousMonk (Archbishop) on May 24, 2017 at 21:08 UTC

    I'm with thanos1983 on this one. 'National healthcare basketball society' maps to 'HCBS_max'?!? Wow! If anyone figures out a solution to this one, please let me know; I'd sure like to go in with you on patenting/exploiting it!


    Give a man a fish:  <%-{-{-{-<

      You can add a feature like "if you split the long string into words based on a dictionary and extract the first letters, you'll get part of the abbreviation", then let the algorithm decide whether it's useful or not. Similarly, you can train the algorithm on a large corpus of downloaded texts; maybe the fact that the words tend to appear in the same article could be used as a feature, too (or at least some number expressing their collocability).
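
A minimal sketch of that first-letters feature, in Python for brevity. Everything named here is hypothetical: `VOCAB` is a toy stand-in for a real dictionary, the splitter is a simple greedy longest-prefix match, and taking the part of the code before the underscore as the abbreviation is an assumption based only on the single `HCBS_max` example in this thread.

```python
# Hypothetical mini-dictionary; a real one would be a full word list.
VOCAB = {"national", "health", "care", "basketball", "society"}

def split_by_dictionary(token, vocab=VOCAB):
    """Greedy longest-prefix split of one token into dictionary words."""
    token = token.lower()
    parts, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):
            if token[i:j] in vocab:
                parts.append(token[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: keep the remainder unsplit.
            parts.append(token[i:])
            break
    return parts

def initials(name, vocab=VOCAB):
    """First letters of all dictionary words found in the long name."""
    words = [w for tok in name.split() for w in split_by_dictionary(tok, vocab)]
    return "".join(w[0] for w in words)

def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for k, other in enumerate(b, start=1):
            if ch == other:
                cur.append(prev[k - 1] + 1)
            else:
                cur.append(max(prev[k], cur[k - 1]))
        prev = cur
    return prev[-1]

def acronym_feature(long_name, short_code, vocab=VOCAB):
    """Fraction of the abbreviation's letters covered, in order, by the initials."""
    letters = initials(long_name, vocab)
    # Assumption: the abbreviation is the part before the first underscore.
    code = "".join(c for c in short_code.split("_")[0].lower() if c.isalpha())
    if not code:
        return 0.0
    return lcs_length(letters, code) / len(code)
```

With this toy dictionary, "healthcare" splits into "health" and "care", so the initials contain H, C, B, S in order. The value is meant to be one numeric column among many that the learning algorithm can weigh, not a decision rule on its own.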

      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re^4: Supervised machine learning algo for text matching across two files
by Anonymous Monk on May 24, 2017 at 20:46 UTC
    Since I have 15% matched to use as ground truth... can't I somehow use the other 50 columns in the file (all of which hold various data fields) to train some sort of supervised approach that uses all the data to suggest a match statistically?

    I want to say the answer has something to do with an Expectation Maximization type of approach, but I'm way out of my depth here.
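
One way this idea is commonly framed is record-linkage scoring in the Fellegi-Sunter style: for each column, estimate how often it agrees among true matches (m) and among non-matches (u), then sum log-likelihood-ratio weights over the columns. Without labels, m and u are typically estimated with EM; with the 15% already matched they can be counted directly. The sketch below is in Python for brevity and rests on assumptions not stated in the thread: records are dicts, columns are compared by exact equality, columns are treated as independent, and the column names and toy rows are made up.

```python
import math

def agreement(rec_a, rec_b, columns):
    """1 where the two records agree on a column, else 0."""
    return [int(rec_a[c] == rec_b[c]) for c in columns]

def estimate_rates(pairs, columns):
    """Smoothed per-column agreement rate over a list of (rec_a, rec_b) pairs."""
    totals = [0] * len(columns)
    for a, b in pairs:
        for i, bit in enumerate(agreement(a, b, columns)):
            totals[i] += bit
    # Add-one smoothing keeps every rate strictly between 0 and 1.
    return [(t + 1) / (len(pairs) + 2) for t in totals]

def match_score(rec_a, rec_b, columns, m, u):
    """Sum of log-likelihood-ratio weights; higher means more match-like."""
    score = 0.0
    for bit, mi, ui in zip(agreement(rec_a, rec_b, columns), m, u):
        score += math.log(mi / ui) if bit else math.log((1 - mi) / (1 - ui))
    return score

# Toy data with hypothetical columns, standing in for the other 50 fields.
columns = ["state", "zip"]
matched = [  # the labelled ground-truth matches
    ({"state": "NY", "zip": "10001"}, {"state": "NY", "zip": "10001"}),
    ({"state": "CA", "zip": "94105"}, {"state": "CA", "zip": "94105"}),
    ({"state": "TX", "zip": "73301"}, {"state": "TX", "zip": "73344"}),
]
unmatched = [  # pairs assumed not to match, e.g. sampled at random
    ({"state": "NY", "zip": "10001"}, {"state": "CA", "zip": "94105"}),
    ({"state": "CA", "zip": "94105"}, {"state": "TX", "zip": "73301"}),
    ({"state": "TX", "zip": "73301"}, {"state": "TX", "zip": "10001"}),
]

m = estimate_rates(matched, columns)    # P(column agrees | match)
u = estimate_rates(unmatched, columns)  # P(column agrees | non-match)
```

Candidate pairs from the unmatched 85% could then be ranked by `match_score`, with a threshold chosen by checking held-out labelled pairs. Exact equality is the crudest possible comparison; fuzzier per-column similarities would slot into `agreement` in the same way.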