in reply to Re: Supervised machine learning algo for text matching across two files
in thread Supervised machine learning algo for text matching across two files

Hey there, thanks for the responses!

There are definitely key words that could match, but it is not possible to match them with a regex. A random example I made up would be: file 1: HCBS_max, file 2: National healthcare basketball society. In this case there is an acronym, and intuitively I can google both and then decide that OK, these are the same, let me do the match manually. I could make a regex rule that searches for acronyms, sure, but in general there is no obvious pattern like this... it's all human-entered data and thus all over the place, with no standardization.

What I am thinking is that I can use the other 50 columns in file 1 and search for patterns and associations that are not intuitive but nonetheless help me classify some of the matches. Is this what a random forest can potentially do, using the 15% of "ground truth" data I have as a training set?
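
To make the question concrete, here is a minimal sketch of the supervised framing, under several assumptions: every candidate (file 1 record, file 2 record) pair is turned into a row of numeric features, and the already-matched 15% supplies the labels. The three features and all the numbers below are made up for illustration. The classifier is a hand-rolled logistic regression rather than a random forest, only to keep the example dependency-free; a random forest from a library would be trained on the same kind of rows.

```python
import math

def train_logistic(rows, labels, epochs=500, lr=0.5):
    """Fit logistic-regression weights (bias stored last) by plain SGD."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        for features, label in zip(rows, labels):
            x = list(features) + [1.0]          # append constant bias input
            z = sum(w * v for w, v in zip(weights, x))
            pred = 1.0 / (1.0 + math.exp(-z))   # predicted match probability
            for i, v in enumerate(x):
                weights[i] += lr * (label - pred) * v
    return weights

def match_probability(weights, features):
    """Score one candidate pair with the trained weights."""
    x = list(features) + [1.0]
    z = sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical pair features: (same_state, same_zip, acronym_overlap).
# These rows are invented toy data standing in for the labeled 15%.
rows = [
    (1, 1, 1.0), (1, 0, 0.75), (0, 1, 1.0), (1, 1, 0.5),   # known matches
    (0, 0, 0.25), (0, 1, 0.0), (1, 0, 0.0), (0, 0, 0.5),   # known non-matches
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

weights = train_logistic(rows, labels)
likely_match = match_probability(weights, (1, 1, 1.0))
likely_miss = match_probability(weights, (0, 0, 0.25))
```

The scores would then be used to rank candidate pairs for the unmatched 85%, with a human checking the top suggestions rather than trusting them blindly.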

Re^3: Supervised machine learning algo for text matching across two files
by thanos1983 (Parson) on May 24, 2017 at 20:37 UTC

    Hello again Anonymous Monk,

    Well, to be honest, I cannot think of any way to match HCBS_max and National healthcare basketball society. So I cannot really say that this could be done by comparing the data.

    Are you able to manipulate these files while the data is being populated in them?

    If so, you could add abbreviations based on conditions.

    Seeking for Perl wisdom...on the process of learning...not there...yet!

      I'm with thanos1983 on this one. 'National healthcare basketball society' maps to 'HCBS_max'?!? Wow! If anyone figures out a solution to this one, please let me know; I'd sure like to go in with you on patenting/exploiting it!


      Give a man a fish:  <%-{-{-{-<

        You can add a feature like "if you split the long string to words based on a dictionary and extract first letters, you'll get part of the abbreviation." Then let the algorithm decide whether it's useful or not. Similarly, you can train the algorithm on a large corpus of downloaded texts, maybe the fact that the words tend to appear in the same article could be used as a feature, too (or at least some number expressing their collocability).
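
        A minimal sketch of that first-letters feature, under stated assumptions: the tiny COMPOUNDS table is a hypothetical stand-in for a real dictionary-based word splitter, and only the part of the short code before the first underscore is treated as the abbreviation. The score is the fraction of the code's letters found, in order, among the initials.

```python
import re

# Hypothetical stand-in for a dictionary-based splitter of compound words.
COMPOUNDS = {"healthcare": ["health", "care"]}

def initials(long_name):
    """Lowercase first letters of the (dictionary-split) words in a name."""
    letters = []
    for word in re.findall(r"[a-z]+", long_name.lower()):
        for part in COMPOUNDS.get(word, [word]):
            letters.append(part[0])
    return "".join(letters)

def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def acronym_overlap(short_code, long_name):
    """Share of the code's letters that occur, in order, in the initials."""
    # Assumption: the abbreviation is the token before the first underscore.
    code = re.sub(r"[^a-z]", "", short_code.split("_")[0].lower())
    if not code:
        return 0.0
    return lcs_length(code, initials(long_name)) / len(code)

score = acronym_overlap("HCBS_max", "National healthcare basketball society")
```

        The resulting number is just one more column for the learning algorithm, which can then decide whether it carries any signal.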

        ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
      Since I have 15% matched to use as ground truth... can't I somehow use the other 50 columns in the file (all of which hold various data fields) to train some sort of supervised approach that uses all the data to suggest a match statistically?

      I want to say the answer has something to do with an Expectation Maximization type of approach, but I'm way out of my depth here.

Re^3: Supervised machine learning algo for text matching across two files
by KurtZ (Friar) on May 24, 2017 at 23:40 UTC
    intuitively I can google both and then decide

    I think that's the only way, IF Google allowed automatic queries.

    The point is you don't have enough data for automatic associations, but Google does.