in reply to Re^2: Generate strings which sounds like source string
in thread Generate strings which sounds like source string
Soundex is so trivial of an algorithm it isn't too difficult to create a reverse lookup.
I know, I tried it, but the results are pretty useless. Most of these "matches" are nothing like the given words:
c:\test>soundex
"Aaronsohnia factorovskyi"
"Aaronsohnia : [ aerenchyma aerenchymas aeromagnetic aeromechanics airiness airinesses airing airings airns arames arenaceous arenas arenes arenicolous arenose arenous armagnac armagnacs armchair armchairs armies armiger armigeral armigero armigeros armigerous armigers armistice armistices arms armsful arnica arnicas aromas arrange arranged arrangement arrangements arranger arrangers arranges arranging arraying arrowing arums aurums awareness awarenesses ]
factorovskyi" : [ factor factorable factorage factorages factored factorial factorials factories factoring factorization factorizations factorize factorized factorizes factorizing factors factorship factorships factory factorylike facture factures faggotries faggotry fagoter fagoters faster feaster feasters feistier fester festered festering festers figeater figeaters fighter fighters fixture fixtures foster fosterage fosterages fostered fosterer fosterers fostering fosterling fosterlings fosters foxtrot foxtrots foxtrotted foxtrotting fustier ]
"Citrullus lanatus 'Charleston Gray'"
"Citrullus : [ catarrhal catarrhally caterwaul caterwauled caterwauling caterwauls chitterlings citral citrals citrulline citrullines cotterless cuadrilla cuadrillas ]
lanatus : [ lamedhs lameds landaus landgrab landgrabs landgrave landgraves lands landscape landscaped landscaper landscapers landscapes landscaping landscapist landscapists landside landsides landskip landskips landsleit landslid landslidden landslide landslides landsliding landslip landslips landsman landsmen lemmatize lemmatized lemmatizes lemmatizing lends lenites lenities lentic lenticel lenticels lenticular lenticule lenticules lentigines lentigo lentisk lentisks lentissimo lentos limeades limites limits limnetic lindies linnets lintiest lints lunatic lunatics lunets lunettes lunts ]
'Charleston : [ careless carelessly carelessness carelessnesses carioles carles carless carlish carls carols carolus caroluses carrells carrels carrioles carryalls ceorlish ceorls cereals charleys charlies charlock charlocks cheerless cheerlessly cheerlessness cheerlessnesses cherrylike chorales chorals churlish churlishly churlishness churlishnesses churls corals coreless coreligionist coreligionists corollas corrals craals crawliest crawls crawlways creels creoles creolise creolised creolises creolising creolization creolizations creolize creolized creolizes creolizing crewels crewless criollos cruelest cruellest cureless curlews curlicue curlicued curlicues curlicuing curliest curls curlycue curlycues ]
Gray'" : [ gar gaur gayer gear gerah gharri gharry gherao giaour giro goer gooier gor gore gory gray gree grew grey grow grue guar guiro gurry guru gyre gyri gyro ]
The problems with soundex include:
It keeps the first letter as-is. But, for example, 'Cray' is a far better sound-alike for 'Gray' than most of those above. And there are many words or phrases that might match 'Citrullus' that begin with 'S'. Say, 'Sit with us'.
It discards vowels. Hence matches like 'Gray' with 'giaour'.
It truncates the code to the first letter plus three digits. Hence matches like 'Charleston' with 'carls' & 'creolizations'.
The name Soundex is deceptive. It has little or nothing to do with the sound.
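To see why such pairs collide, here is a minimal sketch of classic American Soundex plus the kind of reverse lookup discussed above. It is written in Python for illustration and is not the script that produced the output shown earlier; the names `soundex` and `build_index` are my own, and the tiny word list is just a sample taken from that output.

```python
from collections import defaultdict

# Consonant groups of classic American Soundex; vowels, h, w and y get no digit.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code (first letter + 3 digits)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    out = letters[0].upper()          # the first letter is kept verbatim
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue                  # h and w do not separate equal codes
        code = CODES.get(ch, "")      # vowels and y map to ""
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]          # pad or truncate to 4 characters


def build_index(words):
    """Reverse lookup: Soundex code -> list of dictionary words with that code."""
    index = defaultdict(list)
    for w in words:
        index[soundex(w)].append(w)
    return index


index = build_index(["gray", "cray", "giaour", "carls", "creolizations"])
print(soundex("Gray"), index[soundex("Gray")])
print(soundex("Charleston"), index[soundex("Charleston")])
```

With this sketch, 'Gray' and 'giaour' share a code because the vowels vanish, 'Charleston' lands on the same code as 'carls' because everything after the third digit is cut off, and 'Cray' can never match 'Gray' because the first letters differ.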
Metaphone is too specific. Many of the words in the OP's examples would never match anything if encoded at their full length, and if you reduce the encoding length across the board, you get far too many hits for other words. And to adjust the length of the encoding dynamically with any success, you need to encode your dictionary words at all lengths.
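Roughly what "encode your dictionary words at all lengths" could look like, as a hedged Python sketch. It assumes that a shorter encoding is simply a prefix of the full one. The encoder `toy_key` is a stand-in I made up (a bare consonant skeleton), NOT Metaphone; a real Metaphone function could be passed in through the `encode` parameter instead. `build_prefix_index` and `lookup` are likewise illustrative names.

```python
from collections import defaultdict


def toy_key(word):
    """Stand-in encoder (NOT Metaphone): uppercase consonant skeleton."""
    return "".join(c for c in word.upper()
                   if "A" <= c <= "Z" and c not in "AEIOUY")


def build_prefix_index(words, encode=toy_key):
    """Index every word under every prefix length of its encoded key."""
    index = defaultdict(set)
    for w in words:
        key = encode(w)
        for n in range(1, len(key) + 1):
            index[key[:n]].add(w)
    return index


def lookup(word, index, encode=toy_key, min_hits=1):
    """Try the full-length key first, then shorten until enough words match."""
    key = encode(word)
    for n in range(len(key), 0, -1):
        hits = index.get(key[:n], set())
        if len(hits) >= min_hits:
            return n, sorted(hits)
    return 0, []


index = build_prefix_index(["factor", "factory", "fixture", "foster"])
print(lookup("factorovskyi", index))
```

The cost is visible in `build_prefix_index`: each dictionary word is stored once per prefix length, which is the price of letting the query pick its own encoding length.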
I am interested in your idea (even without implementation).
The problem with developing my idea is that it would be a table-driven algorithm, and deriving the tables would require considerable effort (both programming and manual). Not worth the effort unless there was at least an outside chance someone might make use of it.
Hence I'd like to know if the OP is serious. And, what he (or other people) might use it for. That might give me an idea as to whether it is worth the time and effort.