Gangabass has asked for the wisdom of the Perl Monks concerning the following question:

Hi, Monks.

I need a way to generate all possible strings which sound like my source string. By the way, it's not a single string but text (plant names, for example "Aaronsohnia factorovskyi" or "Citrullus lanatus 'Charleston Gray'", etc.).

I have read about Soundex, Text::Levenshtein and Text::Metaphone, but I need some kind of reverse algorithm.

Please give me hints where to go.

Re: Generate strings which sounds like source string
by desemondo (Hermit) on Feb 21, 2010 at 01:08 UTC
    That truly sounds fascinating.

    Reference: phonetic pronunciation. If this were my task, I'd look at passing all of the words/strings through Text::Metaphone and saving the results, then indexing them and looking for those that sound like each other to varying degrees, based on your definition of "sounds like"...

    I know that's not a reverse algorithm approach, but it seems simpler to me...

    If that is simply not feasible, e.g. the complete list of possible word matches is unknown and you need to generate words/strings, then you might have to create your own module that:
    (a) calculates the phonetic encoding of your word/string, then
    (b) generates all possible strings whose phonetic encoding is equally similar.
    But I think this approach would suffer from a lot of invalid generated words that don't make sense...
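    A rough way to picture step (b) and why it produces so much noise is a naive substitution generator. This is only a sketch in Python, not anything from Text::Metaphone: the `SOUND_ALIKES` table and the `variants` function are hypothetical, and the table is deliberately tiny. Every letter with alternatives multiplies the number of candidates, and most of them are not real words.

```python
from itertools import product

# Hypothetical, deliberately tiny table of spellings treated as
# interchangeable. A usable table would be far larger and context-sensitive.
SOUND_ALIKES = {
    "g": ["g", "c", "k"],
    "c": ["c", "k", "s"],
    "a": ["a", "e"],
    "i": ["i", "y"],
    "y": ["y", "i"],
    "u": ["u", "oo"],
    "s": ["s", "z"],
}


def variants(word, table=SOUND_ALIKES):
    """Return every spelling obtained by swapping letters per `table`."""
    # Characters without an entry (including spaces and quotes) stay as they are.
    choices = [table.get(ch, [ch]) for ch in word.lower()]
    return ["".join(combo) for combo in product(*choices)]


print(variants("gray")[:4])            # ['gray', 'grai', 'grey', 'grei']
print(len(variants("gray")))           # 12
print(len(variants("citrullus")))      # 48
```

    Even with seven table entries a nine-letter word yields 48 candidates, so some filter (a dictionary, or a forward phonetic check) would be needed afterwards.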
Re: Generate strings which sounds like source string
by Khen1950fx (Canon) on Feb 21, 2010 at 00:54 UTC
Re: Generate strings which sounds like source string
by zentara (Cardinal) on Feb 21, 2010 at 11:41 UTC
Re: Generate strings which sounds like source string
by BrowserUk (Patriarch) on Feb 22, 2010 at 15:39 UTC

    Can you describe the use for this? And how serious are you about pursuing it?

    Using Soundex, Metaphone or Levenshtein seems to be pretty useless for reverse lookups for this purpose. I've had an idea that might prove more effective, but it would involve a fair amount of work to develop it.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,
      Actually, Soundex is so trivial of an algorithm it isn't too difficult to create a reverse lookup. On the other hand, it seems there would be no need. Just perform a forward encoding of all words in your dictionary and store the result in a database for future lookups.

      The real problem is that all of these algorithms, including Double Metaphone, only encode the first n consonants (4 in the case of Double Metaphone, unless the first character is a vowel). I am interested in your idea (even without implementation).

      Cheers - L~R

        Soundex is so trivial of an algorithm it isn't too difficult to create a reverse lookup.

        I know, I tried it, but the results are pretty useless. Most of these "matches" are nothing like the given words:

        The problems with soundex include:

        1. it only "matches" words that begin with the same letter.

          But for example, 'Cray' is a far better sound-alike for 'Gray' than most of those above.

          And there are many words or phrases that might match 'Citrullus' that begin with 'S'. Say 'Sit with us'.

        2. it discards all vowels and 'h's.

          Hence matches like 'Gray' with 'giaour'.

        3. it only considers 4 significant consonants.

          Hence matches like 'Charleston' with 'carls' & 'creolizations'.

        The name Soundex is deceptive. It has little or nothing to do with the sound.
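        The three problems above can be reproduced with a small forward-encoding index. This is a minimal sketch in Python of standard American Soundex, written from the published rules rather than taken from any Perl module; the names `soundex` and `build_index` are made up for the illustration.

```python
from collections import defaultdict

# Standard American Soundex digit groups; vowels, 'h', 'w' and 'y' get no digit.
_GROUPS = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
_DIGIT = {ch: digit for letters, digit in _GROUPS.items() for ch in letters}


def soundex(word):
    """Return the 4-character Soundex code of `word` ('' if it has no letters)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _DIGIT.get(letters[0], "")
    for ch in letters[1:]:
        digit = _DIGIT.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # 'h' and 'w' do not separate equal digits
            prev = digit
    return (code + "000")[:4]  # pad or truncate to one letter plus three digits


def build_index(words):
    """Map each Soundex code to the words that share it (forward encoding)."""
    index = defaultdict(list)
    for word in words:
        index[soundex(word)].append(word)
    return index


words = ["Gray", "Cray", "giaour", "Charleston", "carls", "creolizations"]
index = build_index(words)

print(soundex("Gray"), soundex("Cray"))   # G600 C600 (different first letter)
print(index[soundex("Gray")])             # ['Gray', 'giaour']
print(index[soundex("Charleston")])       # ['Charleston', 'carls', 'creolizations']
```

        'Cray' never shows up for 'Gray' because the first letter is kept verbatim, while 'giaour' does because the vowels vanish, and everything after the third coded consonant of 'Charleston' is simply dropped.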

        Metaphone is too specific. Many of the words in the OP's examples would never match anything if encoded at their full length, and if you reduce the encoding length across the board, you get far too many hits for other words. And to dynamically adjust the length of the encoding successfully, you need to encode your dictionary words at all lengths.
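        To make the "encode at all lengths" cost concrete, here is a hedged sketch in Python. It does not implement Metaphone: `skeleton` is a crude stand-in encoder (lowercase consonants only), and `build_prefix_index` and `lookup` are hypothetical names. Each word is indexed under every prefix of its code, and a query falls back from the longest prefix to shorter ones.

```python
from collections import defaultdict


def skeleton(word):
    """Stand-in phonetic key: the word's lowercase consonants (NOT Metaphone)."""
    return "".join(c for c in word.lower() if c.isalpha() and c not in "aeiouyhw")


def build_prefix_index(words, encode=skeleton):
    """Index every word under each prefix length of its code."""
    index = defaultdict(set)
    for word in words:
        code = encode(word)
        for n in range(1, len(code) + 1):
            index[code[:n]].add(word)
    return index


def lookup(index, query, encode=skeleton, min_len=2):
    """Return (matched_length, words) for the longest code prefix with hits."""
    code = encode(query)
    for n in range(len(code), min_len - 1, -1):
        hits = index.get(code[:n])
        if hits:
            return n, sorted(hits)
    return 0, []


index = build_prefix_index(["Charleston", "carls", "creolizations"])
print(lookup(index, "Charlestown"))  # (6, ['Charleston'])
print(lookup(index, "Carlson"))      # (4, ['Charleston', 'carls'])
```

        The index holds one entry per code character per word, which is the storage price of adjusting the length at query time; the shorter the prefix that finally matches, the noisier the hit list gets.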

        I am interested in your idea (even without implementation).

        The problem with developing my idea is that it would be a table-driven algorithm that would require considerable effort (programming & manual) in order to derive the tables. Not worth the effort unless there was at least an outside chance someone might make use of it.

        Hence I'd like to know if the OP is serious. And, what he (or other people) might use it for. That might give me an idea as to whether it is worth the time and effort.

