Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

Challenge: Find median string in a list

by Limbic~Region (Chancellor)
on Jul 04, 2007 at 18:10 UTC ( [id://624919]=perlquestion: print w/replies, xml ) Need Help??

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All,
This challenge was posed last night in #perl IRC on freenode.

Given:

  • A large list (circa 15K) of unique strings
  • Composed of the character set [a-z0-9]
  • With lengths varying from 1 to 24 characters
Determine which string is the median string and the distance from each string to the median string.

For the purposes of this challenge, distance is the Levenshtein Distance. The median string is the string where the distance between itself and the furthest away string in the list is minimized. In other words, if you were to calculate the distance of the string furthest away for every string in the list, the median string would be the one with the shortest distance.

I recognize that my use of median is likely incorrect. In the event of a conflict, refer to the definition here as the intended one. There may be more than one string that fits the defintion of median string, in this case only 1 such string is required.

If you come up with a great solution that is not very memory efficient feel free to post it but if you can, limit memory consumption to 750MB. I have a feeling a trie may provide for a better solution, but the best I can come up with so far is below:

Update: ysth suggested a number of optimizations that I had already tried. He had the sense to try them together which makes it blazingly fast. While the algorithm is mine, credit for the optimizations go to him.

Cheers - L~R

Replies are listed 'Best First'.
Re: Challenge: Find median string in a list
by blokhead (Monsignor) on Jul 04, 2007 at 19:18 UTC
    This problem in graph theory is known as the graph center problem. So here's a cop-out solution using Graph.pm:

    I'm sure there should be a more efficient algorithm for graph center than computing all pairs shortest paths (although it makes a difference if the bottleneck is computing Levenshtein distance or dealing with the big graph), but this is a starting point. Especially interesting would be an algorithm that explored the graph in an "on-line" fashion to avoid precomputing all the pairwise distances.

    BTW, Graph.pm does offer a center_vertices method for the APSP object, but it appears to be broken (at least in my version of the module).

    blokhead

      blokhead,
      This solution took nearly 5 hours (compared to 0.28 seconds). I am not sure how well it would have done if center_vertices would have worked.

      Cheers - L~R

Re: Challenge: Find median string in a list
by ysth (Canon) on Jul 04, 2007 at 22:16 UTC
      ysth,
      At one time or another, I had tried all the tweaks you had provided and wrote them off as micro-optimizations. Combined together they really do give impressive results. Thanks for suggesting I have a second look.

      Cheers - L~R

Re: Challenge: Find median string in a list
by ysth (Canon) on Jul 04, 2007 at 18:34 UTC
    Looks pretty good to me. Your > could be >= in the "next if" and "last if"s.

    Update: meant >/>=, not </<=.

        As we discussed in the chatterbox, and you determined experimentally, it's better to remove that assumption and make the inner loop go over all the strings, but keep the bailing out when the previously determined minimum maximum distance (I chortle as I write that) is met or exceeded.

        The current iteration of your code still has the "next if" commented out; I don't see a reason for this.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://624919]
Front-paged by ysth
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chilling in the Monastery: (3)
As of 2024-04-25 19:53 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found