in reply to finding groups in a text list by pattern?
Depending on what you consider "similar enough", and how much external knowledge you want to invest in the system, the following code comes relatively close for your sample data:
    use strict;

    my @names = sort map { chomp; $_ } <DATA>;
    my $len = 2; # adjust to suit your taste
    my @bucket;

    FILL: {
        my $prefix = common_prefix(@bucket, $names[0]);
        # An empty bucket always accepts the next name, so a name shorter
        # than $len cannot make the loop spin forever.
        if (!@bucket || length $prefix >= $len) {
            push @bucket, shift @names;
        } else {
            print common_prefix(@bucket), "\n";
            print join("\n", map { "-- $_" } @bucket), "\n";
            @bucket = ();
        };
        redo FILL while (@names)
    };

    # Flush the last group, members included.
    if (@bucket) {
        print common_prefix(@bucket), "\n";
        print join("\n", map { "-- $_" } @bucket), "\n";
    };

    =head2 C<< common_prefix LIST >>

    Extracts the common prefix out of a list of strings.
    The strings may not contain the character C<\x00>
    because I'm lazy.

    =cut

    sub common_prefix {
        # Join all strings with \x00, then let the regex engine backtrack
        # to the longest $1 that every string starts with.
        local $" = "\x00";
        "@_" =~ m!^([^\x00]*)[^\x00]*(\0\1[^\x00]*)*$!sm
            or die "Internal error: '@_' does not match the RE";
        $1;
    };

    __DATA__
    U2 - October
    U2 - Rattle and Hum
    U2 - The Joshua Tree
    Talking Heads - Sand In The Vaseline - Disc 1
    Talking Heads - Sand In The Vaseline - Disc 2
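For comparison, the prefix extraction can also be done without a regular expression, which removes the restriction on C<\x00>. The following is only a minimal sketch; the name `common_prefix_loop` is made up for this illustration and is not part of the script above. It starts with the first string as the candidate and chops characters off the end until every other string starts with it.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script: find the common
# prefix by shrinking a candidate until every string starts with it.
sub common_prefix_loop {
    my @strings = @_;
    return '' unless @strings;
    my $prefix = shift @strings;
    for my $s (@strings) {
        # index() returns 0 only when $prefix sits at the very start of $s;
        # an empty $prefix always matches, so the loop terminates.
        chop $prefix while length($prefix) && index($s, $prefix) != 0;
    }
    return $prefix;
}

# Brackets make the trailing space visible.
print "[", common_prefix_loop('U2 - October', 'U2 - Rattle and Hum'), "]\n";
# prints "[U2 - ]"
```

This trades the regex trick for a plain loop, so it says nothing new about the grouping itself; it only shows that the prefix step is independent of how it is implemented.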
Making $len larger than 5 will break the "U2 - " case (that shared prefix is only five characters long, counting the trailing space), and it might well be simpler to invest the knowledge that all directories are of the format $ARTIST - $ALBUM, split up the list on that basis, and then simplify it. But for a braindead approach this script does well enough, and it gave me a nice occasion to employ a regular expression...

Of course, without the external knowledge, the pattern matching is not really good, as you can see in the case of Disc 1 vs. Disc 2, where the common prefix runs all the way through "Disc"; a human would have left that part off entirely.
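If you do invest the knowledge that every entry has the form $ARTIST - $ALBUM, the grouping might look roughly like the sketch below. It assumes the separator is exactly " - " and that the artist name itself never contains it; `group_by_artist` is a name invented for this example.

```perl
use strict;
use warnings;

# Sample names from the script above; a real run would read them from
# a file or a directory listing instead.
my @names = (
    'U2 - October',
    'U2 - Rattle and Hum',
    'U2 - The Joshua Tree',
    'Talking Heads - Sand In The Vaseline - Disc 1',
    'Talking Heads - Sand In The Vaseline - Disc 2',
);

# Hypothetical helper: returns a hash ref mapping artist => [albums].
sub group_by_artist {
    my %albums_by_artist;
    for my $name (@_) {
        # Limit of 2: split on the first " - " only, so album titles
        # such as "Sand In The Vaseline - Disc 1" stay in one piece.
        my ($artist, $album) = split / - /, $name, 2;
        push @{ $albums_by_artist{$artist} }, defined($album) ? $album : '';
    }
    return \%albums_by_artist;
}

my $groups = group_by_artist(@names);
for my $artist (sort keys %$groups) {
    print "$artist\n";
    print "-- $_\n" for sort @{ $groups->{$artist} };
}
```

With this approach both Talking Heads discs end up under the plain artist name, which is what a human would have picked, at the price of hard-coding the naming convention.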
Re^2: finding groups in a text list by pattern?
by howie (Sexton) on Nov 10, 2004 at 10:33 UTC