howie has asked for the wisdom of the Perl Monks concerning the following question:

I'm playing around with Algorithm::Bucketizer to fill backup DVD-Rs, and I'd like to have something to spot patterns in the list of files I'm backing up (my MP3s). For example, I want to have a list of directory names like:
U2 - October U2 - Rattle and Hum U2 - The Joshua Tree Talking Heads - Sand In The Vaseline - Disc 1 Talking Heads - Sand In The Vaseline - Disc 2
and somehow give me a list of the similarly named groups (U2 - xxxxx, and Talking Heads - xxx). The criteria for 'similar' would be a group of masks/patterns of some kind (xx -xx - Disc n).

I haven't a clue where to start with this. I thought about something like Levenshtein Distance, but I'm really after a pattern matching type of solution. I get the feeling that there must be some heavy regexp magic that will do what I want in one line...

(I realise that there is some application-specific knowledge in this particular case, but I'm curious to know if there is a general solution).

Replies are listed 'Best First'.
Re: finding groups in a text list by pattern?
by Corion (Patriarch) on Nov 09, 2004 at 12:31 UTC

    Depending on what you consider "similar enough", and how much external knowledge you want to invest into the system, the following case comes relatively close for your sample data::

    use strict; my @names = sort map { chomp; $_ } <DATA>; my $len = 2; # adjust to suit your taste my @bucket; FILL: { my $prefix = common_prefix(@bucket, $names[0]); if (length $prefix >= $len) { push @bucket, shift @names; } else { print common_prefix(@bucket),"\n"; print join("\n", map { "-- $_" } @bucket), "\n"; @bucket = (); }; redo FILL while (@names) }; print common_prefix(@bucket) if @bucket; =head2 C<< common_prefix LIST >> Extracts the common prefix out of a list of strings. The strings may not contain the character C<\x00> because I'm lazy. =cut sub common_prefix { local $" = "\x00"; "@_" =~ m!^([^\x00]*)[^\x00]*(\0\1[^\x00]*)*$!sm or die "Internal error: '@_' does not match the RE"; $1; }; __DATA__ U2 - October U2 - Rattle and Hum U2 - The Joshua Tree Talking Heads - Sand In The Vaseline - Disc 1 Talking Heads - Sand In The Vaseline - Disc 2

    Making $len larger than 4 will break for the case of "U2 -", and it might well be simpler to invest the knowledge that all directories are of the format $ARTIST - $ALBUM, and to split up that list and then simplify it. But for a braindead approach this script does well enough and gave me a nice situation to employ a regular expression... Of course, without the external knowledge, the pattern matching is not really good, as you see in the case of Disc 1 vs. Disc 2, where the common prefix is Disc; a human would have left off the whole thing.

      That was the sort of thing I was thinking of. I should have realised that it wasn't trivial when it was so hard to actually describe :-) In the end I went with a healthy dose of external knowledge, and got exactly what I was looking for:
      #!/usr/bin/perl use Algorithm::Bucketizer; use Number::Format qw(:subs); @patterns = ( qr/(.*)\(*dis[ck]\s\d+\)*$/i, qr/(.*)cd\s*\d+$/, qr/^([^-]+\s\-\s).*$/, ); $mediasize = 4.4 * 1000 * 1024 * 1024; print "Buckets are ", format_bytes($mediasize), " each.\n"; # Create a bucketizer my $b = Algorithm::Bucketizer->new( bucketsize => $mediasize, algorith +m => "retry" ); while (<DATA>) { chomp; chomp; ( $name, $asize, $path ) = split(/\t/); print $path,"\n"; if($path eq "") { $skipped_size += $asize; } else { $sizes{$pa +th} = $asize; } } print "\n"; $groupsize=0; $whole_line=""; foreach $line (sort keys %sizes) { $group = 0; $mysize = $sizes{$line}; foreach $re (@patterns) { ($result) = ($line =~ m/$re/); if( ($result ne "") && ($result eq $prevresults{$re})) { $group=1; } $prevresults{$re} = $result; } # override the pattern matching, if we're about to create an u +nfileable lump if( ($groupsize + $mysize) > $mediasize) { $group = 0; } if( (!$group) && ($groupsize>0)) { print "COMMITING $groupcount ($groupsize): $whole_line +\n"; $b->add_item($whole_line,$groupsize) || die($!); $whole_line = ""; $groupsize = 0; $groupcount = 0; } $whole_line .= $line."\t"; $groupcount++; $groupsize += $mysize; } print "COMMITING $groupcount ($groupsize): $whole_line\n"; $b->add_item($whole_line,$groupsize) || die($!); $b->optimize( algorithm => "random", maxtime => 60 ); for my $bucket ( $b->buckets() ) { print "\n\n# Bucket ",$bucket->serial(), " has ", format_bytes($bu +cket->level()), " in it.\n"; for my $item ( $bucket->items() ) { $item =~ s/\t$//; print " ",join("\n ",split(/\t/,$item)),"\n"; } print "\n"; } print "\n\nWe skipped ",format_bytes($skipped_size)," of Misc.\n\n"; __DATA__ The Creatures - A bestiary of 85907653 /var/music/WholeAlbums +2/The Creatures - A bestiary of The Cure - The Head On The Door 54978322 /var/music/WholeAlbums +2/The Cure - The Head On The Door The Disposable Heroes Of Hiphoprisy - Hypocracy Is The Greatest Luxury + 99314506 /var/music/WholeAlbums2/The Disposable Heroes Of Hi +phoprisy - Hypocracy Is The Greatest Luxury The Hives - Your New Favourite Band 33943504 /var/music/Who +leAlbums2/The Hives - Your New Favourite Band The Jimi Hendrix Experience - Electric Ladyland 110201733 /var/m +usic/WholeAlbums2/The Jimi Hendrix Experience - Electric Ladyland Prince - Emancipation - Disc 1 78945501 /var/music/WholeAlbums +2/Prince - Emancipation - Disc 1 Prince - Emancipation - Disc 2 80622538 /var/music/WholeAlbums +2/Prince - Emancipation - Disc 2 Prince - Emancipation - Disc 3 77673429 /var/music/WholeAlbums +2/Prince - Emancipation - Disc 3 Talking Heads - Stop Making Sense 1992195 /var/music/WholeAlbums +2/Talking Heads - Stop Making Sense Talking Heads - The Name Of This Band Is Talking Heads 125134992 + /var/music/WholeAlbums2/Talking Heads - The Name Of This Band Is Ta +lking Heads Tracy Chapman - Matters of the Heart 63415367 /var/music/Who +leAlbums2/Tracy Chapman - Matters of the Heart Turin Brakes - the Optimist LP 52981417 /var/music/WholeAlbums +2/Turin Brakes - the Optimist LP The Vines - Highly Evolved 127977590 /var/music/WholeAlbums +2/The Vines - Highly Evolved The Wannadies - Bagsy Me 47384936 /var/music/WholeAlbums +2/The Wannadies - Bagsy Me
Re: finding groups in a text list by pattern?
by pelagic (Priest) on Nov 09, 2004 at 12:32 UTC
    now this is really OT
    anyway: the words similar & music lead my brain to this site where similarity is made visual (in a way).

    and I knew before that this doesn't help you ;)

    pelagic