in reply to finding groups in a text list by pattern?

Depending on what you consider "similar enough", and how much external knowledge you want to invest into the system, the following case comes relatively close for your sample data::

use strict; my @names = sort map { chomp; $_ } <DATA>; my $len = 2; # adjust to suit your taste my @bucket; FILL: { my $prefix = common_prefix(@bucket, $names[0]); if (length $prefix >= $len) { push @bucket, shift @names; } else { print common_prefix(@bucket),"\n"; print join("\n", map { "-- $_" } @bucket), "\n"; @bucket = (); }; redo FILL while (@names) }; print common_prefix(@bucket) if @bucket; =head2 C<< common_prefix LIST >> Extracts the common prefix out of a list of strings. The strings may not contain the character C<\x00> because I'm lazy. =cut sub common_prefix { local $" = "\x00"; "@_" =~ m!^([^\x00]*)[^\x00]*(\0\1[^\x00]*)*$!sm or die "Internal error: '@_' does not match the RE"; $1; }; __DATA__ U2 - October U2 - Rattle and Hum U2 - The Joshua Tree Talking Heads - Sand In The Vaseline - Disc 1 Talking Heads - Sand In The Vaseline - Disc 2

Making $len larger than 4 will break for the case of "U2 -", and it might well be simpler to invest the knowledge that all directories are of the format $ARTIST - $ALBUM, and to split up that list and then simplify it. But for a braindead approach this script does well enough and gave me a nice situation to employ a regular expression... Of course, without the external knowledge, the pattern matching is not really good, as you see in the case of Disc 1 vs. Disc 2, where the common prefix is Disc; a human would have left off the whole thing.

Replies are listed 'Best First'.
Re^2: finding groups in a text list by pattern?
by howie (Sexton) on Nov 10, 2004 at 10:33 UTC
    That was the sort of thing I was thinking of. I should have realised that it wasn't trivial when it was so hard to actually describe :-) In the end I went with a healthy dose of external knowledge, and got exactly what I was looking for:
    #!/usr/bin/perl use Algorithm::Bucketizer; use Number::Format qw(:subs); @patterns = ( qr/(.*)\(*dis[ck]\s\d+\)*$/i, qr/(.*)cd\s*\d+$/, qr/^([^-]+\s\-\s).*$/, ); $mediasize = 4.4 * 1000 * 1024 * 1024; print "Buckets are ", format_bytes($mediasize), " each.\n"; # Create a bucketizer my $b = Algorithm::Bucketizer->new( bucketsize => $mediasize, algorith +m => "retry" ); while (<DATA>) { chomp; chomp; ( $name, $asize, $path ) = split(/\t/); print $path,"\n"; if($path eq "") { $skipped_size += $asize; } else { $sizes{$pa +th} = $asize; } } print "\n"; $groupsize=0; $whole_line=""; foreach $line (sort keys %sizes) { $group = 0; $mysize = $sizes{$line}; foreach $re (@patterns) { ($result) = ($line =~ m/$re/); if( ($result ne "") && ($result eq $prevresults{$re})) { $group=1; } $prevresults{$re} = $result; } # override the pattern matching, if we're about to create an u +nfileable lump if( ($groupsize + $mysize) > $mediasize) { $group = 0; } if( (!$group) && ($groupsize>0)) { print "COMMITING $groupcount ($groupsize): $whole_line +\n"; $b->add_item($whole_line,$groupsize) || die($!); $whole_line = ""; $groupsize = 0; $groupcount = 0; } $whole_line .= $line."\t"; $groupcount++; $groupsize += $mysize; } print "COMMITING $groupcount ($groupsize): $whole_line\n"; $b->add_item($whole_line,$groupsize) || die($!); $b->optimize( algorithm => "random", maxtime => 60 ); for my $bucket ( $b->buckets() ) { print "\n\n# Bucket ",$bucket->serial(), " has ", format_bytes($bu +cket->level()), " in it.\n"; for my $item ( $bucket->items() ) { $item =~ s/\t$//; print " ",join("\n ",split(/\t/,$item)),"\n"; } print "\n"; } print "\n\nWe skipped ",format_bytes($skipped_size)," of Misc.\n\n"; __DATA__ The Creatures - A bestiary of 85907653 /var/music/WholeAlbums +2/The Creatures - A bestiary of The Cure - The Head On The Door 54978322 /var/music/WholeAlbums +2/The Cure - The Head On The Door The Disposable Heroes Of Hiphoprisy - Hypocracy Is The Greatest Luxury + 99314506 /var/music/WholeAlbums2/The Disposable Heroes Of Hi +phoprisy - Hypocracy Is The Greatest Luxury The Hives - Your New Favourite Band 33943504 /var/music/Who +leAlbums2/The Hives - Your New Favourite Band The Jimi Hendrix Experience - Electric Ladyland 110201733 /var/m +usic/WholeAlbums2/The Jimi Hendrix Experience - Electric Ladyland Prince - Emancipation - Disc 1 78945501 /var/music/WholeAlbums +2/Prince - Emancipation - Disc 1 Prince - Emancipation - Disc 2 80622538 /var/music/WholeAlbums +2/Prince - Emancipation - Disc 2 Prince - Emancipation - Disc 3 77673429 /var/music/WholeAlbums +2/Prince - Emancipation - Disc 3 Talking Heads - Stop Making Sense 1992195 /var/music/WholeAlbums +2/Talking Heads - Stop Making Sense Talking Heads - The Name Of This Band Is Talking Heads 125134992 + /var/music/WholeAlbums2/Talking Heads - The Name Of This Band Is Ta +lking Heads Tracy Chapman - Matters of the Heart 63415367 /var/music/Who +leAlbums2/Tracy Chapman - Matters of the Heart Turin Brakes - the Optimist LP 52981417 /var/music/WholeAlbums +2/Turin Brakes - the Optimist LP The Vines - Highly Evolved 127977590 /var/music/WholeAlbums +2/The Vines - Highly Evolved The Wannadies - Bagsy Me 47384936 /var/music/WholeAlbums +2/The Wannadies - Bagsy Me