Re: finding groups in a text list by pattern?

Depending on what you consider "similar enough", and how much external knowledge you want to invest into the system, the following case comes relatively close for your sample data::

use strict;
my @names = sort map { chomp; $_ } <DATA>;

my $len = 2; # adjust to suit your taste

my @bucket;
FILL: {
  my $prefix = common_prefix(@bucket, $names[0]);
  if (length $prefix >= $len) {
    push @bucket, shift @names;
  } else {
    print common_prefix(@bucket),"\n";
    print join("\n", map { "-- $_" } @bucket), "\n";
    @bucket = ();
  };

  redo FILL while (@names)
};
print common_prefix(@bucket)
  if @bucket;

=head2 C<< common_prefix LIST >>

Extracts the common prefix out of a list of strings. The strings
may not contain the character C<\x00> because I'm lazy.

=cut

sub common_prefix {
  local $" = "\x00";
  "@_" =~ m!^([^\x00]*)[^\x00]*(\0\1[^\x00]*)*$!sm
    or die "Internal error: '@_' does not match the RE";
  $1;
};

__DATA__
U2 - October
U2 - Rattle and Hum
U2 - The Joshua Tree
Talking Heads - Sand In The Vaseline - Disc 1
Talking Heads - Sand In The Vaseline - Disc 2
[download]

Making $len larger than 4 will break for the case of "U2 -", and it might well be simpler to invest the knowledge that all directories are of the format $ARTIST - $ALBUM, and to split up that list and then simplify it. But for a braindead approach this script does well enough and gave me a nice situation to employ a regular expression... Of course, without the external knowledge, the pattern matching is not really good, as you see in the case of Disc 1 vs. Disc 2, where the common prefix is Disc; a human would have left off the whole thing.

Comment on Re: finding groups in a text list by pattern? Download Code

Replies are listed 'Best First'.
Re^2: finding groups in a text list by pattern? by howie (Sexton) on Nov 10, 2004 at 10:33 UTC
That was the sort of thing I was thinking of. I should have realised that it wasn't trivial when it was so hard to actually describe :-) In the end I went with a healthy dose of external knowledge, and got exactly what I was looking for: #!/usr/bin/perl use Algorithm::Bucketizer; use Number::Format qw(:subs); @patterns = ( qr/(.)\(dis[ck]\s\d+\)$/i, qr/(.)cd\s\d+$/, qr/^([^-]+\s\-\s).$/, ); $mediasize = 4.4 * 1000 * 1024 * 1024; print "Buckets are ", format_bytes($mediasize), " each.\n"; # Create a bucketizer my $b = Algorithm::Bucketizer->new( bucketsize => $mediasize, algorith +m => "retry" ); while (<DATA>) { chomp; chomp; ( $name, $asize, $path ) = split(/\t/); print $path,"\n"; if($path eq "") { $skipped_size += $asize; } else { $sizes{$pa +th} = $asize; } } print "\n"; $groupsize=0; $whole_line=""; foreach $line (sort keys %sizes) { $group = 0; $mysize = $sizes{$line}; foreach $re (@patterns) { ($result) = ($line =~ m/$re/); if( ($result ne "") && ($result eq $prevresults{$re})) { $group=1; } $prevresults{$re} = $result; } # override the pattern matching, if we're about to create an u +nfileable lump if( ($groupsize + $mysize) > $mediasize) { $group = 0; } if( (!$group) && ($groupsize>0)) { print "COMMITING $groupcount ($groupsize): $whole_line +\n"; $b->add_item($whole_line,$groupsize) \|\| die($!); $whole_line = ""; $groupsize = 0; $groupcount = 0; } $whole_line .= $line."\t"; $groupcount++; $groupsize += $mysize; } print "COMMITING $groupcount ($groupsize): $whole_line\n"; $b->add_item($whole_line,$groupsize) \|\| die($!); $b->optimize( algorithm => "random", maxtime => 60 ); for my $bucket ( $b->buckets() ) { print "\n\n# Bucket ",$bucket->serial(), " has ", format_bytes($bu +cket->level()), " in it.\n"; for my $item ( $bucket->items() ) { $item =~ s/\t$//; print " ",join("\n ",split(/\t/,$item)),"\n"; } print "\n"; } print "\n\nWe skipped ",format_bytes($skipped_size)," of Misc.\n\n"; __DATA__ The Creatures - A bestiary of 85907653 /var/music/WholeAlbums +2/The Creatures - A bestiary of The Cure - The Head On The Door 54978322 /var/music/WholeAlbums +2/The Cure - The Head On The Door The Disposable Heroes Of Hiphoprisy - Hypocracy Is The Greatest Luxury + 99314506 /var/music/WholeAlbums2/The Disposable Heroes Of Hi +phoprisy - Hypocracy Is The Greatest Luxury The Hives - Your New Favourite Band 33943504 /var/music/Who +leAlbums2/The Hives - Your New Favourite Band The Jimi Hendrix Experience - Electric Ladyland 110201733 /var/m +usic/WholeAlbums2/The Jimi Hendrix Experience - Electric Ladyland Prince - Emancipation - Disc 1 78945501 /var/music/WholeAlbums +2/Prince - Emancipation - Disc 1 Prince - Emancipation - Disc 2 80622538 /var/music/WholeAlbums +2/Prince - Emancipation - Disc 2 Prince - Emancipation - Disc 3 77673429 /var/music/WholeAlbums +2/Prince - Emancipation - Disc 3 Talking Heads - Stop Making Sense 1992195 /var/music/WholeAlbums +2/Talking Heads - Stop Making Sense Talking Heads - The Name Of This Band Is Talking Heads 125134992 + /var/music/WholeAlbums2/Talking Heads - The Name Of This Band Is Ta +lking Heads Tracy Chapman - Matters of the Heart 63415367 /var/music/Who +leAlbums2/Tracy Chapman - Matters of the Heart Turin Brakes - the Optimist LP 52981417 /var/music/WholeAlbums +2/Turin Brakes - the Optimist LP The Vines - Highly Evolved 127977590 /var/music/WholeAlbums +2/The Vines - Highly Evolved The Wannadies - Bagsy Me 47384936 /var/music/WholeAlbums +2/The Wannadies - Bagsy Me [download]	[reply] [d/l]

Replies are listed 'Best First'.

Re^2: finding groups in a text list by pattern?
by howie (Sexton) on Nov 10, 2004 at 10:33 UTC

#!/usr/bin/perl

use Algorithm::Bucketizer;
use Number::Format qw(:subs);

@patterns = (
        qr/(.*)\(*dis[ck]\s\d+\)*$/i,
        qr/(.*)cd\s*\d+$/,
        qr/^([^-]+\s\-\s).*$/,
);

$mediasize = 4.4 * 1000 * 1024 * 1024;

print "Buckets are ", format_bytes($mediasize), " each.\n";

# Create a bucketizer
my $b = Algorithm::Bucketizer->new( bucketsize => $mediasize, algorith
+m => "retry" );

while (<DATA>) {
    chomp; chomp;
    ( $name, $asize, $path ) = split(/\t/);
        print $path,"\n";

        if($path eq "") { $skipped_size += $asize; } else { $sizes{$pa
+th} = $asize; }
}

print "\n";
$groupsize=0; $whole_line="";

foreach $line (sort keys %sizes)
{
        $group = 0;
        $mysize = $sizes{$line};

        foreach $re (@patterns)
        {
                ($result) = ($line =~ m/$re/);
                if( ($result ne "") && ($result eq $prevresults{$re}))
                {
                        $group=1;
                }
                $prevresults{$re} = $result;
        }

        # override the pattern matching, if we're about to create an u
+nfileable lump
        if( ($groupsize + $mysize) > $mediasize) { $group = 0; }

        if( (!$group) && ($groupsize>0))
        {
                print "COMMITING $groupcount ($groupsize): $whole_line
+\n";
                $b->add_item($whole_line,$groupsize) || die($!);

                $whole_line = "";
                $groupsize = 0;
                $groupcount = 0;
        }

        $whole_line .= $line."\t";
        $groupcount++;
        $groupsize += $mysize;
}
print "COMMITING $groupcount ($groupsize): $whole_line\n";
$b->add_item($whole_line,$groupsize) || die($!);

$b->optimize( algorithm => "random", maxtime => 60 );

for my $bucket ( $b->buckets() ) {

    print "\n\n# Bucket ",$bucket->serial(), " has ", format_bytes($bu
+cket->level()), " in it.\n";
    for my $item ( $bucket->items() ) {
        $item =~ s/\t$//;
        print "    ",join("\n    ",split(/\t/,$item)),"\n";
    }
    print "\n";
}

print "\n\nWe skipped ",format_bytes($skipped_size)," of Misc.\n\n";

__DATA__
The Creatures - A bestiary of   85907653        /var/music/WholeAlbums
+2/The Creatures - A bestiary of
The Cure - The Head On The Door 54978322        /var/music/WholeAlbums
+2/The Cure - The Head On The Door
The Disposable Heroes Of Hiphoprisy - Hypocracy Is The Greatest Luxury
+  99314506        /var/music/WholeAlbums2/The Disposable Heroes Of Hi
+phoprisy - Hypocracy Is The Greatest Luxury
The Hives - Your New Favourite Band     33943504        /var/music/Who
+leAlbums2/The Hives - Your New Favourite Band
The Jimi Hendrix Experience - Electric Ladyland 110201733       /var/m
+usic/WholeAlbums2/The Jimi Hendrix Experience - Electric Ladyland
Prince - Emancipation - Disc 1  78945501        /var/music/WholeAlbums
+2/Prince - Emancipation - Disc 1
Prince - Emancipation - Disc 2  80622538        /var/music/WholeAlbums
+2/Prince - Emancipation - Disc 2
Prince - Emancipation - Disc 3  77673429        /var/music/WholeAlbums
+2/Prince - Emancipation - Disc 3
Talking Heads - Stop Making Sense       1992195 /var/music/WholeAlbums
+2/Talking Heads - Stop Making Sense
Talking Heads - The Name Of This Band Is Talking Heads  125134992     
+  /var/music/WholeAlbums2/Talking Heads - The Name Of This Band Is Ta
+lking Heads
Tracy Chapman - Matters of the Heart    63415367        /var/music/Who
+leAlbums2/Tracy Chapman - Matters of the Heart
Turin Brakes - the Optimist LP  52981417        /var/music/WholeAlbums
+2/Turin Brakes - the Optimist LP
The Vines - Highly Evolved      127977590       /var/music/WholeAlbums
+2/The Vines - Highly Evolved
The Wannadies - Bagsy Me        47384936        /var/music/WholeAlbums
+2/The Wannadies - Bagsy Me
[download]

[reply]
[d/l]