in reply to Re: how to speed up comparison between two files
in thread how to speed up comparison between two files

So I was thinking about how I would do that and came up with this (MySQL):
drop table if exists clusters;

create table clusters (
    factor   char(1)          not null,
    sequence varchar(10)      not null,
    num      integer unsigned not null,
    index (sequence)
);

insert into clusters values
    ('C', 'seq1', 25),
    ('A', 'seq1', 20),
    ('E', 'seq2', 45),
    ('D', 'seq2', 30),
    ('B', 'seq2', 25),
    ('B', 'seq1', 40),
    ('B', 'seq2', 80);

select
    sequence,
    group_concat(factor order by factor) as factors,
    group_concat(num    order by factor) as nums
from clusters
group by sequence
having factors like '%A%B%C%'
    or factors like '%B%D%E%';
Output:
+----------+---------+-------------+
| sequence | factors | nums        |
+----------+---------+-------------+
| seq1     | A,B,C   | 20,40,25    |
| seq2     | B,B,D,E | 80,25,30,45 |
+----------+---------+-------------+
I guess there is a better way, but I prefer not to mess with SQL too much.
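
If you want to drive that query from Perl rather than from the mysql client, a minimal DBI sketch could look like the following. The database name, user and password are placeholders for your own setup, and the clusters table is assumed to exist as created above:

use strict;
use warnings;
use DBI;

# Placeholder connection details -- adjust for your own setup.
my $dbh = DBI->connect( 'dbi:mysql:database=test', 'user', 'password',
    { RaiseError => 1 } );

my $sth = $dbh->prepare(q{
    select sequence,
           group_concat(factor order by factor) as factors,
           group_concat(num    order by factor) as nums
    from clusters
    group by sequence
    having factors like '%A%B%C%'
        or factors like '%B%D%E%'
});
$sth->execute;

# One row per sequence that matches either pattern.
while ( my ( $sequence, $factors, $nums ) = $sth->fetchrow_array ) {
    print "$sequence: $factors => $nums\n";
}

$dbh->disconnect;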

As a bonus, some grouping and printing routines. They feature a flexible search pattern (not limited to 3 factors) and other nice things.

use strict;
use warnings;
use Carp;
use List::Util 'max';

my @factors = ('B,B,D,D,E,E' x 10) x 10_000;
my @numbers = ('80,25,30,1000,45,0.5' x 10) x 10_000;
my $search  = 'BDE';

for my $i ( 0 .. $#factors ) {
    print_groups( group( $factors[$i], $numbers[$i], $search ) );
}

# For each factor in $search, collect the numbers that belong to it.
# Returns an array ref of [ factor, number, number, ... ] groups.
sub group {
    my ( $factors, $numbers, $search ) = @_;
    my @factors = split ',', $factors;
    my @numbers = split ',', $numbers;
    croak "Fatal error!" unless @factors == @numbers;

    my @groups;
    for my $s ( split '', $search ) {
        my @temp = ($s);
        for my $i ( 0 .. $#factors ) {
            push @temp, $numbers[$i] if $factors[$i] eq $s;
        }
        push @groups, \@temp;
    }
    return \@groups;
}

# Recursively print every combination of numbers, one group per column.
sub print_groups {
    my $groups  = shift;
    my $current = shift || 0;    # or defined-or
    my $prev    = shift;

    if ( not defined $groups->[$current] ) {
        print $prev, "\n";
        return;
    }

    my ( $factor, @numbers ) = @{ $groups->[$current] };
    my $max = max map length, @numbers;
    for my $number (@numbers) {
        print_groups(
            $groups,
            $current + 1,
            ( $prev ? "$prev " : '' )
                . ( sprintf '%s = %-*s', $factor, $max, $number )
        );
    }
}
Output:
B = 80 D = 30   E = 45
B = 80 D = 30   E = 0.5
B = 80 D = 1000 E = 45
B = 80 D = 1000 E = 0.5
B = 25 D = 30   E = 45
B = 25 D = 30   E = 0.5
B = 25 D = 1000 E = 45
B = 25 D = 1000 E = 0.5
...
(24_200_000 rows)
That takes about a minute on my pretty underpowered laptop, and I would expect the database to be reasonably fast too.
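
To get a feel for the routines on a small input, you can feed them the seq2 row from the SQL example above (this is just the same code called with different data, assuming the two subs are in scope, not part of the benchmark):

print_groups( group( 'B,B,D,E', '80,25,30,45', 'BDE' ) );

Output:

B = 80 D = 30 E = 45
B = 25 D = 30 E = 45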