in reply to Re: how to speed up comparison between two files
in thread how to speed up comparison between two files

So I was thinking about how I would do that and came up with this (MySQL):
drop table if exists clusters;

create table clusters (
    factor   char(1)          not null,
    sequence varchar(10)      not null,
    num      integer unsigned not null,
    index (sequence)
);

insert into clusters values
    ('C', 'seq1', 25),
    ('A', 'seq1', 20),
    ('E', 'seq2', 45),
    ('D', 'seq2', 30),
    ('B', 'seq2', 25),
    ('B', 'seq1', 40),
    ('B', 'seq2', 80);

select
    sequence,
    group_concat(factor order by factor) as factors,
    group_concat(num    order by factor) as nums
from clusters
group by sequence
having factors like '%A%B%C%'
    or factors like '%B%D%E%';
Output:
+----------+---------+-------------+
| sequence | factors | nums        |
+----------+---------+-------------+
| seq1     | A,B,C   | 20,40,25    |
| seq2     | B,B,D,E | 80,25,30,45 |
+----------+---------+-------------+
I guess there is a better way, but I prefer not to mess with SQL too much.
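
If you want to drive that query from Perl rather than from the mysql client, a minimal DBI sketch could look like the following. The database name, user and password are placeholders for your own setup, and the clusters table is assumed to exist as created above:

use strict;
use warnings;
use DBI;

# Placeholder connection details -- adjust for your own setup.
my $dbh = DBI->connect( 'dbi:mysql:database=test', 'user', 'password',
    { RaiseError => 1 } );

my $sth = $dbh->prepare(q{
    select sequence,
           group_concat(factor order by factor) as factors,
           group_concat(num    order by factor) as nums
    from clusters
    group by sequence
    having factors like '%A%B%C%'
        or factors like '%B%D%E%'
});
$sth->execute;

# One row per sequence that matches either pattern.
while ( my ( $sequence, $factors, $nums ) = $sth->fetchrow_array ) {
    print "$sequence: $factors => $nums\n";
}

$dbh->disconnect;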

As a bonus, some grouping and printing routines. They feature a flexible search pattern (not limited to 3 factors) and other nice things.

use strict;
use warnings;
use Carp;
use List::Util 'max';

my @factors = ('B,B,D,D,E,E' x 10) x 10_000;
my @numbers = ('80,25,30,1000,45,0.5' x 10) x 10_000;
my $search  = 'BDE';

for my $i ( 0 .. $#factors ) {
    print_groups( group( $factors[$i], $numbers[$i], $search ) );
}

# For each factor in $search, collect the numbers that belong to it.
# Returns an array ref of [ factor, number, number, ... ] groups.
sub group {
    my ( $factors, $numbers, $search ) = @_;
    my @factors = split ',', $factors;
    my @numbers = split ',', $numbers;
    croak "Fatal error!" unless @factors == @numbers;

    my @groups;
    for my $s ( split '', $search ) {
        my @temp = ($s);
        for my $i ( 0 .. $#factors ) {
            push @temp, $numbers[$i] if $factors[$i] eq $s;
        }
        push @groups, \@temp;
    }
    return \@groups;
}

# Recursively print every combination of numbers, one group per column.
sub print_groups {
    my $groups  = shift;
    my $current = shift || 0;    # or defined-or
    my $prev    = shift;

    if ( not defined $groups->[$current] ) {
        print $prev, "\n";
        return;
    }

    my ( $factor, @numbers ) = @{ $groups->[$current] };
    my $max = max map length, @numbers;
    for my $number (@numbers) {
        print_groups(
            $groups,
            $current + 1,
            ( $prev ? "$prev " : '' )
                . ( sprintf '%s = %-*s', $factor, $max, $number )
        );
    }
}
Output:
B = 80 D = 30   E = 45
B = 80 D = 30   E = 0.5
B = 80 D = 1000 E = 45
B = 80 D = 1000 E = 0.5
B = 25 D = 30   E = 45
B = 25 D = 30   E = 0.5
B = 25 D = 1000 E = 45
B = 25 D = 1000 E = 0.5
...
(24_200_000 rows)
That takes about a minute on my pretty underpowered laptop, and I would expect the database to be reasonably fast too.
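
To get a feel for the routines on a small input, you can feed them the seq2 row from the SQL example above (this is just the same code called with different data, assuming the two subs are in scope, not part of the benchmark):

print_groups( group( 'B,B,D,E', '80,25,30,45', 'BDE' ) );

Output:

B = 80 D = 30 E = 45
B = 25 D = 30 E = 45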