Re: how to speed up comparison between two files

Replies are listed 'Best First'.
Re^2: how to speed up comparison between two files by Anonymous Monk on Dec 10, 2014 at 13:06 UTC
So I was thinking how I would do that and came up with this (MySQL): drop table if exists clusters; create table clusters ( factor char(1) not null, sequence varchar(10) not null, num integer unsigned not null, index (sequence) ); insert into clusters values ('C', 'seq1', 25), ('A', 'seq1', 20), ('E', 'seq2', 45), ('D', 'seq2', 30), ('B', 'seq2', 25), ('B', 'seq1', 40), ('B', 'seq2', 80) ; select sequence, group_concat(factor order by factor) as factors, group_concat(num order by factor) as nums from clusters group by sequence having factors like '%A%B%C%' or factors like '%B%D%E%' [download] Output: `+----------+---------+-------------+ \| sequence \| factors \| nums \| +----------+---------+-------------+ \| seq1 \| A,B,C \| 20,40,25 \| \| seq2 \| B,B,D,E \| 80,25,30,45 \| +----------+---------+-------------+` [download] I guess there is a better way but I prefer not to mess with SQL too much. As a bonus, some grouping and printing routines. Features flexible search pattern (not limited to 3) and other nice things. use strict; use warnings; use Carp; use List::Util 'max'; my @factors = ('B,B,D,D,E,E' x 10) x 10_000; my @numbers = ('80,25,30,1000,45,0.5' x 10) x 10_000; my $search = 'BDE'; for my $i ( 0 .. $#factors ) { print_groups( group( $factors[$i], $numbers[$i], $search ) ); } sub group { my ( $factors, $numbers, $search ) = @_; my @factors = split ',', $factors; my @numbers = split ',', $numbers; croak "Fatal error!" unless @factors == @numbers; my @groups; for my $s (split '', $search) { my @temp = ($s); for my $i ( 0 .. $#factors ) { push @temp, $numbers[$i] if $factors[$i] eq $s; } push @groups, \@temp; } return \@groups; } sub print_groups { my $groups = shift; my $current = shift \|\| 0; # or defined-or my $prev = shift; if ( not defined $groups->[$current] ) { print $prev , "\n"; return; } my ( $factor, @numbers ) = @{ $groups->[$current] }; my $max = max map length, @numbers; for my $number (@numbers) { print_groups( $groups, $current + 1, ( $prev ? "$prev " : '' ) . ( sprintf '%s = %-*s', $factor, $max, $number ) ); } } [download] Output: `B = 80 D = 30 E = 45 B = 80 D = 30 E = 0.5 B = 80 D = 1000 E = 45 B = 80 D = 1000 E = 0.5 B = 25 D = 30 E = 45 B = 25 D = 30 E = 0.5 B = 25 D = 1000 E = 45 B = 25 D = 1000 E = 0.5 ... (24_200_000 rows)` [download] That takes about a minute on my pretty underpowered laptop, and I would expect the database to be reasonably fast too.	[reply] [d/l] [select]
Re^2: how to speed up comparison between two files by Laurent_R (Canon) on Dec 10, 2014 at 23:15 UTC
Or maybe exactly what databases are often doing rather poorly... Consider that a hash is often about 3 orders of magnitude faster than a DB.	[reply]