Re: Comparing string characters

Code:

#!/usr/bin/env perl

use strict;
use warnings;

my @tests = (
    [qw{ABCGE ABCGE}],
    [qw{ABCGE FGCGB}],
    [qw{ABCGE JHAGT}],
);

for my $strings (@tests) {
    my $diff = 0;

    for my $i (0 .. length($strings->[0]) - 1) {
        my @chars = map substr($strings->[$_], $i, 1), 0, 1;
        ++$diff if $chars[0] ne $chars[1];
    }

    print "@$strings $diff\n";
}

Output:

ABCGE ABCGE 0
ABCGE FGCGB 3
ABCGE JHAGT 4

You've received a number of solutions; use Benchmark to see which is the most efficient. From your username, I'm guessing you're dealing with biological data, which is typically huge, so efficiency usually matters.
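As a rough sketch of such a comparison, the script below times two ways of counting differing positions with the core Benchmark module: one that fetches both characters through map, as in the code above, and one that compares the two substr results directly. The sub names, the 10,000-character length, and the random A/C/G/T input are assumptions made for illustration; five-character strings would mostly measure call overhead rather than the comparison itself.

```perl
#!/usr/bin/env perl

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumption: random 10_000-character strings over A/C/G/T stand in for
# real data; adjust the length and alphabet to match your input.
my @alphabet = qw(A C G T);

sub random_string {
    my ($len) = @_;
    return join '', map { $alphabet[rand @alphabet] } 1 .. $len;
}

my $s1 = random_string(10_000);
my $s2 = random_string(10_000);

# Variant 1: fetch both characters through map, then compare them.
sub diff_map {
    my ($x, $y) = @_;
    my $diff = 0;
    for my $i (0 .. length($x) - 1) {
        my @chars = map substr($_, $i, 1), $x, $y;
        ++$diff if $chars[0] ne $chars[1];
    }
    return $diff;
}

# Variant 2: compare the two substr results directly.
sub diff_substr {
    my ($x, $y) = @_;
    my $diff = 0;
    for my $i (0 .. length($x) - 1) {
        ++$diff if substr($x, $i, 1) ne substr($y, $i, 1);
    }
    return $diff;
}

# Both variants must agree before their timings mean anything.
die "variants disagree\n" unless diff_map($s1, $s2) == diff_substr($s1, $s2);

# A negative count asks Benchmark to run each variant for at least
# that many CPU seconds, then print a comparison table.
cmpthese(-1, {
    map_version   => sub { diff_map($s1, $s2) },
    direct_substr => sub { diff_substr($s1, $s2) },
});
```

The resulting table depends on the machine and the perl build, so treat it as a guide for your own data rather than a general ranking.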

— Ken


Replies are listed 'Best First'.
Re^2: Comparing string characters by Marshall (Canon) on Nov 23, 2021 at 18:56 UTC
If there is a desire to increase efficiency, I would get rid of the map, which is actually a loop, and also the array indexing. Code is shown below. Of course, as you suggest, benchmarking is absolutely necessary if high performance is desired. The example test strings need to be much longer, because with just 5 characters the setup code dwarfs the actual comparison code.


An XS procedure written in C could do this very efficiently. The goal would be to reduce the number of main-memory cycles. On a 64-bit machine like mine, a bit of setup and cleanup code would deal with the bytes outside the 64-bit-aligned block of memory; the main loop would then read both buffers 8 bytes at a time and XOR each pair of words. If the result is zero, all 8 bytes are the same. If not, test each byte to see how many differ. An assembly solution would probably provide a significant performance increase over the C implementation. This is one of those cases where a human can probably beat the compiler easily, because there are some special buffer-oriented instructions that are very difficult for the compiler to use effectively.
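The XOR idea can be tried without leaving Perl. The sketch below is not the XS routine described above, only a pure-Perl illustration of the same XOR-then-count approach: Perl's string XOR operator produces a NUL byte wherever the two inputs agree, and tr counts the bytes that are not NUL, with the byte loop running inside perl's own C code. The sub name count_diff_xor is hypothetical, and the code assumes single-byte characters such as ASCII sequence data.

```perl
use strict;
use warnings;

# Count the positions at which two strings differ, looking only at the
# overlapping prefix. Assumes single-byte characters (e.g. ASCII data).
sub count_diff_xor {
    my ($x, $y) = @_;
    my $len = length($x) < length($y) ? length($x) : length($y);

    # String XOR yields "\0" wherever the two bytes are equal.
    my $xor = substr($x, 0, $len) ^ substr($y, 0, $len);

    # With the /c flag, tr counts every byte that is NOT "\0".
    return $xor =~ tr/\0//c;
}

# Same three test pairs as the loop-based versions.
printf "%s %s %d\n", @$_, count_diff_xor(@$_)
    for [qw{ABCGE ABCGE}], [qw{ABCGE FGCGB}], [qw{ABCGE JHAGT}];
```

For the three test pairs this reports 0, 3 and 4, matching the loop-based versions. Whether it is actually faster on real data is something to measure rather than assume.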

Anyway, some code for comparison:

use strict;
use warnings;
use List::Util qw(min);


my @tests = (
    [qw{ABCGE ABCGE}],
    [qw{ABCGE FGCGB}],
    [qw{ABCGE JHAGT}],
);


foreach my $arry_ref (@tests)
{
    my ($str1,$str2) = @$arry_ref;
    
    # perhaps optional check to use shortest length
    my $len = min (length ($str1), length($str2));
    
    my $c_delta = 0;
    my $i =0;
    
    while ($i < $len)
    {
        $c_delta++ if (substr($str1,$i,1) ne substr($str2,$i,1));
        $i++;
    }
    
    print "$str1 $str2 $c_delta\n";
}
__END__
ABCGE ABCGE 0
ABCGE FGCGB 3
ABCGE JHAGT 4