Comparing string characters


No such thing as a small change
	PerlMonks

Comparing string characters

by dnamonk (Acolyte)

on Nov 22, 2021 at 16:35 UTC ( [id://11139026]=perlquestion: print w/replies, xml )

Need Help??

dnamonk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am wondering what is the easiest way to compare substrings to know how many mismatches are there. For example:

String1 String2 #Mismatches

ABCGE ABCGE 0

ABCGE FGCGB 3

ABCGE JHAGT 4

I was thinking to compare arrays. Or maybe create a hash array and then use index() ?

Thanks in advance :)

Comment on Comparing string characters

Replies are listed 'Best First'.
Re: Comparing string characters by choroba (Cardinal) on Nov 22, 2021 at 16:42 UTC
It seems like you're interested in the edit distance. There are several different algorithms, and the most well-known is Levenshtein distance, implemented for example by Text::Levenshtein. `#!/usr/bin/perl use warnings; use strict; use Text::Levenshtein qw{ distance }; use Test::More tests => 3; is distance('ABCGE', 'ABCGE'), 0; is distance('ABCGE', 'FGCGB'), 3; is distance('ABCGE', 'JHAGT'), 4;` [download] `map{substr$_->[0],$_->[1]\|\|0,1}[\\|\|{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^ARGV,3]`	[reply] [d/l] [select]
Re^2: Comparing string characters by dnamonk (Acolyte) on Nov 22, 2021 at 18:19 UTC
Thanks. That was really helpful :)	[reply]
Re^2: Comparing string characters by LanX (Saint) on Nov 22, 2021 at 17:42 UTC
Levenshtein might be overkill, I can't see a requirement for "edit distance" in these samples Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery}	[reply]
Re: Comparing string characters by kcott (Archbishop) on Nov 22, 2021 at 23:50 UTC
G'day dnamonk, Code: `#!/usr/bin/env perl use strict; use warnings; my @tests = ( [qw{ABCGE ABCGE}], [qw{ABCGE FGCGB}], [qw{ABCGE JHAGT}], ); for my $strings (@tests) { my $diff = 0; for my $i (0 .. length($strings->[0]) - 1) { my @chars = map substr($strings->[$_], $i, 1), 0, 1; ++$diff if $chars[0] ne $chars[1]; } print "@$strings $diff\n"; }` [download] Output: `ABCGE ABCGE 0 ABCGE FGCGB 3 ABCGE JHAGT 4` [download] You've received a number of solutions; use Benchmark to see which is the most efficient. From your username, I'm guessing you're dealing with biological data: typically huge and efficiency is usually important. — Ken	[reply] [d/l] [select]
Re^2: Comparing string characters by Marshall (Canon) on Nov 23, 2021 at 18:56 UTC
If there is a desire to increase efficiency, I would get rid of the map which is actually a loop, and also the array indexing. Code shown below. Of course, as you suggest benchmarking is absolutely necessary if high performance is desired. The example test strings need to be much longer because now with just 5 characters, the setup code dwarfs the actual comparison code. An XS procedure written in C could be done very efficiently. A goal would be to reduce the number of main memory cycles. I have a 64 bit machine. There is a bit of setup and cleanup code to focus on the 64 bit aligned block of memory. Read the buffers 8 bytes at a time. Do an XOR operation. If zero, all 8 bytes are the same. If not, then test each byte to see how many bytes differed. An assembly solution probably would provide significant performance increases over the C implementation. This is one of those cases where a human can probably easily beat the compiler. That's because there are some special buffer oriented instructions that are very difficult for the compiler to use effectively. Anyway some code for comparison.. `use strict; use warnings; use Data::Dumper; use List::Util qw(min); my @tests = ( [qw{ABCGE ABCGE}], [qw{ABCGE FGCGB}], [qw{ABCGE JHAGT}], ); foreach my $arry_ref (@tests) { my ($str1,$str2) = @$arry_ref; # perhaps optional check to use shortest length my $len = min (length ($str1), length($str2)); my $c_delta = 0; my $i =0; while ($i < $len) { $c_delta++ if (substr($str1,$i,1) ne substr($str2,$i,1)); + $i++; } print "$str1 $str2 $c_delta\n"; } __END__ ABCGE ABCGE 0 ABCGE FGCGB 3 ABCGE JHAGT 4` [download]	[reply] [d/l]
Re: Comparing string characters by LanX (Saint) on Nov 22, 2021 at 17:31 UTC
> easiest way to compare substrings `use v5.12; use warnings; my $x ='ABCGE ABCGE'; my $y ='ABCGE FGCGB'; my $z ='ABCGE JHAGT'; say scalar ($x ^ $y) =~ tr/\0//c; say scalar ($x ^ $z) =~ tr/\0//c;` [download] `3 4` [download] edit Oh I probably misread your examples, but if you compare first and second column you'll get the same results. update covering both cases `use v5.12; use warnings; use Data::Dump qw/pp dd/; use Test::More; sub distance { my ( $x,$y ) = @_ ; return scalar (("$x" ^ "$y") =~ tr/\0//c ); } my $x ='ABCGE ABCGE'; my $y ='ABCGE FGCGB'; my $z ='ABCGE JHAGT'; is( distance($x,$y) => 3, "by row"); is( distance($x,$z) => 4, "by row"); while (<DATA>){ my @col = split; is( distance($col[0],$col[1]) => $col[2],"by col"); } done_testing; __DATA__ ABCGE ABCGE 0 ABCGE FGCGB 3 ABCGE JHAGT 4` [download] Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery}	[reply] [d/l] [select]
Re^2: Comparing string characters by dnamonk (Acolyte) on Nov 22, 2021 at 18:20 UTC
Thanks a lot. That was really helpful :)	[reply]

Back to Seekers of Perl Wisdom

Log In^?

Domain Nodelet^?

www.com | www.net | www.org

Node Status^?

node history
Node Type: perlquestion [id://11139026]
Approved by marto
help

Chatterbox^?

How do I use this? • Last hour • Other CB clients

Other Users^?

Others sharing their wisdom with the Monastery: (4)

As of 2024-04-19 03:18 GMT

Sections^?

Information^?

Find Nodes^?

Leftovers^?

Today I Learned

Voting Booth^?

No recent polls found