Re: Partial Searches Against Multiple Wildcards
by hv (Prior) on Dec 18, 2023 at 03:24 UTC
|
They'd like to insert an entry like ... D7*D-48*6 ... the expectation being that searching for D7RD or DD482 would return [it]
If I understand correctly, you want a) that each character of the search term match either the same character or * in the corresponding entry; b) that an arbitrary number of characters can appear in the entry between any two characters of the search term.
This corresponds easily to a regular expression: replace each character x of the search term with a choice /x|\*/ (or character class [x*]), and insert a "match anything" pattern /.*/ between each pair:
sub make_regexp {
my($search_term) = @_;
my $str_re = join '.*', map "[$_*]", split //, $search_term;
return qr{$str_re};
}
Since anything can match between search term characters, no special handling is needed for the hyphen. (But if you want to allow them to specify it and have it not match an asterisk, then it will need special handling in the map.)
In terms of precomputing possible matches, the fact that an arbitrary number of characters can appear between any two characters of the search term pretty much rules that out.
Hope this helps. | [reply] [d/l] [select] |
|
|
| [reply] |
|
|
> that an arbitrary number of characters can appear in the entry between any two characters of the search term.
I don't see this. Only that one single hyphen can appear in between.
But the character class approach is nice and should also be translatable to an SQL LIKE.
Best solution so far! Just join with "-?" instead of ".*" to handle hyphens.
👍🏼
update
I think I see your confusion, but DD482 can match the bracketed part of D7(*D-48*)6 it doesn't start with the first D.
| [reply] [d/l] [select] |
Re: Partial Searches Against Multiple Wildcards
by kcott (Archbishop) on Dec 18, 2023 at 07:10 UTC
|
G'day p_jacobson,
Welcome to the Monastery.
The following code meets all of your expectations.
It requires Perl v5.14 for the non-destructive transliteration and substitutions.
#!/usr/bin/env perl
use v5.14;
use warnings;
use constant OLD => 1;
my @data = (
[qw{1 ABFD-1234 AAA}],
[qw{2 ABFD-178G BBB}],
[qw{3 F2HB-9401 AAA}],
[qw{4 ZDDR-00W5 DDD}],
[qw{5 D7*D-48*6 EEE}],
);
my @wild_cards = qw{940 ABFD-1 ABFD1 D7RD DD482};
for my $search (@wild_cards) {
say $search;
my $re = qr{@{[$search
=~ y/-//dr
=~ s//-?/gr
=~ s/([A-Z0-9])/(?:$1|\\*)/gr
]}};
for my $datum (@data) {
say "@$datum" if $datum->[OLD] =~ $re;
}
}
Output:
940
3 F2HB-9401 AAA
ABFD-1
1 ABFD-1234 AAA
2 ABFD-178G BBB
ABFD1
1 ABFD-1234 AAA
2 ABFD-178G BBB
D7RD
5 D7*D-48*6 EEE
DD482
5 D7*D-48*6 EEE
| [reply] [d/l] [select] |
Re: Partial Searches Against Multiple Wildcards
by GrandFather (Saint) on Dec 18, 2023 at 02:17 UTC
|
OK, so it was an interesting problem:
use strict;
use warnings;
use DBI;
my $dbfile = 'D:\Delme~~\demo.db';
unlink $dbfile;
my $dbh = DBI->connect("dbi:SQLite:dbname=$dbfile","","");
$dbh->do (qq{CREATE TABLE Demo (rowid INTEGER, old VARCHAR(9), new VAR
+CHAR(3))});
my @testData = (
['ABFD-1234' => 'AAA'],
['ABFD-178G' => 'BBB'],
['F2HB-9401' => 'AAA'],
['ZDDR-00W5' => 'DDD'],
['D7*D-48*6' => 'EEE'],
['*7RD-4896' => 'FFF'],
['D7RD-489*' => 'GGG'],
);
my @old = map {$_->[0]} @testData;
my @new = map {$_->[1]} @testData;
my $sth = $dbh->prepare (qq{INSERT INTO Demo (old, new) VALUES (?, ?)}
+);
my $entries = $sth->execute_array ({}, \@old, \@new);
$sth = $dbh->prepare (qq{SELECT old, new FROM Demo WHERE old LIKE ?});
$sth->execute ('%*%');
my @candidates = @{$sth->fetchall_arrayref ()};
for my $target ('D7PD', 'D7RD', '4897', '4876') {
for my $candidate (map {$_->[0]} @candidates) {
my @parts = split '\*', $candidate;
my @rParts = reverse map {scalar reverse $_} @parts;
my $str = $target;
my $rStr = reverse $target;
eval {
matchPart($str, $_) || return for @parts;
return !length $str;
}
or eval {
$str = '';
rMatchPart($rStr, $_) || return for @rParts;
return ! length $rStr;
}
or next;
print "Matched '$target' against '$candidate'\n" if !length $s
+tr;
}
}
sub matchPart {
my ($str, $match) = @_;
$match = substr $match, 0, length $_[0];
return 1 if $_[0] =~ s/$match.?//;
}
sub rMatchPart {
my ($str, $match) = @_;
$match = substr $match, 0, length($str) - 1 if length $str;
return 1 if $_[0] =~ s/.?$match//;
}
Prints:
Matched 'D7PD' against 'D7*D-48*6'
Matched 'D7RD' against 'D7*D-48*6'
Matched 'D7RD' against '*7RD-4896'
Matched 'D7RD' against 'D7RD-489*'
Matched '4897' against 'D7RD-489*'
Only marginally tested. There are lots of edge cases that may be problematic, but maybe it's a hint in a useful direction.
Update: removed an edge case.
Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
| [reply] [d/l] [select] |
|
|
| [reply] |
Re: Partial Searches Against Multiple Wildcards
by LanX (Saint) on Dec 17, 2023 at 22:58 UTC
|
seems like your customer wants to define ranges.
> (A) the users and the maintainers are not computer people. To them, the asterisk does not mean one or more characters, it just means any single letter or number in that position.
> (B) the database will only contain a couple thousand rows so loading the entire database into memory is feasible.
> (C) the asterisk will only be the character class [A-Z0-9]
If that's the case, I can see two additional approaches.
- you search the DB for all 2^n combinations of an n-word search-term with positions replaced by *. In your example D7*D would lead to your expected solution. ² °
- you reverse the task by loading all old-keys into memory and create one (or multiple) long or-ed regexes out of it, which you try to match against the search term. When cleverly made optimization should lead to quick results.
I'd start with the first approach b/c it's way easier to implement. If performance becomes a hindrance go for the second one.
²) I really doubt that your customer will try to use many wildcards in the same ID, because this would turn the old->new mapping ambiguous. Anyway you can analyze upfront the max number of * allowed and their minimal distance. This will help you cut down the number of possible combinations considerably from the previous 2^n.
°) in the case of 5 asterix 2^n = 32 is far better than your previous 36^n = 60466176. | [reply] [d/l] [select] |
Re: Partial Searches Against Multiple Wildcards
by parv (Parson) on Dec 17, 2023 at 23:48 UTC
|
Corrected single-character match for SQL pattern, _, instead of ? used in a shell.
No solution, or clue; just thinking out loud ...
They'd like to insert an entry like this:
+-----+-----------+-----+
|rowid| old | new |
+-----+-----------+-----+
| 5 | D7*D-48*6 | EEE |
+-----+-----------+-----+
The expectation being that searching for D7RD or DD482 would return row #5.
... 😬oof, that is reverse of more ideal situation (search for %D7_D% against D7RD-4826).
As is, would have to search for a string character by character (as * could be anywhere), to iteratively reduce the result set: get row indices with D%; then search those for D7% of those row indices; then for D7R% of that result; and then for D7RD% of that result. (Or, work from larger string to smaller string.?)
Thinking of storing D7*D-48*6 as SQL pattern (D7_D-48_6) to search against if that could help any.
And I hate it.
| [reply] [d/l] [select] |
Re: Partial Searches Against Multiple Wildcards
by GrandFather (Saint) on Dec 18, 2023 at 00:01 UTC
|
More thinking out loud:
If there are only a few thousand entries then loading the whole thing into memory is pretty low cost. If you first scan for old keys that contain wild cards you are likely to end up with a manageable group of keys that you can then test one at a time against the search key. One way to do that is to turn the test old key into a regular expression and test for a match against the search key.
Let me know if you want to see some sample code.
Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
| [reply] |
Re: Partial Searches Against Multiple Wildcards
by tybalt89 (Monsignor) on Dec 18, 2023 at 18:27 UTC
|
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11156323
use warnings;
my %data = map { (split)[1], $_ } <DATA>;
for my $searchfor ( qw( 940 ABFD-1 ABFD1 D7RD DD482 ) )
{
print "\n$searchfor\n";
print ' ' x 8, $data{$_} for grep match($searchfor, $_), sort keys
+%data;
}
sub match
{
my ($match, $to) = map tr/-//dr, @_;
$to =~ /\*/ or return $to =~ $match;
my @pattern = map s/\*/[A-Z0-9]/gr, $to =~ /(?=(.{@{[length $match]}
+}))/g;
local $" = '|';
return $match =~ /@pattern/;
}
__DATA__
1 ABFD-1234 AAA
2 ABFD-178G BBB
3 F2HB-9401 AAA
4 ZDDR-00W5 DDD
5 D7*D-48*6 EEE
Outputs:
940
3 F2HB-9401 AAA
ABFD-1
1 ABFD-1234 AAA
2 ABFD-178G BBB
ABFD1
1 ABFD-1234 AAA
2 ABFD-178G BBB
D7RD
5 D7*D-48*6 EEE
DD482
5 D7*D-48*6 EEE
| [reply] [d/l] [select] |
|
|
Apart from the usual "better magic than readable style", nice approach.
You might want to consider to memoize the compiled regexes based on length($match)
This should be considerably faster on the long run.
OTOH it also depends on the proportion of wildcard entries.
If it's only a small percentage your solution should be faster than hv's
Of course he could also distinguish between those entries with wildcards and those without.
| [reply] [d/l] |
Re: Partial Searches Against Multiple Wildcards
by tybalt89 (Monsignor) on Dec 20, 2023 at 15:14 UTC
|
I was looking for a way to solve this without generating a regex. This is the best I have found so far.
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11156323
use warnings;
my %data = map { (split)[1] => $_ } <DATA>;
for my $searchfor ( qw( 940 ABFD-1 ABFD1 D7RD DD482 ) )
{
print "\n$searchfor\n";
print ' ' x 8, $data{$_} for grep match($searchfor, $_), sort keys
+%data;
}
sub match
{
use feature 'bitwise';
my ($match, $to) = map tr/*-/\0/dr, @_;
my $length = length $match;
return $to =~ /(?=(.{$length}))
(??{ ($match ^. $1) =~ tr!0-9A-Z\0!!c ? '(*FAIL)' : '(*ACCEPT)' })
+/gx;
}
__DATA__
1 ABFD-1234 AAA
2 ABFD-178G BBB
3 F2HB-9401 AAA
4 ZDDR-00W5 DDD
5 D7*D-48*6 EEE
| [reply] [d/l] |
Re: Partial Searches Against Multiple Wildcards
by harangzsolt33 (Deacon) on Dec 20, 2023 at 07:13 UTC
|
Although no one can write a better solution than hv, here is a slightly different approach. Here I try to calculate how similar a match is rather than just give a true or false value, match or no match. This programs displays a number 0-9, which means the matches can be sorted by relevance. It may mark similar items as a close match even if the letters or numbers aren't in the exact right order.
#!/usr/bin/perl -w
use strict;
use warnings;
my @LIST = qw(2E4H9A ACD1WB 10DLT4 EFC4Y0 1XAZ9B 3S6UDA AX79C2 CQAJ5F
+DGAK9F 0JQ0A9 1A9AP9 CH9FA3);
my @SEARCHSTR = qw(*A 1* 3*A A*9C*2 *9C2);
print "\n\n";
foreach my $s (@LIST)
{
print "\n";
foreach my $ss (@SEARCHSTR)
{
print( " $s==$ss ", CalcMatch($s, $ss));
}
}
exit;
##################################################
# String | v2023.12.19
# This function compares two strings and returns
# a number between 0 to 9 that shows how similar
# they are. Zero means there are no similarities
# at all, while 9 means the two match perfectly.
# The comparison is not case sensitive.
# The search string may contain asterisks which
# will match a string of any length.
#
# Usage: INTEGER = CalcMatch(String, SearchString)
#
sub CalcMatch
{
defined $_[0] && defined $_[1] or return 1;
my $SL = length($_[0]);
my $FL = length($_[1]);
$SL && $FL or return 1;
my $STR = uc(shift);
my $FIND = uc(shift);
my $SAME = 9;
my $COUNT = CountMatchingBytes($STR, $FIND) or return 0;
my $SIMILAR = int($COUNT / length($STR) * ($SAME - 1));
$FIND =~ tr|*||s; # Delete double asterisks
my $ACOUNT = $FIND =~ tr|*|*|; # Count asterisks
if ($ACOUNT == 0) { return $STR eq $FIND ? $SAME : $SIMILAR; }
if ($ACOUNT == length($FIND)) { return $SAME; }
# Locate first and last asterisk
my $FIRST = index($FIND, '*');
my $LAST = ($ACOUNT > 1) ? rindex($FIND, '*') : $FIRST;
# Here we're going to "cut" both strings into three parts:
# 1) the part before first asterisk
# 2) the part between two asterisks
# 3) the part after last asterisk
# And then we're going to compare each except the middle one,
# because we just do a search on the middle string.
# Okay, so let's compare the first section
# IF the pattern doesn't start with an asterisk:
if ($FIRST > 0)
{
substr($STR, 0, $FIRST) eq substr($FIND, 0, $FIRST) or return $SIM
+ILAR;
}
# Let's compare the last section
# IF the pattern doesn't end with an asterisk.
if ($LAST < length($FIND) - 1)
{
my $LEN = length($FIND) - $LAST - 1;
substr($STR, -$LEN) eq substr($FIND, -$LEN) or return $SIMILAR;
}
# Now, the part between the two asterisks must appear
# somewhere in the middle of the string.
if ($ACOUNT > 1 && $LAST > $FIRST)
{
my $MIDLEN = length($STR) - $FIRST - (length($FIND) - $LAST) + 1;
return index(substr($STR, $FIRST, $MIDLEN), substr($FIND, $FIRST+1
+, $LAST - $FIRST - 1)) < 0 ? $SIMILAR : $SAME;
}
# More than 2 asterisk are not supported.
if ($ACOUNT > 2)
{
return $SIMILAR; # I just noticed that this line will probably n
+ot get executed ever. lol
}
return $SAME;
}
##################################################
# String | v2023.12.19
# This function counts how many of the characters in
# STR1 match characters listed in STR2.
#
# Usage: COUNT = CountMatchingBytes(STR1, STR2)
#
sub CountMatchingBytes
{
defined $_[0] && defined $_[1] or return 0;
my $P = length($_[0]) or return 0;
my $C = 0; length($_[1]) or return 0;
while ($P--)
{ index($_[1], substr($_[0], $P, 1)) < 0 or $C++; }
return $C;
}
By the way, I love this thread! This was a very thought-provoking question for me. :D | [reply] [d/l] |
Re: Partial Searches Against Multiple Wildcards
by Bod (Parson) on Dec 18, 2023 at 21:23 UTC
|
Do you have any suggestions for perl-ish ways to tackle this?
My first thought is not to use Perl to do this, as you have said you are using a database.
Substitute the * wildcard characters for _ SQL wildcard characters and feed that directly into your SQL query.
my $search = 'D7*D';
$search =~ /*/_/g;
my ($id, $new) = $dbh->selectrow_array("SELECT rowid, new FROM table W
+HERE old LIKE ?", undef, $search);
(untested) | [reply] [d/l] [select] |
|
|
I think you misunderstood the requirements.
| [reply] |
|
|
| [reply] |
Re: Partial Searches Against Multiple Wildcards
by Anonymous Monk on Dec 18, 2023 at 02:05 UTC
|
Since there are only a few thousand, just save a regular expression for each one:
my $v = "D7*D-48*6";
my $r = $v;
$r =~ s/-/-?/g; # Dash optional
$r =~ s/\*/./g; # * any character
("D7RD" =~ m/$r/) ? ....
| [reply] [d/l] |
|
|
| [reply] |
A reply falls below the community's threshold of quality. You may see it by logging in.
|