Partial Searches Against Multiple Wildcards

p_jacobson has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Partial Searches Against Multiple Wildcards by hv (Prior) on Dec 18, 2023 at 03:24 UTC
They'd like to insert an entry like ... `D7D-486` ... the expectation being that searching for D7RD or DD482 would return [it] If I understand correctly, you want a) that each character of the search term match either the same character or `` in the corresponding entry; b) that an arbitrary number of characters can appear in the entry between any two characters of the search term. This corresponds easily to a regular expression: replace each character `x` of the search term with a choice `/x\|\/` (or character class `[x]`), and insert a "match anything" pattern `/./` between each pair: `sub make_regexp { my($search_term) = @_; my $str_re = join '.', map "[$_]", split //, $search_term; return qr{$str_re}; }` [download] Since anything can match between search term characters, no special handling is needed for the hyphen. (But if you want to allow them to specify it and have it not match an asterisk, then it will need special handling in the `map`.) In terms of precomputing possible matches, the fact that an arbitrary number of characters can appear between any two characters of the search term pretty much rules that out. Hope this helps.	[reply] [d/l] [select]
Re^2: Partial Searches Against Multiple Wildcards by p_jacobson (Novice) on Dec 18, 2023 at 04:18 UTC
Thank you. This is a great solution and why I needed some fresh eyes looking at this. As noted by LanX, there is no arbitrary number of characters where an asterisk appears. Joining with -? handles my use-case and passes all my tests. Thanks again for your answer.	[reply]
Re^2: Partial Searches Against Multiple Wildcards by LanX (Saint) on Dec 18, 2023 at 03:50 UTC
> that an arbitrary number of characters can appear in the entry between any two characters of the search term. I don't see this. Only that one single hyphen can appear in between. But the character class approach is nice and should also be translatable to an SQL LIKE. Best solution so far! Just join with `"-?"` instead of `"."` to handle hyphens. 👍🏼 update I think I see your confusion, but `DD482` can match the bracketed part of `D7(D-48*)6` it doesn't start with the first D. Cheers Rolf _{(addicted to the Perl Programming Language :) see Wikisyntax for the Monastery}	[reply] [d/l] [select]
Re: Partial Searches Against Multiple Wildcards by kcott (Archbishop) on Dec 18, 2023 at 07:10 UTC
G'day p_jacobson, Welcome to the Monastery. The following code meets all of your expectations. It requires Perl `v5.14` for the non-destructive transliteration and substitutions. `#!/usr/bin/env perl use v5.14; use warnings; use constant OLD => 1; my @data = ( [qw{1 ABFD-1234 AAA}], [qw{2 ABFD-178G BBB}], [qw{3 F2HB-9401 AAA}], [qw{4 ZDDR-00W5 DDD}], [qw{5 D7D-486 EEE}], ); my @wild_cards = qw{940 ABFD-1 ABFD1 D7RD DD482}; for my $search (@wild_cards) { say $search; my $re = qr{@{[$search =~ y/-//dr =~ s//-?/gr =~ s/([A-Z0-9])/(?:$1\|\\)/gr ]}}; for my $datum (@data) { say "@$datum" if $datum->[OLD] =~ $re; } }` [download] Output: `940 3 F2HB-9401 AAA ABFD-1 1 ABFD-1234 AAA 2 ABFD-178G BBB ABFD1 1 ABFD-1234 AAA 2 ABFD-178G BBB D7RD 5 D7D-486 EEE DD482 5 D7D-48*6 EEE` [download] — Ken	[reply] [d/l] [select]
Re: Partial Searches Against Multiple Wildcards by GrandFather (Saint) on Dec 18, 2023 at 02:17 UTC
OK, so it was an interesting problem: use strict; use warnings; use DBI; my $dbfile = 'D:\Delme~~\demo.db'; unlink $dbfile; my $dbh = DBI->connect("dbi:SQLite:dbname=$dbfile","",""); $dbh->do (qq{CREATE TABLE Demo (rowid INTEGER, old VARCHAR(9), new VAR +CHAR(3))}); my @testData = ( ['ABFD-1234' => 'AAA'], ['ABFD-178G' => 'BBB'], ['F2HB-9401' => 'AAA'], ['ZDDR-00W5' => 'DDD'], ['D7D-486' => 'EEE'], ['7RD-4896' => 'FFF'], ['D7RD-489' => 'GGG'], ); my @old = map {$_->[0]} @testData; my @new = map {$_->[1]} @testData; my $sth = $dbh->prepare (qq{INSERT INTO Demo (old, new) VALUES (?, ?)} +); my $entries = $sth->execute_array ({}, \@old, \@new); $sth = $dbh->prepare (qq{SELECT old, new FROM Demo WHERE old LIKE ?}); $sth->execute ('%%'); my @candidates = @{$sth->fetchall_arrayref ()}; for my $target ('D7PD', 'D7RD', '4897', '4876') { for my $candidate (map {$_->[0]} @candidates) { my @parts = split '\', $candidate; my @rParts = reverse map {scalar reverse $_} @parts; my $str = $target; my $rStr = reverse $target; eval { matchPart($str, $_) \|\| return for @parts; return !length $str; } or eval { $str = ''; rMatchPart($rStr, $_) \|\| return for @rParts; return ! length $rStr; } or next; print "Matched '$target' against '$candidate'\n" if !length $s +tr; } } sub matchPart { my ($str, $match) = @_; $match = substr $match, 0, length $_[0]; return 1 if $_[0] =~ s/$match.?//; } sub rMatchPart { my ($str, $match) = @_; $match = substr $match, 0, length($str) - 1 if length $str; return 1 if $_[0] =~ s/.?$match//; } [download] Prints: `Matched 'D7PD' against 'D7D-486' Matched 'D7RD' against 'D7D-486' Matched 'D7RD' against '7RD-4896' Matched 'D7RD' against 'D7RD-489' Matched '4897' against 'D7RD-489'` [download] Only marginally tested. There are lots of edge cases that may be problematic, but maybe it's a hint in a useful direction. Update:* removed an edge case. Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond	[reply] [d/l] [select]
Re^2: Partial Searches Against Multiple Wildcards by p_jacobson (Novice) on Dec 18, 2023 at 04:37 UTC
Thank you for the example code. It was an interesting solution that got me moving in the right direction. It works well when the search string matches from the beginning of (or all the way to the end of) the database string. I'll probably use hv's solution for this particular project but I think I'll try and flesh out your solution a bit more so I have another option if needed. Thanks again for taking a look.	[reply]
Re: Partial Searches Against Multiple Wildcards by LanX (Saint) on Dec 17, 2023 at 22:58 UTC
seems like your customer wants to define ranges. > (A) the users and the maintainers are not computer people. To them, the asterisk does not mean one or more characters, it just means any single letter or number in that position. > (B) the database will only contain a couple thousand rows so loading the entire database into memory is feasible. > (C) the asterisk will only be the character class `[A-Z0-9]` If that's the case, I can see two additional approaches. you search the DB for all 2^n combinations of an n-word search-term with positions replaced by . In your example `D7D` would lead to your expected solution. � � you reverse the task by loading all old-keys into memory and create one (or multiple) long or-ed regexes out of it, which you try to match against the search term. When cleverly made optimization should lead to quick results. I'd start with the first approach b/c it's way easier to implement. If performance becomes a hindrance go for the second one. Cheers Rolf _{(addicted to the Perl Programming Language :) see Wikisyntax for the Monastery} �) I really doubt that your customer will try to use many wildcards in the same ID, because this would turn the old->new mapping ambiguous. Anyway you can analyze upfront the max number of `*` allowed and their minimal distance. This will help you cut down the number of possible combinations considerably from the previous 2^n. �) in the case of 5 asterix 2^n = 32 is far better than your previous 36^n = 60466176.	[reply] [d/l] [select]
Re: Partial Searches Against Multiple Wildcards by parv (Parson) on Dec 17, 2023 at 23:48 UTC
Corrected single-character match for SQL pattern, `_`, instead of `?` used in a shell. No solution, or clue; just thinking out loud ... They'd like to insert an entry like this: `+-----+-----------+-----+ \|rowid\| old \| new \| +-----+-----------+-----+ \| 5 \| D7D-486 \| EEE \| +-----+-----------+-----+` [download] The expectation being that searching for D7RD or DD482 would return row #5. ... 😬oof, that is reverse of more ideal situation (search for `%D7_D%` against `D7RD-4826`). As is, would have to search for a string character by character (as `` could be anywhere), to iteratively reduce the result set: get row indices with `D%`; then search those for `D7%` of those row indices; then for `D7R%` of that result; and then for `D7RD%` of that result. (Or, work from larger string to smaller string.?) Thinking of storing `D7D-48*6` as SQL pattern (`D7_D-48_6`) to search against if that could help any. And I hate it.	[reply] [d/l] [select]
Re: Partial Searches Against Multiple Wildcards by GrandFather (Saint) on Dec 18, 2023 at 00:01 UTC
More thinking out loud: If there are only a few thousand entries then loading the whole thing into memory is pretty low cost. If you first scan for old keys that contain wild cards you are likely to end up with a manageable group of keys that you can then test one at a time against the search key. One way to do that is to turn the test old key into a regular expression and test for a match against the search key. Let me know if you want to see some sample code. Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond	[reply]
Re: Partial Searches Against Multiple Wildcards by tybalt89 (Monsignor) on Dec 18, 2023 at 18:27 UTC
Just an idea #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11156323 use warnings; my %data = map { (split)[1], $_ } <DATA>; for my $searchfor ( qw( 940 ABFD-1 ABFD1 D7RD DD482 ) ) { print "\n$searchfor\n"; print ' ' x 8, $data{$_} for grep match($searchfor, $_), sort keys +%data; } sub match { my ($match, $to) = map tr/-//dr, @_; $to =~ /\/ or return $to =~ $match; my @pattern = map s/\/[A-Z0-9]/gr, $to =~ /(?=(.{@{[length $match]} +}))/g; local $" = '\|'; return $match =~ /@pattern/; } __DATA__ 1 ABFD-1234 AAA 2 ABFD-178G BBB 3 F2HB-9401 AAA 4 ZDDR-00W5 DDD 5 D7D-486 EEE [download] Outputs: `940 3 F2HB-9401 AAA ABFD-1 1 ABFD-1234 AAA 2 ABFD-178G BBB ABFD1 1 ABFD-1234 AAA 2 ABFD-178G BBB D7RD 5 D7D-486 EEE DD482 5 D7D-486 EEE` [download]	[reply] [d/l] [select]
Re^2: Partial Searches Against Multiple Wildcards by LanX (Saint) on Dec 18, 2023 at 21:49 UTC
Apart from the usual "better magic than readable style", nice approach. You might want to consider to memoize the compiled regexes based on `length($match)` This should be considerably faster on the long run. OTOH it also depends on the proportion of wildcard entries. If it's only a small percentage your solution should be faster than hv's Of course he could also distinguish between those entries with wildcards and those without. Cheers Rolf _{(addicted to the Perl Programming Language :) see Wikisyntax for the Monastery}	[reply] [d/l]
Re: Partial Searches Against Multiple Wildcards by tybalt89 (Monsignor) on Dec 20, 2023 at 15:14 UTC
I was looking for a way to solve this without generating a regex. This is the best I have found so far. #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11156323 use warnings; my %data = map { (split)[1] => $_ } <DATA>; for my $searchfor ( qw( 940 ABFD-1 ABFD1 D7RD DD482 ) ) { print "\n$searchfor\n"; print ' ' x 8, $data{$_} for grep match($searchfor, $_), sort keys +%data; } sub match { use feature 'bitwise'; my ($match, $to) = map tr/-/\0/dr, @_; my $length = length $match; return $to =~ /(?=(.{$length})) (??{ ($match ^. $1) =~ tr!0-9A-Z\0!!c ? '(FAIL)' : '(ACCEPT)' }) +/gx; } __DATA__ 1 ABFD-1234 AAA 2 ABFD-178G BBB 3 F2HB-9401 AAA 4 ZDDR-00W5 DDD 5 D7D-48*6 EEE [download]	[reply] [d/l]
Re: Partial Searches Against Multiple Wildcards by harangzsolt33 (Deacon) on Dec 20, 2023 at 07:13 UTC
Although no one can write a better solution than hv, here is a slightly different approach. Here I try to calculate how similar a match is rather than just give a true or false value, match or no match. This programs displays a number 0-9, which means the matches can be sorted by relevance. It may mark similar items as a close match even if the letters or numbers aren't in the exact right order. #!/usr/bin/perl -w use strict; use warnings; my @LIST = qw(2E4H9A ACD1WB 10DLT4 EFC4Y0 1XAZ9B 3S6UDA AX79C2 CQAJ5F +DGAK9F 0JQ0A9 1A9AP9 CH9FA3); my @SEARCHSTR = qw(A 1 3A A9C2 9C2); print "\n\n"; foreach my $s (@LIST) { print "\n"; foreach my $ss (@SEARCHSTR) { print( " $s==$ss ", CalcMatch($s, $ss)); } } exit; ################################################## # String \| v2023.12.19 # This function compares two strings and returns # a number between 0 to 9 that shows how similar # they are. Zero means there are no similarities # at all, while 9 means the two match perfectly. # The comparison is not case sensitive. # The search string may contain asterisks which # will match a string of any length. # # Usage: INTEGER = CalcMatch(String, SearchString) # sub CalcMatch { defined $_[0] && defined $_[1] or return 1; my $SL = length($_[0]); my $FL = length($_[1]); $SL && $FL or return 1; my $STR = uc(shift); my $FIND = uc(shift); my $SAME = 9; my $COUNT = CountMatchingBytes($STR, $FIND) or return 0; my $SIMILAR = int($COUNT / length($STR) * ($SAME - 1)); $FIND =~ tr\|\|\|s; # Delete double asterisks my $ACOUNT = $FIND =~ tr\|\|\|; # Count asterisks if ($ACOUNT == 0) { return $STR eq $FIND ? $SAME : $SIMILAR; } if ($ACOUNT == length($FIND)) { return $SAME; } # Locate first and last asterisk my $FIRST = index($FIND, ''); my $LAST = ($ACOUNT > 1) ? rindex($FIND, '*') : $FIRST; # Here we're going to "cut" both strings into three parts: # 1) the part before first asterisk # 2) the part between two asterisks # 3) the part after last asterisk # And then we're going to compare each except the middle one, # because we just do a search on the middle string. # Okay, so let's compare the first section # IF the pattern doesn't start with an asterisk: if ($FIRST > 0) { substr($STR, 0, $FIRST) eq substr($FIND, 0, $FIRST) or return $SIM +ILAR; } # Let's compare the last section # IF the pattern doesn't end with an asterisk. if ($LAST < length($FIND) - 1) { my $LEN = length($FIND) - $LAST - 1; substr($STR, -$LEN) eq substr($FIND, -$LEN) or return $SIMILAR; } # Now, the part between the two asterisks must appear # somewhere in the middle of the string. if ($ACOUNT > 1 && $LAST > $FIRST) { my $MIDLEN = length($STR) - $FIRST - (length($FIND) - $LAST) + 1; return index(substr($STR, $FIRST, $MIDLEN), substr($FIND, $FIRST+1 +, $LAST - $FIRST - 1)) < 0 ? $SIMILAR : $SAME; } # More than 2 asterisk are not supported. if ($ACOUNT > 2) { return $SIMILAR; # I just noticed that this line will probably n +ot get executed ever. lol } return $SAME; } ################################################## # String \| v2023.12.19 # This function counts how many of the characters in # STR1 match characters listed in STR2. # # Usage: COUNT = CountMatchingBytes(STR1, STR2) # sub CountMatchingBytes { defined $_[0] && defined $_[1] or return 0; my $P = length($_[0]) or return 0; my $C = 0; length($_[1]) or return 0; while ($P--) { index($_[1], substr($_[0], $P, 1)) < 0 or $C++; } return $C; } [download] By the way, I love this thread! This was a very thought-provoking question for me. :D	[reply] [d/l]
Re: Partial Searches Against Multiple Wildcards by Bod (Parson) on Dec 18, 2023 at 21:23 UTC
Do you have any suggestions for perl-ish ways to tackle this? My first thought is not to use Perl to do this, as you have said you are using a database. Substitute the `` wildcard characters for `_` SQL wildcard characters and feed that directly into your SQL query. `my $search = 'D7D'; $search =~ //_/g; my ($id, $new) = $dbh->selectrow_array("SELECT rowid, new FROM table W +HERE old LIKE ?", undef, $search);` [download] (untested)*	[reply] [d/l] [select]
Re^2: Partial Searches Against Multiple Wildcards by LanX (Saint) on Dec 18, 2023 at 21:41 UTC
I think you misunderstood the requirements. Cheers Rolf _{(addicted to the Perl Programming Language :) see Wikisyntax for the Monastery}	[reply]
Re^3: Partial Searches Against Multiple Wildcards by Bod (Parson) on Dec 18, 2023 at 21:58 UTC
Ah yes! Re-reading it - perhaps I did...	[reply]
Re: Partial Searches Against Multiple Wildcards by Anonymous Monk on Dec 18, 2023 at 02:05 UTC
Since there are only a few thousand, just save a regular expression for each one: `my $v = "D7D-486"; my $r = $v; $r =~ s/-/-?/g; # Dash optional $r =~ s/\/./g; # any character ("D7RD" =~ m/$r/) ? ....` [download]	[reply] [d/l]
Re^2: Partial Searches Against Multiple Wildcards by LanX (Saint) on Dec 18, 2023 at 02:24 UTC
Which obviously doesn't work. The pattern can be a sub-part of the string, not the other way round. You should test before posting. Cheers Rolf _{(addicted to the Perl Programming Language :) see Wikisyntax for the Monastery}	[reply]
A reply falls below the community's threshold of quality. You may see it by logging in.

update