p_jacobson has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks. I'm building a cross reference utility in Perl and I've run into something that has me stumped. The utility is very basic, users search for an old item number and the utility returns a list of matches. The database is simple:

+-----+-----------+-----+ |rowid| old | new | +-----+-----------+-----+ | 1 | ABFD-1234 | AAA | | 2 | ABFD-178G | BBB | | 3 | F2HB-9401 | AAA | | 4 | ZDDR-00W5 | DDD | +-----+-----------+-----+

I've implemented the ability to search anywhere in the old item number (searching for 940 returns row #3) and the ability to not enter the hyphen (searching for ABFD-1 or ABFD1 returns rows #1 and #2.) The users maintaining the utility have asked to use wildcards in the old item number. They'd like to insert an entry like this:

+-----+-----------+-----+ |rowid| old | new | +-----+-----------+-----+ | 5 | D7*D-48*6 | EEE | +-----+-----------+-----+

The expectation being that searching for D7RD or DD482 would return row #5. This wildcard request is what has me stumped. I haven't figured out a sensible way to implement it and I'm thinking I need new sets of eyes to lead me down the right path.

A few things that are potentially relevant: (A) the users and the maintainers are not computer people. To them, the asterisk does not mean one or more characters, it just means any single letter or number in that position. (B) the database will only contain a couple thousand rows so loading the entire database into memory is feasible. (C) the asterisk will only be the character class [A-Z0-9]

One option I've tested is to take any old item number with an asterisk and pre-generate all possible combinations. The challenge with this option is that there are old item numbers that have 5 asterisks which means creating 60 million additional entries for that one item. I've also tried building the 60 million rows in memory and searching against those but the application performance degrades as more wildcard entries are added into the database.

Do you have any suggestions for perl-ish ways to tackle this? I don't really need code examples, just ideas that would be performant.

Replies are listed 'Best First'.
Re: Partial Searches Against Multiple Wildcards
by hv (Prior) on Dec 18, 2023 at 03:24 UTC

    They'd like to insert an entry like ... D7*D-48*6 ... the expectation being that searching for D7RD or DD482 would return [it]

    If I understand correctly, you want a) that each character of the search term match either the same character or * in the corresponding entry; b) that an arbitrary number of characters can appear in the entry between any two characters of the search term.

    This corresponds easily to a regular expression: replace each character x of the search term with a choice /x|\*/ (or character class [x*]), and insert a "match anything" pattern /.*/ between each pair:

    sub make_regexp { my($search_term) = @_; my $str_re = join '.*', map "[$_*]", split //, $search_term; return qr{$str_re}; }

    Since anything can match between search term characters, no special handling is needed for the hyphen. (But if you want to allow them to specify it and have it not match an asterisk, then it will need special handling in the map.)

    In terms of precomputing possible matches, the fact that an arbitrary number of characters can appear between any two characters of the search term pretty much rules that out.

    Hope this helps.

      Thank you. This is a great solution and why I needed some fresh eyes looking at this. As noted by LanX, there is no arbitrary number of characters where an asterisk appears. Joining with -? handles my use-case and passes all my tests. Thanks again for your answer.

      > that an arbitrary number of characters can appear in the entry between any two characters of the search term.

      I don't see this. Only that one single hyphen can appear in between.

      But the character class approach is nice and should also be translatable to an SQL LIKE.

      Best solution so far! Just join with "-?" instead of ".*" to handle hyphens.

      👍🏼

      update

      I think I see your confusion, but DD482 can match the bracketed part of D7(*D-48*)6 it doesn't start with the first D.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      see Wikisyntax for the Monastery

Re: Partial Searches Against Multiple Wildcards
by kcott (Archbishop) on Dec 18, 2023 at 07:10 UTC

    G'day p_jacobson,

    Welcome to the Monastery.

    The following code meets all of your expectations. It requires Perl v5.14 for the non-destructive transliteration and substitutions.

    #!/usr/bin/env perl use v5.14; use warnings; use constant OLD => 1; my @data = ( [qw{1 ABFD-1234 AAA}], [qw{2 ABFD-178G BBB}], [qw{3 F2HB-9401 AAA}], [qw{4 ZDDR-00W5 DDD}], [qw{5 D7*D-48*6 EEE}], ); my @wild_cards = qw{940 ABFD-1 ABFD1 D7RD DD482}; for my $search (@wild_cards) { say $search; my $re = qr{@{[$search =~ y/-//dr =~ s//-?/gr =~ s/([A-Z0-9])/(?:$1|\\*)/gr ]}}; for my $datum (@data) { say "@$datum" if $datum->[OLD] =~ $re; } }

    Output:

    940 3 F2HB-9401 AAA ABFD-1 1 ABFD-1234 AAA 2 ABFD-178G BBB ABFD1 1 ABFD-1234 AAA 2 ABFD-178G BBB D7RD 5 D7*D-48*6 EEE DD482 5 D7*D-48*6 EEE

    — Ken

Re: Partial Searches Against Multiple Wildcards
by GrandFather (Saint) on Dec 18, 2023 at 02:17 UTC

    OK, so it was an interesting problem:

    use strict; use warnings; use DBI; my $dbfile = 'D:\Delme~~\demo.db'; unlink $dbfile; my $dbh = DBI->connect("dbi:SQLite:dbname=$dbfile","",""); $dbh->do (qq{CREATE TABLE Demo (rowid INTEGER, old VARCHAR(9), new VAR +CHAR(3))}); my @testData = ( ['ABFD-1234' => 'AAA'], ['ABFD-178G' => 'BBB'], ['F2HB-9401' => 'AAA'], ['ZDDR-00W5' => 'DDD'], ['D7*D-48*6' => 'EEE'], ['*7RD-4896' => 'FFF'], ['D7RD-489*' => 'GGG'], ); my @old = map {$_->[0]} @testData; my @new = map {$_->[1]} @testData; my $sth = $dbh->prepare (qq{INSERT INTO Demo (old, new) VALUES (?, ?)} +); my $entries = $sth->execute_array ({}, \@old, \@new); $sth = $dbh->prepare (qq{SELECT old, new FROM Demo WHERE old LIKE ?}); $sth->execute ('%*%'); my @candidates = @{$sth->fetchall_arrayref ()}; for my $target ('D7PD', 'D7RD', '4897', '4876') { for my $candidate (map {$_->[0]} @candidates) { my @parts = split '\*', $candidate; my @rParts = reverse map {scalar reverse $_} @parts; my $str = $target; my $rStr = reverse $target; eval { matchPart($str, $_) || return for @parts; return !length $str; } or eval { $str = ''; rMatchPart($rStr, $_) || return for @rParts; return ! length $rStr; } or next; print "Matched '$target' against '$candidate'\n" if !length $s +tr; } } sub matchPart { my ($str, $match) = @_; $match = substr $match, 0, length $_[0]; return 1 if $_[0] =~ s/$match.?//; } sub rMatchPart { my ($str, $match) = @_; $match = substr $match, 0, length($str) - 1 if length $str; return 1 if $_[0] =~ s/.?$match//; }

    Prints:

    Matched 'D7PD' against 'D7*D-48*6' Matched 'D7RD' against 'D7*D-48*6' Matched 'D7RD' against '*7RD-4896' Matched 'D7RD' against 'D7RD-489*' Matched '4897' against 'D7RD-489*'

    Only marginally tested. There are lots of edge cases that may be problematic, but maybe it's a hint in a useful direction.

    Update: removed an edge case.

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond

      Thank you for the example code. It was an interesting solution that got me moving in the right direction. It works well when the search string matches from the beginning of (or all the way to the end of) the database string. I'll probably use hv's solution for this particular project but I think I'll try and flesh out your solution a bit more so I have another option if needed. Thanks again for taking a look.

Re: Partial Searches Against Multiple Wildcards
by LanX (Saint) on Dec 17, 2023 at 22:58 UTC
    seems like your customer wants to define ranges.

    > (A) the users and the maintainers are not computer people. To them, the asterisk does not mean one or more characters, it just means any single letter or number in that position.

    > (B) the database will only contain a couple thousand rows so loading the entire database into memory is feasible.

    > (C) the asterisk will only be the character class [A-Z0-9]

    If that's the case, I can see two additional approaches.

    1. you search the DB for all 2^n combinations of an n-word search-term with positions replaced by *. In your example D7*D would lead to your expected solution. ² °
    2. you reverse the task by loading all old-keys into memory and create one (or multiple) long or-ed regexes out of it, which you try to match against the search term. When cleverly made optimization should lead to quick results.

    I'd start with the first approach b/c it's way easier to implement. If performance becomes a hindrance go for the second one.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

    ²) I really doubt that your customer will try to use many wildcards in the same ID, because this would turn the old->new mapping ambiguous. Anyway you can analyze upfront the max number of * allowed and their minimal distance. This will help you cut down the number of possible combinations considerably from the previous 2^n.

    °) in the case of 5 asterix 2^n = 32 is far better than your previous 36^n = 60466176.

Re: Partial Searches Against Multiple Wildcards
by parv (Parson) on Dec 17, 2023 at 23:48 UTC

    Corrected single-character match for SQL pattern, _, instead of ? used in a shell.

    No solution, or clue; just thinking out loud ...

    They'd like to insert an entry like this:

    +-----+-----------+-----+ |rowid| old | new | +-----+-----------+-----+ | 5 | D7*D-48*6 | EEE | +-----+-----------+-----+

    The expectation being that searching for D7RD or DD482 would return row #5.

    ... 😬oof, that is reverse of more ideal situation (search for %D7_D% against D7RD-4826).

    As is, would have to search for a string character by character (as * could be anywhere), to iteratively reduce the result set: get row indices with D%; then search those for D7% of those row indices; then for D7R% of that result; and then for D7RD% of that result. (Or, work from larger string to smaller string.?)

    Thinking of storing D7*D-48*6 as SQL pattern (D7_D-48_6) to search against if that could help any.

    And I hate it.

Re: Partial Searches Against Multiple Wildcards
by GrandFather (Saint) on Dec 18, 2023 at 00:01 UTC

    More thinking out loud:

    If there are only a few thousand entries then loading the whole thing into memory is pretty low cost. If you first scan for old keys that contain wild cards you are likely to end up with a manageable group of keys that you can then test one at a time against the search key. One way to do that is to turn the test old key into a regular expression and test for a match against the search key.

    Let me know if you want to see some sample code.

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: Partial Searches Against Multiple Wildcards
by tybalt89 (Monsignor) on Dec 18, 2023 at 18:27 UTC

    Just an idea

    #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11156323 use warnings; my %data = map { (split)[1], $_ } <DATA>; for my $searchfor ( qw( 940 ABFD-1 ABFD1 D7RD DD482 ) ) { print "\n$searchfor\n"; print ' ' x 8, $data{$_} for grep match($searchfor, $_), sort keys +%data; } sub match { my ($match, $to) = map tr/-//dr, @_; $to =~ /\*/ or return $to =~ $match; my @pattern = map s/\*/[A-Z0-9]/gr, $to =~ /(?=(.{@{[length $match]} +}))/g; local $" = '|'; return $match =~ /@pattern/; } __DATA__ 1 ABFD-1234 AAA 2 ABFD-178G BBB 3 F2HB-9401 AAA 4 ZDDR-00W5 DDD 5 D7*D-48*6 EEE

    Outputs:

    940 3 F2HB-9401 AAA ABFD-1 1 ABFD-1234 AAA 2 ABFD-178G BBB ABFD1 1 ABFD-1234 AAA 2 ABFD-178G BBB D7RD 5 D7*D-48*6 EEE DD482 5 D7*D-48*6 EEE
      Apart from the usual "better magic than readable style", nice approach.

      You might want to consider to memoize the compiled regexes based on length($match)

      This should be considerably faster on the long run.

      OTOH it also depends on the proportion of wildcard entries.

      If it's only a small percentage your solution should be faster than hv's

      Of course he could also distinguish between those entries with wildcards and those without.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      see Wikisyntax for the Monastery

Re: Partial Searches Against Multiple Wildcards
by tybalt89 (Monsignor) on Dec 20, 2023 at 15:14 UTC

    I was looking for a way to solve this without generating a regex. This is the best I have found so far.

    #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11156323 use warnings; my %data = map { (split)[1] => $_ } <DATA>; for my $searchfor ( qw( 940 ABFD-1 ABFD1 D7RD DD482 ) ) { print "\n$searchfor\n"; print ' ' x 8, $data{$_} for grep match($searchfor, $_), sort keys +%data; } sub match { use feature 'bitwise'; my ($match, $to) = map tr/*-/\0/dr, @_; my $length = length $match; return $to =~ /(?=(.{$length})) (??{ ($match ^. $1) =~ tr!0-9A-Z\0!!c ? '(*FAIL)' : '(*ACCEPT)' }) +/gx; } __DATA__ 1 ABFD-1234 AAA 2 ABFD-178G BBB 3 F2HB-9401 AAA 4 ZDDR-00W5 DDD 5 D7*D-48*6 EEE
Re: Partial Searches Against Multiple Wildcards
by harangzsolt33 (Deacon) on Dec 20, 2023 at 07:13 UTC
    Although no one can write a better solution than hv, here is a slightly different approach. Here I try to calculate how similar a match is rather than just give a true or false value, match or no match. This programs displays a number 0-9, which means the matches can be sorted by relevance. It may mark similar items as a close match even if the letters or numbers aren't in the exact right order.

    #!/usr/bin/perl -w use strict; use warnings; my @LIST = qw(2E4H9A ACD1WB 10DLT4 EFC4Y0 1XAZ9B 3S6UDA AX79C2 CQAJ5F +DGAK9F 0JQ0A9 1A9AP9 CH9FA3); my @SEARCHSTR = qw(*A 1* 3*A A*9C*2 *9C2); print "\n\n"; foreach my $s (@LIST) { print "\n"; foreach my $ss (@SEARCHSTR) { print( " $s==$ss ", CalcMatch($s, $ss)); } } exit; ################################################## # String | v2023.12.19 # This function compares two strings and returns # a number between 0 to 9 that shows how similar # they are. Zero means there are no similarities # at all, while 9 means the two match perfectly. # The comparison is not case sensitive. # The search string may contain asterisks which # will match a string of any length. # # Usage: INTEGER = CalcMatch(String, SearchString) # sub CalcMatch { defined $_[0] && defined $_[1] or return 1; my $SL = length($_[0]); my $FL = length($_[1]); $SL && $FL or return 1; my $STR = uc(shift); my $FIND = uc(shift); my $SAME = 9; my $COUNT = CountMatchingBytes($STR, $FIND) or return 0; my $SIMILAR = int($COUNT / length($STR) * ($SAME - 1)); $FIND =~ tr|*||s; # Delete double asterisks my $ACOUNT = $FIND =~ tr|*|*|; # Count asterisks if ($ACOUNT == 0) { return $STR eq $FIND ? $SAME : $SIMILAR; } if ($ACOUNT == length($FIND)) { return $SAME; } # Locate first and last asterisk my $FIRST = index($FIND, '*'); my $LAST = ($ACOUNT > 1) ? rindex($FIND, '*') : $FIRST; # Here we're going to "cut" both strings into three parts: # 1) the part before first asterisk # 2) the part between two asterisks # 3) the part after last asterisk # And then we're going to compare each except the middle one, # because we just do a search on the middle string. # Okay, so let's compare the first section # IF the pattern doesn't start with an asterisk: if ($FIRST > 0) { substr($STR, 0, $FIRST) eq substr($FIND, 0, $FIRST) or return $SIM +ILAR; } # Let's compare the last section # IF the pattern doesn't end with an asterisk. if ($LAST < length($FIND) - 1) { my $LEN = length($FIND) - $LAST - 1; substr($STR, -$LEN) eq substr($FIND, -$LEN) or return $SIMILAR; } # Now, the part between the two asterisks must appear # somewhere in the middle of the string. if ($ACOUNT > 1 && $LAST > $FIRST) { my $MIDLEN = length($STR) - $FIRST - (length($FIND) - $LAST) + 1; return index(substr($STR, $FIRST, $MIDLEN), substr($FIND, $FIRST+1 +, $LAST - $FIRST - 1)) < 0 ? $SIMILAR : $SAME; } # More than 2 asterisk are not supported. if ($ACOUNT > 2) { return $SIMILAR; # I just noticed that this line will probably n +ot get executed ever. lol } return $SAME; } ################################################## # String | v2023.12.19 # This function counts how many of the characters in # STR1 match characters listed in STR2. # # Usage: COUNT = CountMatchingBytes(STR1, STR2) # sub CountMatchingBytes { defined $_[0] && defined $_[1] or return 0; my $P = length($_[0]) or return 0; my $C = 0; length($_[1]) or return 0; while ($P--) { index($_[1], substr($_[0], $P, 1)) < 0 or $C++; } return $C; }

    By the way, I love this thread! This was a very thought-provoking question for me. :D

Re: Partial Searches Against Multiple Wildcards
by Bod (Parson) on Dec 18, 2023 at 21:23 UTC
    Do you have any suggestions for perl-ish ways to tackle this?

    My first thought is not to use Perl to do this, as you have said you are using a database.

    Substitute the * wildcard characters for _ SQL wildcard characters and feed that directly into your SQL query.

    my $search = 'D7*D'; $search =~ /*/_/g; my ($id, $new) = $dbh->selectrow_array("SELECT rowid, new FROM table W +HERE old LIKE ?", undef, $search);

    (untested)

        Ah yes!
        Re-reading it - perhaps I did...

Re: Partial Searches Against Multiple Wildcards
by Anonymous Monk on Dec 18, 2023 at 02:05 UTC
    Since there are only a few thousand, just save a regular expression for each one:
    my $v = "D7*D-48*6"; my $r = $v; $r =~ s/-/-?/g; # Dash optional $r =~ s/\*/./g; # * any character ("D7RD" =~ m/$r/) ? ....
      Which obviously doesn't work. The pattern can be a sub-part of the string, not the other way round.

      You should test before posting.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      see Wikisyntax for the Monastery

      A reply falls below the community's threshold of quality. You may see it by logging in.