Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks!
I have some bacterial organism names that I want to look up in a big data file.
My problem is that I have cases like the following:
[NAME I HAVE]: Acidovorax JS42
[NAME IN THE DATASET FILE]: Acidovorax sp. JS42
So, is there a way that I can use a reg exp to match this? I tried:
$my_name="Acidovorax JS42";
$db_name="Acidovorax sp. JS42";
if($total=~/$find/) {
    print "OK\n";
}
and it did not work... So, is there not any way to match a string of some words that is contained in a bigger string but the words are not in order?
Thanks!

Replies are listed 'Best First'.
Re: Reg exps?
by Ratazong (Monsignor) on Jan 14, 2010 at 15:09 UTC

    So, is there not any way to match a string of some words that is contained in a bigger string but the words are not in order?

    1. split your name into various strings
    2. check if each of these strings is in your DB-string (e.g. using index() instead of a regex)
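
A minimal sketch of these two steps, assuming the words are separated by whitespace and that a plain, case-sensitive substring test with index() is good enough. The helper name all_words_present is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every whitespace-separated word of $name
# occurs somewhere in $db_string (plain substring test, no regex).
sub all_words_present {
    my ($name, $db_string) = @_;
    for my $word (split ' ', $name) {
        # index() returns -1 when $word is not a substring of $db_string
        return 0 if index($db_string, $word) < 0;
    }
    return 1;
}

print "OK\n" if all_words_present("Acidovorax JS42", "Acidovorax sp. JS42");
```

Because this is a substring test, a DB-string such as "JS42Acidovorax sp." would also be accepted. Whether that is acceptable depends on the data.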

    HTH, Rata
      Just a thought: after splitting your string into substrings, you could count the matches (increment $counter for each one) and accept the pair if, say, 2 out of 3 substrings (or whatever you think is OK) are found in the "complete string". After testing that on a larger amount of data, you can adjust the threshold until it fits ...
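
The counting idea could look roughly like this. It is only a sketch: the helper name enough_words_present and the $min_hits parameter are assumptions, and words are again taken to be whitespace-separated.

```perl
use strict;
use warnings;

# Hypothetical helper: counts how many words of $name appear in $db_string
# and accepts the pair when at least $min_hits of them do.
sub enough_words_present {
    my ($name, $db_string, $min_hits) = @_;
    my $counter = 0;
    for my $word (split ' ', $name) {
        $counter++ if index($db_string, $word) >= 0;
    }
    return $counter >= $min_hits ? 1 : 0;
}

# "Acidovorax" and "JS42" are found, "sp." is not: 2 hits out of 3.
print "OK\n" if enough_words_present("Acidovorax sp. JS42", "Acidovorax JS42 strain", 2);
```

A lower threshold lets more names through but also produces more false matches, so it is worth tuning against real data.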


      hth
      MH
Re: Reg exps?
by eff_i_g (Curate) on Jan 14, 2010 at 15:58 UTC
    Is it safe to assume that the beginning (Acidovorax) and end (JS42) must match, but the middle (sp.) doesn't have to? If so, you can split on a space and build a regex to enact this.
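
If that assumption holds, building the regex might look like the sketch below. The helper name ordered_pattern is hypothetical, and the pattern assumes whitespace-separated words that appear in the same order in both strings.

```perl
use strict;
use warnings;

# Hypothetical helper: builds a pattern in which the words of $name must
# appear in order, the first word at the start and the last word at the
# end, with any extra whitespace-delimited text allowed in between.
sub ordered_pattern {
    my ($name) = @_;
    my @words = map { quotemeta } split ' ', $name;
    # between two words: whitespace, then optionally more text ending in whitespace
    my $body = join '\s+(?:.*\s)?', @words;
    return qr/\A\s*$body\s*\z/;
}

my $re = ordered_pattern("Acidovorax JS42");
for my $db_name ("Acidovorax sp. JS42", "JS42 sp. Acidovorax") {
    printf "%-22s %s\n", $db_name, ($db_name =~ $re ? "match" : "no match");
}
```

This accepts "Acidovorax sp. JS42" but not the reordered "JS42 sp. Acidovorax", so it only fits if word order is reliable in the data file.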
      The "my_name" string is definitely included in the "db_name" string, but we can't know beforehand if the "db_name" string contains other characters or words as well. But we sure know that it contains all words from "my_name" string.

        I hope I understood your problem. Here is my sample code. Please check whether its flow and output fit your problem. If they don't, I hope it at least gives you some clues. As JavaFan said, check whether each word of "my_name" exists in one of the strings of the database "db_name".

        #!/usr/bin/perl -w

        @my_name=("Acidovorax","JS42");    #name to be searched...
                                           #...split by term/words
        @db_name=("Acidovorax sp. JS42",   #data base of names in array
                  "JS42 Acidovorax sp.",
                  "JS42 sp. Acidovorax",
                  "JS53 sp. Acidovorax",
                  "JS42 sp. Axidovorax",
                  "JS42Acidovorax sp. "
                 );
        my $ctr;

        #----compare strings in $db_name 1 at a time
        foreach my $db_each (@db_name){
            #----search each term/word of @my_name in $db_each
            foreach($ctr=0; $ctr<=$#my_name; $ctr++) {
                #----if a term/word not found break the loop
                last if($db_each !~ /\b$my_name[$ctr]\b/i);
            }
            #----this will be true if inner foreach didn't break
            if($ctr==$#my_name+1) {
                print "$db_each\n";    #print the matched name
            }
        }
        <>;

        I mention "term/word" in the comments because a term in your "my_name" might be a single word, or two or more words separated by spaces.

        The code above does not handle the case where a word must occur two or more times, in any order. Example: my_name = "high class bacteria f7-52 high fever". I'm just mentioning this condition in case you want more flexibility in your program. It's up to you.

        This is more or less redundant after toastbread's response and quite possibly naive but I've done it now and it seems to do what the OP wanted:
        use strict;
        use warnings;

        my $my_name="Acidovorax JS42";
        my $db_name="Acidovorax sp. JS42";
        my @names = split " ", $my_name;
        my $numel = @names;    #number of words to match
        my $count = 0;         #start at 0 so the comparison below never sees undef
        foreach (@names) {
            #word must be delimited by whitespace or by the start/end of the string
            if ($db_name=~/(?:^|\s)\Q$_\E(?:\s|$)/) {
                $count++;    #count matches
            }
        }
        print "MATCH" if $count==$numel;    #report match if all words match
        --->Updated to fix regex - thx JavaFan
Re: Reg exps?
by JavaFan (Canon) on Jan 14, 2010 at 15:29 UTC
    I don't understand your code fragment. It doesn't say what $total and $find contain, and it doesn't use $my_name or $db_name.

    So, is there not any way to match a string of some words that is contained in a bigger string but the words are not in order?
    Sure there is, but that's a way of matching that has nothing to do with regexes. A way of doing what you want is to split both strings into "words" (for whatever definition of "word" is appropriate for your problem domain - there's no universal way of splitting a string into words), then check, for each word in the smaller string, whether it's in the larger string. There's a match if and only if all the words in the smaller string are found in the larger string. You should also decide what counts as a match if the smaller string contains a word more than once - may the repeats be matched with the same word in the larger string, or must they match at different positions?
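
One possible sketch of this approach, assuming "word" simply means a whitespace-separated token. The helper name words_contained and its $distinct flag are made up here. The flag covers the repeated-word question: when it is true, each occurrence in the smaller string has to be matched by a different occurrence in the larger one.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every word of $small is found among the
# words of $large. With $distinct true, a word repeated in $small must
# occur at least as many times in $large.
sub words_contained {
    my ($small, $large, $distinct) = @_;
    my %available;
    $available{$_}++ for split ' ', $large;    # word => number of occurrences
    for my $word (split ' ', $small) {
        return 0 unless $available{$word};
        $available{$word}-- if $distinct;      # use up one occurrence
    }
    return 1;
}

print "OK\n" if words_contained("Acidovorax JS42", "JS42 sp. Acidovorax");
```

The hash lookup compares whole words, so "JS42Acidovorax" would not count as containing "JS42", unlike a substring test.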
      Oh, sorry, typo mistake:
      $my_name="Acidovorax JS42";
      $db_name="Acidovorax sp. JS42";
      if($my_name=~/$db_name/) {
          print "OK\n";
      }