Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to search an HTML file for a string such as:
Joe Fred Software Inc.
The string appears many times, in variations like:
Fred Joe Inc Software Company or Software Fred Joe Co.
Here is the regular expression I use to search for the string. My question is how to make the regular expression as flexible as possible, so that it finds any combination of the words in the string and prints all of them.
Thanks for the big help!!!

if ($file =~ /(\<font.*\>)($string)(\<\/font\>)/gi){$t1=$2;print "$t1";}

Re: Regular Expression Help
by kvale (Monsignor) on Sep 05, 2002 at 16:03 UTC
    Regexes won't automatically search for all permutations of words, so you must do it yourself:
    my @words = split ' ', $string;   # split the search string on whitespace
    my $found = 1;
    foreach my $word (@words) {
        # \Q...\E quotes regex metacharacters such as the "." in "Inc."
        unless ($file =~ /\Q$word\E/) {
            $found = 0;
            last;
        }
    }
    print "Found: $string" if $found;
    Dealing with synonyms such as "company", "co." and "inc." is a bit more difficult. You could either strip these out of the string or special-case them in the "and" loop.
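    One way the "strip these out" option could look is sketched below. The helper name has_all_words, the list of suffix words, and the sample $file are assumptions made for illustration, not part of the original question. The check is applied to the text of each font element separately, so that words from different elements are not combined into one match.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $text contains every word of $wanted,
# in any order. Company-suffix words are dropped from the wanted list
# first, so "Inc." versus "Co." does not matter.
sub has_all_words {
    my ($text, $wanted) = @_;
    my %suffix = map { $_ => 1 } qw(inc co company corp ltd);  # assumed list
    for my $word (split ' ', $wanted) {
        (my $bare = lc $word) =~ s/\W+$//;       # "Inc." becomes "inc"
        next if $bare eq '' or $suffix{$bare};   # skip suffix words
        return 0 unless $text =~ /\b\Q$bare\E\b/i;
    }
    return 1;
}

# Made-up sample input.
my $file = '<font size="2">Fred Joe Inc Software Company</font> '
         . '<font>Software Fred Joe Co.</font> '
         . '<font>Joe Smith Software</font>';

# In list context, m//g returns the captured text of every font element.
my @hits = grep { has_all_words($_, 'Joe Fred Software Inc.') }
           $file =~ /<font[^>]*>(.*?)<\/font>/gis;

print "Found: $_\n" for @hits;
```

    With this sample input the first two elements are reported and "Joe Smith Software" is not, because it lacks "Fred".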

    -Mark

Re: Regular Expression Help
by Django (Pilgrim) on Sep 05, 2002 at 16:52 UTC

    You could approach that problem from at least two sides:

    • Find a pattern that tolerates all variations, but is exact enough to exclude anything you don't want to find. Or:
    • Make a hash with all possible variations as keys and check if your match exists in the hash.
    The first approach might work if you can extract a common, unambiguous string from all variations. For example, if "Fred Joe" works as such an identifier, you could just look for /<font.*?>(.*?Fred Joe.*?)<\/font>/ig
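    A minimal sketch of the identifier idea, using made-up sample data: it first captures the text of every font element and then filters with grep. Doing it in two steps is a design choice of this sketch. With a single pattern, the non-greedy (.*?Fred Joe.*?) can still run across a closing </font> tag when an earlier element does not contain the identifier, so the capture may include markup from a neighbouring element.

```perl
use strict;
use warnings;

# Made-up sample input: three font elements, two of which mention Fred Joe.
my $file = '<font>Acme Ltd</font>'
         . '<font color="red">Software Fred Joe Co.</font>'
         . '<FONT>Fred Joe Inc Software Company</FONT>';

# Step 1: capture the text of every font element. In list context, m//g
# returns all captures; /s lets "." match newlines inside an element.
my @texts = $file =~ /<font[^>]*>(.*?)<\/font>/gis;

# Step 2: keep only the texts that contain the identifier.
my @matches = grep { /Fred Joe/i } @texts;

print "$_\n" for @matches;
```

    Note that the loop prints every match, whereas an "if" around a single m//g test only reports the first one.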

    The second approach only makes sense when you know all variations. It could look like the following:

    my %Freds = (
        'Fred Joe Inc Software Company' => 1,
        'Software Fred Joe Co.'         => 1,
        # ...
    );
    my @Texts = $file =~ /<font.*?>(.*?)<\/font>/ig;
    foreach (@Texts) {
        print $_ if $Freds{$_};
    }
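    A possible refinement, sketched under assumptions: if the hash keys are normalized before lookup, one entry covers every ordering of the same words, so only the distinct word sets need to be listed. The helper canon and the sample data below are hypothetical and not part of the original reply.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a name to an order-independent key by
# lower-casing it, removing punctuation, and sorting its words.
sub canon {
    my ($name) = @_;
    (my $clean = lc $name) =~ s/[^\w\s]//g;
    my @words = split ' ', $clean;
    return join ' ', sort @words;
}

# One entry per set of words; the word order no longer matters.
my %Freds;
$Freds{ canon($_) } = 1 for (
    'Joe Fred Software Inc.',
    'Software Fred Joe Co.',
);

# Made-up sample input.
my $file = '<font>Inc Software Fred Joe</font>'
         . '<font>Joe Fred Co Software</font>'
         . '<font>Joe Smith Software Inc</font>';

my @Texts = $file =~ /<font[^>]*>(.*?)<\/font>/gis;
my @found = grep { $Freds{ canon($_) } } @Texts;

print "$_\n" for @found;
```

    Here the first two elements are found even though neither ordering appears literally in the hash, while "Joe Smith Software Inc" is rejected.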

    ~Django
    "Why don't we ever challenge the spherical earth theory?"

Re: Regular Expression Help
by MZSanford (Curate) on Sep 05, 2002 at 15:49 UTC
Re: Regular Expression Help
by Basilides (Friar) on Sep 05, 2002 at 16:16 UTC
    Do you want to match on parts of the string, e.g. just "Joe Fred", as well as combinations involving *all* the words? If so, here's a version which implements japhy's wicked combination routine from a query of mine here:
    my $file = "xxxxxxxxx Joe xxxxxxx Fred xxxxxxxx Joe Fred Software xxxxxxx Inc Software Fred Joe";
    my $string = "Joe Fred Software Inc";
    my @n = split / /, $string;
    my $max = 2**@n;
    # Each value of $i is a bitmask selecting a subset of the words in @n.
    # Start at 1: mask 0 selects no words, and an empty pattern would always match.
    for (my $i = 1; $i < $max; ++$i) {
        my ($bits, @used) = $i;
        while ($bits) {
            my $high_bit = int(log($bits)/log(2));   # index of the highest set bit
            push @used, $n[$high_bit];               # words are added highest index first
            $bits &= ~(2**$high_bit);                # clear that bit
        }
        my $match = join " ", @used;
        print "$match\n" if $file =~ m/\Q$match\E/;
    }
    Hope that's the kind of thing you wanted.