Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

How to match any of the strings below. In the phrase 'publication date', there is a typo. How can I match all the typos also in the matching string?
    #!/usr/bin/perl
    while (<DATA>) {
        if ($_ =~ m/publication date/) {
            print $_;
        }
    }
    __DATA__
    These are the dates.
    The publication date is 20-10-09
    The publicatiion dateee is 20-05-08
    The publication dat is 10-06-07
    The publication date's are 20-10-09,29-02-08

Re: Match typo
by almut (Canon) on Sep 11, 2009 at 05:43 UTC

    Unfortunately, this is a difficult problem... because what you want is fuzzy matching, and regex engines don't make it easy to do fuzzy matching.

    You could try a different approach, for example, split the string into appropriate segments and then calculate some similarity measure to the target words, and accept/reject based on some threshold. There are some modules out there that might assist you with that, e.g. String::Similarity or Text::Levenshtein.
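
    A minimal sketch of the split-and-threshold idea, using only core Perl rather than String::Similarity or Text::Levenshtein. It slides a two-word window over each line and scores the window against the target with a Dice coefficient over character bigrams. The helper names (bigrams, dice, fuzzy_phrase_match) and the 0.8 threshold are assumptions made for illustration, and the threshold would need tuning on real data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the character bigrams of a string (case-insensitive).
sub bigrams {
    my ($s) = @_;
    $s = lc $s;
    my %count;
    $count{ substr($s, $_, 2) }++ for 0 .. length($s) - 2;
    return \%count;
}

# Dice coefficient: 2 * shared bigrams / total bigrams, between 0 and 1.
sub dice {
    my ($x, $y) = @_;
    my ($bx, $by) = (bigrams($x), bigrams($y));
    my ($shared, $total) = (0, 0);
    for my $g (keys %$bx) {
        my $other = $by->{$g} || 0;
        $shared += $bx->{$g} < $other ? $bx->{$g} : $other;
    }
    $total += $_ for (values(%$bx), values(%$by));
    return $total ? 2 * $shared / $total : 0;
}

# True if any two consecutive words are similar enough to $target.
sub fuzzy_phrase_match {
    my ($line, $target, $threshold) = @_;
    my @words = split ' ', $line;
    for my $i (0 .. $#words - 1) {
        my $window = join ' ', @words[$i, $i + 1];
        return 1 if dice($window, $target) >= $threshold;
    }
    return 0;
}

my @lines = (
    "These are the dates.",
    "The publication date is 20-10-09",
    "The publicatiion dateee is 20-05-08",
    "The publication dat is 10-06-07",
    "The publication date's are 20-10-09,29-02-08",
);

for my $line (@lines) {
    print "$line\n" if fuzzy_phrase_match($line, 'publication date', 0.8);
}
```

    With this threshold the four "publication date" variants are printed and "These are the dates." is not. Lowering the threshold accepts more distant typos but also more unrelated phrases.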

Re: Match typo
by Your Mother (Archbishop) on Sep 11, 2009 at 06:46 UTC

    In addition to almut's suggestions you might look at String::Approx. I've used it in the distant past and liked it. I have no idea if it's still as good as the other options for what you're trying to do.

      How can I check whether the words "publish" and "date" both exist in a line? I have to print "matched" if both publish and date exist in the line.

          while (<DATA>) {
              $_ =~ m/publish|date/;
          }
          __DATA__
          The book was published on the date "20-08-2009".

        If you want an exact match, that's easy:

            while (<DATA>) {
                print "matched\n" if /publish/ and /date/;
            }

        But note that this would also match "the date was published on"...

        If the order in which the words must occur is relevant, you could do

            while (<DATA>) {
                print "matched\n" if /publish.*date/;
            }
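
        To see the difference between the two tests side by side, here is a small sketch. The classify helper and the second and third sample lines are made up for illustration; only the first sample comes from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# 'ordered' if "publish" is followed later by "date",
# 'both' if both words occur in any order, otherwise 'none'.
sub classify {
    my ($line) = @_;
    return 'ordered' if $line =~ /publish.*date/;
    return 'both'    if $line =~ /publish/ and $line =~ /date/;
    return 'none';
}

my @samples = (
    'The book was published on the date "20-08-2009".',
    'The date was published on the cover.',
    'The book was published last year.',
);

printf "%-8s %s\n", classify($_), $_ for @samples;
```

        The first sample is reported as ordered, the second as both, and the third as none.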
Re: Match typo
by ikegami (Patriarch) on Sep 11, 2009 at 05:53 UTC

    How to match any of the strings below.

    Match any:

        #!/usr/bin/perl
        while (<DATA>) {
            chomp;
            if ($. == 1) {
                print "'$_' matched\n";
            }
        }

    Match all:

        #!/usr/bin/perl
        while (<DATA>) {
            chomp;
            if (1) {
                print "'$_' matched\n";
            }
        }

    You'll probably have to be more specific, for example by saying what you do not want to match.

    How can I match all the typos

    Start by making a list of all the typos?

    You might also be interested in finding the distance between the string being tested and a reference string. That's the number of insertions, changes and deletions needed to turn one of the strings into the other. The lower the distance, the more similar they are.
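
    A minimal sketch of that distance (commonly called the Levenshtein or edit distance) in core Perl. The edit_distance name is an assumption for illustration; Text::Levenshtein on CPAN provides the same measure if a module is preferred.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Number of single-character insertions, changes and deletions
# needed to turn $s into $t, computed one table row at a time.
sub edit_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my $n = scalar @t;
    my @prev = (0 .. $n);              # distance from '' to each prefix of $t
    for my $i (1 .. scalar @s) {
        my @cur = ($i);                # distance from prefix of $s to ''
        for my $j (1 .. $n) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                      # change or match
            $best = $prev[$j] + 1    if $prev[$j] + 1 < $best;     # deletion
            $best = $cur[$j - 1] + 1 if $cur[$j - 1] + 1 < $best;  # insertion
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

my $reference = 'publication date';
for my $candidate ('publication date', 'publicatiion dateee',
                   'publication dat', "publication date's") {
    printf "%d  %s\n", edit_distance($reference, $candidate), $candidate;
}
```

    For the phrases from the question this prints distances of 0, 3, 1 and 2, so a cutoff of around 3 would accept all of them. Where to put the cutoff is a judgment call.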

Re: Match typo
by JavaFan (Canon) on Sep 11, 2009 at 10:46 UTC
    At what point does a typo stop being a typo and become a different word?
Re: Match typo
by bichonfrise74 (Vicar) on Sep 11, 2009 at 17:41 UTC
    Another possible solution would be to look at Text::Soundex, although I have not used it yet.
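
    To give an idea of what a Soundex comparison does, here is a hand-rolled sketch of the classic American Soundex code in core Perl. It is not the Text::Soundex API; the soundex_simple name is an assumption for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simplified American Soundex: first letter plus three digits.
sub soundex_simple {
    my ($word) = @_;
    my $w = uc $word;
    $w =~ tr/A-Z//cd;                      # keep letters only
    return '' unless length $w;
    my $first = substr $w, 0, 1;

    # Vowels and Y become 0, H and W become 9, consonants get their class digit.
    (my $codes = $w) =~ tr/AEIOUYHWBFPVCGJKQSXZDTLMNR/00000099111122222222334556/;

    # H and W after the first letter do not separate equal codes.
    my $head = substr $codes, 0, 1;
    my $tail = substr $codes, 1;
    $tail =~ tr/9//d;
    $codes = $head . $tail;

    $codes =~ tr/0-69//s;                  # collapse runs of the same code
    substr($codes, 0, 1) = '';             # the first letter is kept as a letter
    $codes =~ tr/0//d;                     # vowels carry no code
    return substr($first . $codes . '000', 0, 4);
}

printf "%-14s %s\n", $_, soundex_simple($_)
    for qw(publication publicatiion date dateee dat);
```

    Both spellings of "publication" come out as P142, and "date", "dateee" and "dat" all come out as D300. The code is coarse, though: "publish" is also P142 and "data" is also D300, so false positives are to be expected.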
Re: Match typo
by ww (Archbishop) on Sep 11, 2009 at 11:16 UTC

    Terrible spec: title and body content conflict, and then change, repeatedly, in later nodes.

    Working from the title, however,

        #!/usr/bin/perl
        use strict;
        use warnings;
        # 794700

        while (<DATA>) {
            # if ($_ =~ m/\bpublication date\b/) {
            if ($_ =~ m/\bpublication date\s/) {
                print $_;
            }
        }
        __DATA__
        These are the dates.
        The publication date is 20-10-09
        The publicatiion dateee is 20-05-08
        The publication dat is 10-06-07
        The publication date's are 20-10-09,29-02-08

    Output with \b after "date" (incorrectly accepts "date's" which should be "dates" in this --plural-- context.)

        perl 794700.pl
        The publication date is 20-10-09
        The publication date's are 20-10-09,29-02-08

    Output with \s after "date"

        perl 794700.pl
        The publication date is 20-10-09

    If one wishes to accept both "date" and "dates", e.g.

        The publication date is 20-10-09
        The publication dates are 20-10-09, 29-02-08


    alternation could be an answer:

    if($_ =~ m/\bpublication date\s|\bpublication dates\s/){

    Please see perldoc perlre (perlre) and company.
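
    As a side note, the same alternation can probably be written more compactly with an optional "s". A sketch, where the sample lines are the thread's data plus the "dates" line from the example above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# "dates?" accepts "date" or "dates"; \s then requires whitespace,
# so "date's" and "dat" are still rejected.
my $re = qr/\bpublication dates?\s/;

my @lines = (
    "The publication date is 20-10-09",
    "The publication dates are 20-10-09, 29-02-08",
    "The publication date's are 20-10-09,29-02-08",
    "The publication dat is 10-06-07",
    "The publicatiion dateee is 20-05-08",
);

for my $line (@lines) {
    print "$line\n" if $line =~ $re;
}
```

    Only the first two lines are printed.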

Re: Match typo
by graff (Chancellor) on Sep 11, 2009 at 23:20 UTC
    If you have (or get) the GNU Aspell library, you can use Text::Aspell to locate words that would be considered typos according to the Aspell dictionary:
        use strict;
        use Text::Aspell;

        my $suggest = ( @ARGV and $ARGV[0] eq '-s' ) ? shift : '';
        my $speller = Text::Aspell->new or die "dang it\n";

        while (<>) {
            for my $word ( grep /^[a-z]+$/i, split ) {
                next if ( $speller->check( $word ));  # skip if word is in dictionary
                print "misspelled word: $word\n";
                if ( $suggest ) {
                    my @suggestions = $speller->suggest( $word );
                    my $advice = ( @suggestions )
                        ? join( " ", "Might be one of:", @suggestions )
                        : "No idea what that should be.";
                    print "  $advice\n\n";
                }
            }
        }
    There's even a way to specify a word list of your own choice to serve as the "dictionary" -- there's some info about this in the Text::Aspell manual.