Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

I want to compare 2 strings and extract the longest common substring based on words.

I have used

    my $seg1 = "The man who likes reading books and writing poems.";
    my $seg2 = "The man who likes reading big books and poems.";
    lcss( $seg1, $seg2 );

but the result is based on characters, not words. The output is:

The man who likes reading b

Is there any algorithm in Perl that returns

The man who likes reading?

I would also be happy with something like:

The man who likes reading * * * poems

Replies are listed 'Best First'.
Re: Find substring based on words and not in characters
by Corion (Patriarch) on Dec 02, 2014 at 16:14 UTC

    You can feed Algorithm::Diff words instead of lines, and its LCS function will return the longest common subsequence.
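To make the word-level idea concrete, here is a minimal, dependency-free sketch of a longest common subsequence over words. The helper name word_lcs and the whitespace-based splitting are assumptions for illustration only; Algorithm::Diff's LCS, given two array references, does this work for you. Note that a subsequence may skip words, so the result is not necessarily a contiguous run.

```perl
use strict;
use warnings;

# Hypothetical helper: longest common subsequence of two word lists,
# computed with the classic dynamic-programming table.
sub word_lcs {
    my ( $x, $y ) = @_;    # array refs of words
    my @len;               # $len[$i][$j] = LCS length of $x->[$i..] and $y->[$j..]
    $len[$_] = [ (0) x ( @$y + 1 ) ] for 0 .. scalar @$x;
    for my $i ( reverse 0 .. $#$x ) {
        for my $j ( reverse 0 .. $#$y ) {
            if ( $x->[$i] eq $y->[$j] ) {
                $len[$i][$j] = $len[ $i + 1 ][ $j + 1 ] + 1;
            }
            else {
                my ( $down, $right ) = ( $len[ $i + 1 ][$j], $len[$i][ $j + 1 ] );
                $len[$i][$j] = $down >= $right ? $down : $right;
            }
        }
    }

    # Walk the table from the start of both lists to recover the words.
    my ( $i, $j, @out ) = ( 0, 0 );
    while ( $i < @$x and $j < @$y ) {
        if ( $x->[$i] eq $y->[$j] ) { push @out, $x->[$i]; $i++; $j++ }
        elsif ( $len[ $i + 1 ][$j] >= $len[$i][ $j + 1 ] ) { $i++ }
        else                                               { $j++ }
    }
    return @out;
}

# Assumption: a "word" is anything delimited by whitespace.
my @words1 = split ' ', "The man who likes reading books and writing poems.";
my @words2 = split ' ', "The man who likes reading big books and poems.";
my $common = join ' ', word_lcs( \@words1, \@words2 );
print "$common\n";    # The man who likes reading books and poems.
```

Because the words "big" and "writing" are simply skipped, this is closer to the OP's second wish (the version with gaps) than to a strict longest common substring.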

Re: Find substring based on words and not in characters (Updated.)
by BrowserUk (Patriarch) on Dec 02, 2014 at 17:06 UTC

    Update: Added minor optimisation. Update2: Rolled the optimisation into the while loop.

    Something like this?:

    #! perl -slw
    use strict;

    my $seg1 = "The man who likes reading books and writing poems.";
    my $seg2 = "The man who likes reading big books and poems.";

    my $best = '';
    while( length( $seg1 ) > length( $best ) ) {
        # From each word start, capture the longest run that ends on a word boundary.
        while( $seg1 =~ m[(?!\s)(?=(\b.+\b)(?!\s))]g ) {
            my $bit = $1;
            $best = $bit if $seg2 =~ m[\Q$bit] and length( $bit ) > length( $best );
        }
        # Drop the last word of $seg1 and try again with the shorter string.
        $seg1 =~ s[(?:\s|^)\S+$][];
    }
    print $best;

    __END__
    [17:04:21.46] C:\test>junk39
    The man who likes reading

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I'd compare word count instead of length.
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
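One way to read the word-count suggestion is to work on word lists directly and keep the longest run of consecutive matching words. The following is only a sketch under stated assumptions: the helper name longest_common_words is made up for illustration, and a "word" is taken to be anything delimited by whitespace, so trailing punctuation stays attached to its word.

```perl
use strict;
use warnings;

# Hypothetical helper: longest run of consecutive words shared by both
# strings, ranked by word count rather than character length.
sub longest_common_words {
    my ( $s1, $s2 ) = @_;
    my @x = split ' ', $s1;    # assumption: whitespace-delimited words
    my @y = split ' ', $s2;

    my ( $best_len, $best_end ) = ( 0, 0 );
    my @prev = (0) x ( @y + 1 );    # run lengths ending at the previous word of @x
    for my $i ( 1 .. scalar @x ) {
        my @cur = (0) x ( @y + 1 );
        for my $j ( 1 .. scalar @y ) {
            next unless $x[ $i - 1 ] eq $y[ $j - 1 ];
            $cur[$j] = $prev[ $j - 1 ] + 1;    # extend the run ending one word earlier
            ( $best_len, $best_end ) = ( $cur[$j], $i ) if $cur[$j] > $best_len;
        }
        @prev = @cur;
    }

    # Slice out the winning run; an empty range yields an empty string.
    return join ' ', @x[ $best_end - $best_len .. $best_end - 1 ];
}

my $run = longest_common_words(
    "The man who likes reading books and writing poems.",
    "The man who likes reading big books and poems.",
);
print "$run\n";    # The man who likes reading
```

Since whole words are compared with eq, a partial match such as the stray "b" from "books"/"big" cannot appear in the result.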

        The OP certainly could go that way, but I see two problems with it:

        1. Deciding upon a definition for a "word".
        2. Are several short words more meaningful than fewer, longer ones?

          E.g. "on the way to" -v- "the Riechstag Bureau"?
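To illustrate the second point, this small sketch ranks the two example phrases both ways. The helper word_count is hypothetical and assumes whitespace-delimited words, which is exactly the kind of definition the first point says has to be chosen.

```perl
use strict;
use warnings;

# Hypothetical helper: count whitespace-delimited words.
sub word_count { my @w = split ' ', shift; return scalar @w }

my @candidates = ( "on the way to", "the Riechstag Bureau" );

# Pick the top candidate under each ranking.
my ($by_words)  = sort { word_count($b) <=> word_count($a) } @candidates;
my ($by_length) = sort { length($b) <=> length($a) } @candidates;

print "by word count: $by_words\n";     # on the way to (4 words vs 3)
print "by length:     $by_length\n";    # the Riechstag Bureau (20 chars vs 13)
```

The two rankings disagree on this pair, so which one is "best" depends on what the result is meant to capture.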


        The OP states he is looking for the longest string; he wants to avoid getting parts of words, such as the trailing 'b' that the lcss algorithm finds.

        1 Peter 4:10

      Hi,

      Thanks a lot for your code! It works perfectly :)