in reply to Re: Match and Extract String with Regex
in thread Match and Extract String with Regex

Searching for (Longest Common Subsequence) should reveal nodes on the topic.
... only if they understand how/why Longest Common Subsequence is a generalisation of Longest Common Substring.

While it's an interesting area in which to educate oneself, String::LCSS does exactly what you need.

-David

  • Comment on Re^2: Match and Extract String with Regex

Replies are listed 'Best First'.
Re^3: Match and Extract String with Regex
by lima1 (Curate) on Nov 17, 2007 at 16:58 UTC
    String::LCSS does exactly what you need.

    I don't want to bash String::LCSS, but the implementation seems to be the naive O(n^3) algorithm instead of the O(mn) dynamic programming solution (http://en.wikipedia.org/wiki/Longest_common_substring_problem). A quick and dirty (and not thoroughly tested) implementation is much faster (although probably buggy).

    sub lcss2 { my ($s, $t) = @_; my $z = 0; my $m = length $s; my $n = length $t; my @S = (undef, split(//, $s)); my @T = (undef, split(//, $t)); my @L; my @ret; for my $i ( 1 .. $m ) { for my $j ( 1 .. $n ) { if ($S[$i] eq $T[$j]) { $L[$i-1][$j-1] ||= 0; $L[$i][$j] = $L[$i-1][$j-1] + 1; if ($L[$i][$j] > $z) { $z = $L[$i][$j]; @ret = (); } if ($L[$i][$j] == $z) { push @ret,substr($s, ($i-$z), $z); } } } } # warn Dumper \@L; return join '*', @ret; }
    my $s1 = '6'x 200 . 'zyzxx'; my $s2 = '5'x 200 . 'abczyzefg'; my $count = 1; timethese($count, { 'String::LCSS' => sub { String::LCSS::lcss( $s1, $s2 ) }, 'dynprog' => sub { lcss2( $s1, $s2 )}, });
    Update: Took the opportunity to learn XS and wrote String::LCSS_XS.
      Ah, well spotted.

      I don't suppose you'd like to file a bug? ;)

      -David