Re: Common Substrings

After some benchmarking...

I will use the xor approach (fast!). I stand corrected about substr(), it is much faster than unpack, so I will also use it. Last but not least, using pos() will help as well (even if the tests were ambiguous between length() and pos(), to my mind the code gains in expressiveness).

Thanks to you all!

For completeness, I post my sub below. It is the backend of another that creates relative paths for a fixlinks util.

After reading the chapter on unicode of the camel book, I conclude that this sub will do the rigth thing whatever locale, since Perl strings are either latin1 or utf8 encoded, and these are both ascii transparent, the requirement to look for a slash after deciding $pos does the trick. However, it is a shame that it cannot be generalized to a common substring function like the one I presented above.

sub SLASH() { 47 }

sub _common_path {
   if ( $_[0] eq $_[1] ) {
      return(
         $_[0],      # all path components are common
         "",         # nothing remains of first
         "",         # nothing remains of second
      );
   } else {
      use bytes;
      my ( $len0, $len1 ) = ( length($_[0]), length($_[1]) );
      # find the offset of the first byte that differs
      my $pos = $_[0] ^ $_[1];
      $pos =~ m/[^\x00]/g;
      $pos = pos($pos) - 1;
      # if some bytes are common but the last one wasn't the separator
      # we must decide which path components are common
      if ( $pos > 0 && vec($_[0], ($pos - 1), 8) != SLASH ) {
         # check if first path is just longer than the second
         if ( $pos == $len1 && vec($_[0], $pos, 8) == SLASH ) {
            $pos++;
            return(
               substr($_[0], 0, $pos),    # common path with slash
               substr($_[0], $pos),       # extra in first
               "",                        # nothing remains of second
            );
         }
         # check if second path is just longer than the first
         if ( $pos == $len0 && vec($_[1], $pos, 8) == SLASH ) {
            $pos++;
            return(
               substr($_[1], 0, $pos),    # common path with slash
               "",                        # nothing remains of first
               substr($_[1], $pos),       # extra in second
            );
         }
         # otherwise, rewind until last common path component
         while ( $pos > 0 ) {
            $pos--;
            if ( vec($_[0], $pos, 8) == SLASH ) {
               $pos++;                    # and keep the common slash
               last;
            }
         }
      }
      return(
         substr($_[0], 0, $pos), # common path components (with slash)
         substr($_[0], $pos),    # extra in first
         substr($_[1], $pos),    # extra in second
      );
   }
}
[download]

Comment on Re: Common Substrings Download Code

Replies are listed 'Best First'.
Re^2: Common Substrings by salva (Canon) on Nov 15, 2005 at 15:32 UTC
The scalars returned by `substr` inside a block where `use bytes` holds, never have the utf8 flag set. For instance: `$ perl -de 1 ... DB<43> $a="\x{1234}/foo" DB<44> x ord substr $a, 0, 1 0 4660 DB<45> sub bsubstr { use bytes; substr $_[0], $_[1], $_[2] } DB<46> x ord bsubstr $a, 0, 1 0 225 DB<47> x ord bsubstr $a, 1, 1 0 136 DB<48> x ord bsubstr $a, 2, 1 0 180 DB<49> x ord bsubstr $a, 3, 1 0 47` [download]	[reply] [d/l] [select]
Re^3: Common Substrings by Anonymous Monk on Nov 15, 2005 at 15:39 UTC
Does that mean that I simply need `no bytes;` before returning the values?	[reply] [d/l]
Re^4: Common Substrings by salva (Canon) on Nov 15, 2005 at 16:00 UTC
no, because then, the offsets would be wrong... maybe you could use the `utf8::upgrade` function to mark the returned strings as unicode but your code would become really ugly and unmaintainable. If you are looking for common file paths why don't you use something simpler like: `sub diffpath_ix { my ($a, $b) = @_; my $last = 0; while ( $a=~m{(\G[^/]*/)}g ) { if ( substr($b, $last, length $1) eq $1 ) { $last = pos $a } else { return $last } } if (substr($a, $last) eq substr($b, $last)) { return length $a } return $last; }` [download]	[reply] [d/l] [select]


Just another Perl shrine
	PerlMonks