TravelAddict has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I need to compare a long list of US English sentences with the same list of sentences in UK English. The goal is to build a dictionary of the differing words/expressions, for example "color" vs. "colour". The difficulty is that there is not always a one-to-one correspondence: one word in one variant might be more than one word in the other, for example "round trip (ticket)" vs. "return (ticket)". I feel that I'm reinventing the wheel, especially since I've seen some diff modules available; however, they compare and flag whole lines instead of just the parts that differ. Is there a module or an easy way to compare the strings and extract only what's different? Thanks!

Re: Comparing strings to extract the different words/expressions
by Anonymous Monk on Jun 09, 2015 at 22:14 UTC

    Algorithm::Diff will operate on array elements. I use it to colorize differences inside lines.

    If you post some test cases, I can show you how it works.

    Here's my simple colorizer, if that helps.

    #!/usr/bin/perl
    use Algorithm::Diff qw(traverse_sequences);
    use Term::ANSIColor;
    use strict;
    use warnings;

    my @from = split //, shift // 'this is the left string';
    my @to   = split //, shift // 'this is the right string';

    traverse_sequences(
        \@from, \@to,
        {
            MATCH     => sub { print $from[shift()] },
            DISCARD_A => sub { print color('red'), $from[shift()], color 'reset' },
            DISCARD_B => sub { print color('green'), $to[pop()], color 'reset' },
        }
    );
    print "\n";

Re: Comparing strings to extract the different words/expressions
by Athanasius (Archbishop) on Jun 13, 2015 at 15:43 UTC

    Hello TravelAddict,

    I’ve been thinking on and off about this problem, and it occurred to me that you can leverage a module like String::Diff or Algorithm::Diff if you map the words in your sentences onto character tokens and then apply one of the standard string diff methods to the resulting token strings. Here is a proof of concept:

    #! perl
    use strict;
    use warnings;
    use Data::Dump;
    use String::Diff 'diff';

    my @pairs = (
        [
            'This apple is colored red.',
            'This apple is coloured red.',
        ],
        [
            'I need to buy a round trip ticket.',
            'I need to buy a return ticket.',
        ],
        [
            'Jack rode the elevator to the top floor.',
            'Jack took the lift to the top floor.',
        ],
    );

    my %diffs;

    for my $pair (@pairs)
    {
        my (@sent, %ids, @seq);
        my $id = 33;
        my ($seq0, $seq1);

        for my $i (0, 1)
        {
            $sent[$i] = $pair->[$i] =~ s/'s?\b//gr;    # Remove possessives
            $sent[$i] =~ s/[-[\].,;:'"(){}<>]//g;      # Remove all other punctuation

            for (split /\s+/, $sent[$i])
            {
                $ids{$_} //= $id++;
                $seq[$i] .= chr($ids{$_});
            }
        }

        my %lookup = reverse %ids;
        my ($old, $new) = diff($seq[0], $seq[1]);
        my @old = $old =~ /\[(.+?)\]/g;
        my @new = $new =~ /\{(.+?)\}/g;

        while (@old && @new)
        {
            my $o = join(' ', map { $lookup{ord $_} } split(//, shift @old));
            my $n = join(' ', map { $lookup{ord $_} } split(//, shift @new));
            $diffs{$o} = $n;
        }
    }

    dd \%diffs;

    Output:

    1:42 >perl 1269_SoPW.pl
    {
      "colored"    => "coloured",
      "elevator"   => "lift",
      "rode"       => "took",
      "round trip" => "return",
    }

    1:42 >
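The word-to-character mapping used in the proof of concept above can be looked at in isolation. What follows is only a minimal sketch in core Perl; the helper names `tokenize_pair` and `detokenize` are hypothetical and not part of String::Diff. It starts IDs at code point 33 as the proof of concept does, so it assumes a sentence pair has few enough distinct words that no ID reaches one of the characters the regexes above treat as diff markers (`[`, `]`, `{`, `}`).

```perl
use strict;
use warnings;

# Hypothetical helper: map each distinct word of a sentence pair to a single
# character, starting at code point 33 ('!'). Returns the word => ID table
# followed by one token string per sentence.
sub tokenize_pair {
    my @sentences = @_;
    my (%ids, @seqs);
    my $next = 33;
    for my $sentence (@sentences) {
        my $seq = '';
        for my $word (split ' ', $sentence) {
            $ids{$word} //= $next++;    # same word, same character
            $seq .= chr $ids{$word};
        }
        push @seqs, $seq;
    }
    return (\%ids, @seqs);
}

# Hypothetical helper: turn a run of token characters back into words.
sub detokenize {
    my ($ids, $tokens) = @_;
    my %word_for = reverse %$ids;
    return join ' ', map { $word_for{ord $_} } split //, $tokens;
}

my ($ids, $us, $uk) = tokenize_pair(
    'I need to buy a round trip ticket',
    'I need to buy a return ticket',
);
print "$us\n$uk\n";
print detokenize($ids, substr($us, 5, 2)), "\n";    # the two tokens unique to the US string
```

Because every word becomes exactly one character, a character-level diff of the two token strings is effectively a word-level diff of the sentences, and a changed run of characters maps back to a multi-word expression such as "round trip".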

    Hope that helps,

    Athanasius <°(((>< contra mundum
    Iustus alius egestas vitae, eros Piratica,

Re: Comparing strings to extract the different words/expressions
by Anonymous Monk on Jun 13, 2015 at 21:06 UTC
    #!/usr/bin/perl
    # http://perlmonks.org/?node_id=1129709
    use Algorithm::Diff qw(traverse_sequences);
    use strict;
    use warnings;

    my (@old, @new, %dictionary);

    sub addtodictionary
    {
        @old || @new and $dictionary{"@old"}{"@new"}++, @old = @new = ();
    }

    while (<DATA>)
    {
        my @from = split;
        my @to   = split ' ', <DATA> // 'unmatched line';
        traverse_sequences(
            \@from, \@to,
            {
                MATCH     => sub { addtodictionary() },
                DISCARD_A => sub { push @old, $from[shift()] },
                DISCARD_B => sub { push @new, $to[pop()] },
            }
        );
        addtodictionary();
    }

    use YAML;
    print Dump \%dictionary;

    __DATA__
    This apple is colored red.
    This apple is coloured red.
    I need to buy a round trip ticket.
    I need to buy a return ticket.
    Jack rode the elevator to the top floor.
    Jack took the lift to the top floor.

    produces (note that this mapping *can* be "one to many")

    ---
    colored:
      coloured: 1
    elevator:
      lift: 1
    rode:
      took: 1
    round trip:
      return: 1
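One thing to watch with the script above: it splits on whitespace only, so a differing word at the end of a sentence would be recorded with its punctuation attached ("color." rather than "color"). A minimal sketch of a word splitter that trims punctuation first is shown here; the helper name `words` is hypothetical, and the choice to keep internal hyphens and apostrophes is an assumption that may not suit every word list.

```perl
use strict;
use warnings;

# Hypothetical helper: split a sentence into words, trimming punctuation
# from both ends of each word and dropping a trailing possessive 's.
# Internal hyphens and apostrophes ("round-trip", "don't") are kept.
sub words {
    my ($sentence) = @_;
    my @words;
    for my $word (split ' ', $sentence) {
        $word =~ s/^\W+|\W+$//g;    # trim leading/trailing punctuation
        $word =~ s/'s?$//;          # drop a trailing possessive
        push @words, $word if length $word;
    }
    return @words;
}

print join('|', words(q{Jack's favorite color, "red."})), "\n";
```

In the script above, `my @from = split;` and `split ' ', <DATA>` could then be replaced by calls to such a helper, so that the dictionary keys contain bare words.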
      Hi, Wow! This is fantastic, exactly what I was looking for! Thanks very much! I think it will be quite easy to adapt to read the strings from external files and format the output, but this is really great! Thanks again for your help! :-)
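For the adaptation mentioned above (reading the sentences from external files and formatting the output), a minimal sketch in core Perl might look like this. The helper names `read_pairs` and `dictionary_lines` are hypothetical, the two files are assumed to hold one sentence per line in the same order, and the diffing step itself is left out because the script above already covers it. In-memory filehandles stand in for the real files here.

```perl
use strict;
use warnings;

# Hypothetical helper: read two filehandles in parallel and return a list of
# [us_sentence, uk_sentence] pairs. Stops at the end of the shorter input.
sub read_pairs {
    my ($us_fh, $uk_fh) = @_;
    my @pairs;
    while (defined(my $us = <$us_fh>)) {
        my $uk = <$uk_fh>;
        last unless defined $uk;
        chomp($us, $uk);
        push @pairs, [$us, $uk];
    }
    return @pairs;
}

# Hypothetical helper: flatten a dictionary shaped like
# { us_phrase => { uk_phrase => count } } into "US<TAB>UK<TAB>count" lines.
sub dictionary_lines {
    my ($dictionary) = @_;
    my @lines;
    for my $us (sort keys %$dictionary) {
        for my $uk (sort keys %{ $dictionary->{$us} }) {
            push @lines, join "\t", $us, $uk, $dictionary->{$us}{$uk};
        }
    }
    return @lines;
}

# Stand-ins for the external files (with real files, open them by name instead).
my $us_text = "This apple is colored red.\nI need to buy a round trip ticket.\n";
my $uk_text = "This apple is coloured red.\nI need to buy a return ticket.\n";
open my $us_fh, '<', \$us_text or die "Cannot open US text: $!";
open my $uk_fh, '<', \$uk_text or die "Cannot open UK text: $!";

print "$_->[0] <=> $_->[1]\n" for read_pairs($us_fh, $uk_fh);

# A dictionary in the same shape the script above builds.
my %dictionary = (
    'colored'    => { 'coloured' => 1 },
    'round trip' => { 'return'   => 1 },
);
print "$_\n" for dictionary_lines(\%dictionary);
```

Tab-separated lines are only one possible output format; they have the advantage of loading directly into a spreadsheet for review.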