Re^2: Preserve original text formatting.

Oops!

Try replacing __DATA__ with:

To be or not to be?
Today Glastonbury tomorrow Brighton!
[download]

Output:

To be or not *to* *be*?
*To*day Glas*to*nbury *to*morrow Brigh*to*n!
[download]

:-)

Update: Worse (and incomprehensibly to me), replacing __DATA__ with:

To be or not?
Today Glastonbury tomorrow Brighton!
[download]

gives:

T**o** **b**e** **o**r** **n**o**t**?**
**T**o**d**a**y** **G**l**a**s**t**o**n**b**u**r**y** **t**o**m**o**r*
+*r**o**w**
 **B**r**i**g**h**t**o**n**!**
[download]

Comment on Re^2: Preserve original text formatting. Select or Download Code

Replies are listed 'Best First'.

Re^3: Preserve original text formatting.
by Athanasius (Cardinal) on Sep 11, 2015 at 02:48 UTC

Hello Not_a_Number,

Two excellent catches!

The first problem occurs because the regex is matching parts (substrings) of words. It can be fixed by adding a test for word boundaries (\b) before and after each word in the regex. The second problem occurs when there are no repeated words at all, in which case the regex becmes (?^i:()), which matches the empty string. It can be fixed by an explicit test. Here is a revised script:

#! perl
use strict;
use warnings;
use List::Util qw(any);

my $file = do { local $/; <DATA>; };        # Slurp the whole file int
+o a string

# Make a hash that maps each word to its word count in the file
my %words;
 ++$words{lc $_} for split /\W+/, $file;

# Construct a regular expression to match each word which appears at l
+east twice
my $re;

if (any { $_ > 1 } values %words)
{
    my $str = '\\b' . join('\\b|\\b', grep { $words{$_} > 1 } keys %wo
+rds) . '\\b';
    $re = qr{($str)}i;
}

$words{$_} = 0 for keys %words;             # Re-set the word counts t
+o zero

# Mark the second and subsequent occurrences of each word
$file =~ s{$re}{ $words{lc $1}++ ? "*$1*" : $1 }eg if $re;

print $file;
[download]

Thanks!

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

[reply]
[d/l]
[select]