in reply to Re: Preserve original text formatting.
in thread Preserve original text formatting.

Oops!

Try replacing __DATA__ with:

To be or not to be? Today Glastonbury tomorrow Brighton!

Output:

To be or not *to* *be*? *To*day Glas*to*nbury *to*morrow Brigh*to*n!

:-)

Update: Worse (and incomprehensibly to me), replacing __DATA__ with:

To be or not? Today Glastonbury tomorrow Brighton!

gives:

T**o** **b**e** **o**r** **n**o**t**?** **T**o**d**a**y** **G**l**a**s**t**o**n**b**u**r**y** **t**o**m**o**r* +*r**o**w** **B**r**i**g**h**t**o**n**!**

Replies are listed 'Best First'.
Re^3: Preserve original text formatting.
by Athanasius (Archbishop) on Sep 11, 2015 at 02:48 UTC

    Hello Not_a_Number,

    Two excellent catches!

    The first problem occurs because the regex is matching parts (substrings) of words. It can be fixed by adding a test for word boundaries (\b) before and after each word in the regex. The second problem occurs when there are no repeated words at all, in which case the regex becmes (?^i:()), which matches the empty string. It can be fixed by an explicit test. Here is a revised script:

    #! perl use strict; use warnings; use List::Util qw(any); my $file = do { local $/; <DATA>; }; # Slurp the whole file int +o a string # Make a hash that maps each word to its word count in the file my %words; ++$words{lc $_} for split /\W+/, $file; # Construct a regular expression to match each word which appears at l +east twice my $re; if (any { $_ > 1 } values %words) { my $str = '\\b' . join('\\b|\\b', grep { $words{$_} > 1 } keys %wo +rds) . '\\b'; $re = qr{($str)}i; } $words{$_} = 0 for keys %words; # Re-set the word counts t +o zero # Mark the second and subsequent occurrences of each word $file =~ s{$re}{ $words{lc $1}++ ? "*$1*" : $1 }eg if $re; print $file;

    Thanks!

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,