Re: Preserve original text formatting.

Hello larsb, and welcome to the Monastery!

Another approach is to modify the text of the file by using s///g to replace each repeated word with its marked version. The following script shows one way to do this (but it doesn’t take into account the maximum number of words allowed between repeats):

#! perl
use strict;
use warnings;

my $file = do { local $/; <DATA>; };        # Slurp the whole file into a string

# Make a hash that maps each word to its word count in the file
my %words;
++$words{lc $_} for split /\W+/, $file;

# Construct a regular expression to match each word which appears at least twice
my $str = join '|', grep { $words{$_} > 1 } keys %words;
my $re  = qr{($str)}i;

$words{$_} = 0 for keys %words;             # Reset the word counts to zero

# Mark the second and subsequent occurrences of each word
$file =~ s{$re}{ $words{lc $1}++ ? "*$1*" : $1 }eg;

print $file;

__DATA__
To be or not to be; that is to be the question.
Is that the question? Yes!

Output:

 0:13 >perl 1369_SoPW.pl
To be or not *to* *be*; that is *to* *be* the question.
*Is* *that* *the* *question*? Yes!

 0:13 >
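The limit on the number of words between repeats, which the script above ignores, could be handled along the following lines. This is only a sketch under stated assumptions: the sub name mark_nearby_repeats and its $max_gap parameter are made up for illustration, and the "gap" is assumed to mean the number of words strictly between two occurrences of the same word.

```perl
use strict;
use warnings;

# Hypothetical helper: mark a word only when the same word (ignoring case)
# last occurred with at most $max_gap other words in between.
sub mark_nearby_repeats {
    my ($text, $max_gap) = @_;
    my %last_seen;      # lower-cased word => index of its latest occurrence
    my $index = 0;      # running index of the current word

    $text =~ s{(\w+)}{
        my $word = $1;
        my $here = $index++;
        my $prev = $last_seen{lc $word};
        $last_seen{lc $word} = $here;
        (defined($prev) && $here - $prev - 1 <= $max_gap) ? "*$word*" : $word;
    }eg;

    return $text;
}

print mark_nearby_repeats("To be or not to be; that is to be the question.\n", 3);
```

With a gap of 3 the later occurrences of "to" and "be" are marked, since each has exactly three words between it and the previous occurrence. With a gap of 2 the line comes back unchanged.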

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^2: Preserve original text formatting.
by Not_a_Number (Prior) on Sep 10, 2015 at 18:07 UTC

Oops!

__DATA__

To be or not to be?
Today Glastonbury tomorrow Brighton!

Output:

To be or not *to* *be*?
*To*day Glas*to*nbury *to*morrow Brigh*to*n!

:-)

Update: Worse (and incomprehensibly to me), replacing the data after __DATA__ with:

To be or not?
Today Glastonbury tomorrow Brighton!

gives:

T**o** **b**e** **o**r** **n**o**t**?**
**T**o**d**a**y** **G**l**a**s**t**o**n**b**u**r**y** **t**o**m**o**r**r**o**w** **B**r**i**g**h**t**o**n**!**


Re^3: Preserve original text formatting.

by Athanasius (Cardinal) on Sep 11, 2015 at 02:48 UTC

Hello Not_a_Number,

Two excellent catches!

The first problem occurs because the regex matches substrings of words. It can be fixed by adding a word-boundary assertion (\b) before and after each word in the regex. The second problem occurs when there are no repeated words at all: the alternation is then empty, so the regex becomes (?^i:()), which matches the empty string at every position. It can be fixed by an explicit test. Here is a revised script:

#! perl
use strict;
use warnings;
use List::Util qw(any);

my $file = do { local $/; <DATA>; };        # Slurp the whole file into a string

# Make a hash that maps each word to its word count in the file
my %words;
++$words{lc $_} for split /\W+/, $file;

# Construct a regular expression to match each word which appears at least twice
my $re;

if (any { $_ > 1 } values %words)
{
    my $str = '\\b' . join('\\b|\\b', grep { $words{$_} > 1 } keys %words) . '\\b';
    $re = qr{($str)}i;
}

$words{$_} = 0 for keys %words;             # Reset the word counts to zero

# Mark the second and subsequent occurrences of each word
$file =~ s{$re}{ $words{lc $1}++ ? "*$1*" : $1 }eg if $re;

print $file;
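A different way to sidestep both problems, offered as an alternative sketch and not as a correction to the script above, is to match every word with a fixed pattern and decide inside the replacement whether that word has been seen before. Because \w+ is greedy it always consumes a whole word, so substrings are never marked. Because the pattern is not built from the data, it cannot collapse to an empty alternation. The sub name mark_repeats is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: mark the second and subsequent occurrences of each
# word (compared case-insensitively), leaving all other text untouched.
sub mark_repeats {
    my ($text) = @_;
    my %seen;           # lower-cased word => number of occurrences so far

    # The post-increment returns the old count, which is false only the
    # first time a given word is encountered.
    $text =~ s{(\w+)}{ $seen{lc $1}++ ? "*$1*" : $1 }eg;

    return $text;
}

print mark_repeats("To be or not to be?\nToday Glastonbury tomorrow Brighton!\n");
```

Like the split /\W+/ used earlier, this treats a word as a run of \w characters, so digits and underscores count as word characters and an apostrophe splits a word in two.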

Thanks!

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^2: Preserve original text formatting.
by larsb (Novice) on Sep 10, 2015 at 16:09 UTC

Thanks for the welcome and the data.

I have to admit, though, that this is way above my head. Maybe in a year or so I will understand. Very nice to see such short solutions compared to mine.
