Preserve original text formatting.

larsb has asked for the wisdom of the Perl Monks concerning the following question:

#!/usr/bin/perl
#This script matches words that are repeated within <num> words and ma
+rks them and prints out the original text with repeated words marked.
use strict;
use warnings;
print "Enter the name of the file to check: ";
my $inputfile = <>;
#Define how many words between matching a repeat.
my $numwords = 40;
open(TEXTFILE, $inputfile);
my @importedtextfile = <TEXTFILE>;
close(TEXTFILE);
my $counter = 0;
my @words;
#Split the sentences imported.
foreach (@importedtextfile) {
    my @tempsplit = split(/\s+/,  $_);
    foreach(@tempsplit) {
        push @words, $_;
    }
}
my @filteredlist = @words;
my $wordslength = scalar(@words);
$counter = 0;
foreach(@words) {
    my $i = 0;
    my $currentword = $_;
    $currentword =~ s/[\,\.]//g;
    #Start counting one word "right" of the one you are trying to matc
+h.
    my $startposition = $counter + 1;
    #Check <num> words and make sure you don't check empty strings in 
+array.
    while ($i < $numwords and $counter + $i < $wordslength) {
        if ($counter + $i + 1 < $wordslength) {
            my $matchword = $words[$startposition];
            $matchword =~ s/[\,\.]//g;
            #Match and replace in new array.
            if ($currentword =~ /\b$matchword\b/i) {
            $filteredlist[$startposition] = "*".$words[$startposition]
+."*";
            }
        }
        ++$startposition;
        ++$i;
    }
    ++$counter;
}
my $printedlist = join(" ", @filteredlist);
print "$printedlist\n";
[download]

This script reads a specified text-file from input and then checks for reoccurring words within specified numbers of words.

As an example. To be or not to be. Would result in "To be or not *to* be." at first pass, and second "To be or not *to* *be*.

The printout though does not preserve the original formatting of the text with line-breaks and such. I'm wondering if that is possible somehow, i guess it's all related to the split or the join.

Comment on Preserve original text formatting. Download Code

Replies are listed 'Best First'.

Re: Preserve original text formatting.
by hippo (Archbishop) on Sep 10, 2015 at 13:27 UTC

i guess it's all related to the split or the join

I would guess so too. You split on /\s+/ but you join with " " so there's clearly going to be a change in the whitespace there if you have anything other than single whitespaces between the words in your input.

Since you check for \b anyway, why not split on that instead and then check each field for an alpha character before doing your comparison? You could then join on ''.

Alternatively look into using one of the many text parsing modules.

[reply]
[d/l]
[select]

Re^2: Preserve original text formatting.

by larsb (Novice) on Sep 10, 2015 at 13:37 UTC

Thanks a lot, certainly seems like a better solution. Will for sure try it out.

I'm quite new to perl and certainly a novice regarding regex so I'm sure there is plenty mistakes on that front.

[reply]

Re^3: Preserve original text formatting.

by locked_user sundialsvc4 (Abbot) on Sep 10, 2015 at 13:56 UTC

If you say that you “are new at this,” then you are certainly off to a good start. :-)

hippo, are there any particular text-parsing modules that you might suggest, to do a thing like this?

Re^4: Preserve original text formatting.

by hippo (Archbishop) on Sep 10, 2015 at 16:46 UTC

Re^4: Preserve original text formatting.

by larsb (Novice) on Sep 10, 2015 at 16:06 UTC

Re: Preserve original text formatting.
by Athanasius (Cardinal) on Sep 10, 2015 at 14:17 UTC

Hello larsb, and welcome to the Monastery!

Another approach is to modify the text of the file by using s///g to replace each repeated word with its marked version. The following script shows one way to do this (but it doesn’t take into account the maximum number of words allowed between repeats):

#! perl
use strict;
use warnings;

my $file = do { local $/; <DATA>; };        # Slurp the whole file int
+o a string

# Make a hash that maps each word to its word count in the file
my %words;
 ++$words{lc $_} for split /\W+/, $file;

# Construct a regular expression to match each word which appears at l
+east twice
my $str = join '|', grep { $words{$_} > 1 } keys %words;
my $re  = qr{($str)}i;

$words{$_} = 0 for keys %words;             # Re-set the word counts t
+o zero

# Mark the second and subsequent occurrences of each word
$file =~ s{$re}{ $words{lc $1}++ ? "*$1*" : $1 }eg;

print $file;

__DATA__
To be or not to be; that is to be the question.
Is that the question? Yes!
[download]

Output:

 0:13 >perl 1369_SoPW.pl
To be or not *to* *be*; that is *to* *be* the question.
*Is* *that* *the* *question*? Yes!

 0:13 >
[download]

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

[reply]
[d/l]
[select]

Re^2: Preserve original text formatting.

by Not_a_Number (Prior) on Sep 10, 2015 at 18:07 UTC

Oops!

__DATA__

To be or not to be?
Today Glastonbury tomorrow Brighton!
[download]

Output:

To be or not *to* *be*?
*To*day Glas*to*nbury *to*morrow Brigh*to*n!
[download]

:-)

Update: Worse (and incomprehensibly to me), replacing __DATA__ with:

To be or not?
Today Glastonbury tomorrow Brighton!
[download]

gives:

T**o** **b**e** **o**r** **n**o**t**?**
**T**o**d**a**y** **G**l**a**s**t**o**n**b**u**r**y** **t**o**m**o**r*
+*r**o**w**
 **B**r**i**g**h**t**o**n**!**
[download]

[reply]
[d/l]
[select]

Re^3: Preserve original text formatting.

by Athanasius (Cardinal) on Sep 11, 2015 at 02:48 UTC

Hello Not_a_Number,

Two excellent catches!

The first problem occurs because the regex is matching parts (substrings) of words. It can be fixed by adding a test for word boundaries (\b) before and after each word in the regex. The second problem occurs when there are no repeated words at all, in which case the regex becmes (?^i:()), which matches the empty string. It can be fixed by an explicit test. Here is a revised script:

#! perl
use strict;
use warnings;
use List::Util qw(any);

my $file = do { local $/; <DATA>; };        # Slurp the whole file int
+o a string

# Make a hash that maps each word to its word count in the file
my %words;
 ++$words{lc $_} for split /\W+/, $file;

# Construct a regular expression to match each word which appears at l
+east twice
my $re;

if (any { $_ > 1 } values %words)
{
    my $str = '\\b' . join('\\b|\\b', grep { $words{$_} > 1 } keys %wo
+rds) . '\\b';
    $re = qr{($str)}i;
}

$words{$_} = 0 for keys %words;             # Re-set the word counts t
+o zero

# Mark the second and subsequent occurrences of each word
$file =~ s{$re}{ $words{lc $1}++ ? "*$1*" : $1 }eg if $re;

print $file;
[download]

Thanks!

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

[reply]
[d/l]
[select]

Re^2: Preserve original text formatting.

by larsb (Novice) on Sep 10, 2015 at 16:09 UTC

Thanks for the welcome and the data.

I have to admit though that this is way above my head, maybe in a year or so i will understand. Very nice to see such short solutions compared to mine.

[reply]

Re: Preserve original text formatting.
by Anonymous Monk on Sep 10, 2015 at 14:48 UTC

#!/usr/bin/perl

# http://perlmonks.org/?node_id=1141535

use strict;
use warnings;

my $numwords = 40;

$_ = "To be or not to be.\n"; # or read from file...

my @previous; # holds the $numwords previous words
s/(\w+)/
  my $match = grep lc $1 eq lc, @previous;
  push @previous, $1;
  @previous > $numwords and shift @previous;
  $match ? "*$1*" : $1 /ge;

print;
[download]

[reply]
[d/l]

Re^2: Preserve original text formatting.

by larsb (Novice) on Sep 10, 2015 at 16:11 UTC

Thanks, that regex really opened up my eyes to what can be accomplished.

[reply]

Re^3: Preserve original text formatting.

by locked_user sundialsvc4 (Abbot) on Sep 10, 2015 at 22:05 UTC

Aye ... and the Perl language will never cease in doing just-exactly that. If you ever pondered, “so, what is all the Fuss about?,” well, “here it is.”

A reply falls below the community's threshold of quality. You may see it by logging in.