Match strings in order of character length, and remove the string from further processing

ScarletRoxanne has asked for the wisdom of the Perl Monks concerning the following question:

Hi! I hope this question won't seem super lazy. I just have no idea where to start with this.

I have three files. File 1 contains stop words, File 2 contains a list of words and phrases (terms), and File 3 is the file I want to extract information from.

I want to remove the stop words from File 3, and print the remaining words in a list. This part was no problem.

The problem is that before I remove the stop words from File 3, I want to see if any of the terms from File 2 match a string in File 3. I want the longest phrase from File 2 to be searched for in File 3 first, then the second longest, and so on. Then I want to output the term surrounded by *, and remove it from being processed by the next part of the script, which removes stop words.

So let's say the files look like this and are split on the ,

File 1: I, am, the, of, and

File 2: manager of sales

File 3: I am the senior manager of sales and of marketing

Output:

senior

*manager of sales*

marketing

I'm so sorry, but I just don't even know where to start with this.

Here's the script, which right now only removes stop words, nothing else:

#!/usr/bin/perl
use warnings;
use strict;

my %stops;
my %terms;

open (FILE, $ARGV[0]);
while (<FILE>) {
    chomp;
    $stops{$_} = 1;
}

open (FILE, $ARGV[1]);
while (<FILE>) {
    chomp;
    $terms{$_} = 1;
}

open (FILE, $ARGV[2]);
while (<FILE>) {
    chomp;

#Starting with the longest term from ARGV[1], then going to the next l
+argest, and so on, if the term also exists in ARGV[2], surround it by
+ *, print the term, and remove the term from further processing. 

#after that, remove the stop words from the remainder of the file that
+ didn't match a string in [ARGV[1]
    
    $_ =~ tr/A-Z/a-z/;
    my @words = split ('[^a-z0-9]', $_);
    for my $word (@words) {
    unless ($stops{$word}++){
        print "$word\n" 
    }
    }
}
[download]

Comment on Match strings in order of character length, and remove the string from further processing Download Code

Replies are listed 'Best First'.

Re: Match strings in order of character length, and remove the string from further processing (updated)
by haukex (Archbishop) on May 01, 2019 at 05:24 UTC

I would suggest not splitting the string into an array (Update: at least not by words), since this kind of matching sounds like it'd be easier to do with two regexes - one for the stop words, and one for the phrases. Have a look at my node Building Regex Alternations Dynamically. Then, I'd suggest you process the input string piece by piece, because that should make your requirement of not replacing stop words inside the longer matches possible - see "Global Matching" in perlretut, and a more complex example of the usage of m/\G.../gc in perlop under "\G assertion".

Update: Here's a solution that takes a simpler route than m/\G.../gc, using split:

use warnings;
use strict;

my %stops = map {$_=>1} qw/ I am the of and /;
my %terms = ( 'manager of sales' => 1 );
my $str = 'I am the senior manager of sales and of marketing';

my ($re_stops) = map {qr/\b(?:$_)\b/i} join '|', map {quotemeta}
    sort { length $b <=> length $a or $a cmp $b } keys %stops;
my ($re_terms) = map {qr/\b(?:$_)\b/i} join '|', map {quotemeta}
    sort { length $b <=> length $a or $a cmp $b } keys %terms;

my @s = split /($re_terms)/, $str;
for my $i (0..$#s) {
    if ($i%2) {
        print "*", $s[$i], "*\n";
    }
    else {
        $s[$i] =~ s/$re_stops//g;
        $s[$i] =~ s/^\s+|\s+$//g;
        print $s[$i], "\n";
    }
}

__END__

senior
*manager of sales*
marketing
[download]

Update 2: Added the case-insenstive matching flag /i to the regexes and fixed a typo in the text.

[reply]
[d/l]
[select]

Re: Match strings in order of character length, and remove the string from further processing
by Athanasius (Archbishop) on May 01, 2019 at 07:37 UTC

Hello ScarletRoxanne,

Here’s an approach which replaces each phrase/term with a temporary marker, then removes stopwords, then replaces the markers with their original terms:

use strict;
use warnings;
use Const::Fast;
use Data::Dump;

const my $DELIM => '\034';

my %stops =  map { lc $_ => 1 } qw( I am the of and you are );
my @terms =  ('manager of sales', 'chairman of the board');
   @terms =  sort { length $b <=> length $a } @terms;   # longest firs
+t
my $file3 = 'I am the Senior Manager of Sales and of Marketing. ' .
            'You are the Chairman of the Board of Directors.';
   $file3 =~ tr/A-Z/a-z/;               # convert to lower case

   # replace terms with temporary markers
   $file3 =~ s{$terms[$_]}{$DELIM$_$DELIM}gi for 0 .. $#terms;
my @file3 =  split /\s+/, $file3;
   @file3 =  grep { ! exists $stops{$_} } @file3;

for my $entry (@file3)
{
    if ($entry =~ /\Q$DELIM\E(\d+)\Q$DELIM\E/)
    {
        $entry = '*' . $terms[$1] . '*';
    }
    else
    {
        $entry =~ s{[[:punct:]]}{}g;    # remove punctuation
    }
}

print "$_\n" for @file3;
[download]

Output:

17:35 >perl 1997_SoPW.pl
senior
*manager of sales*
marketing
*chairman of the board*
directors

17:35 >
[download]

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

[reply]
[d/l]
[select]