ScarletRoxanne has asked for the wisdom of the Perl Monks concerning the following question:
Hi! I hope this question won't seem super lazy. I just have no idea where to start with this.
I have three files. File 1 contains stop words, File 2 contains a list of words and phrases (terms), and File 3 is the file I want to extract information from.
I want to remove the stop words from File 3, and print the remaining words in a list. This part was no problem.
The problem is that before I remove the stop words from File 3, I want to see if any of the terms from File 2 match a string in File 3. I want the longest phrase from File 2 to be searched for in File 3 first, then the second longest, and so on. Then I want to output the term surrounded by *, and remove it from being processed by the next part of the script, which removes stop words.
So let's say the files look like this, with the entries split on commas:
File 1: I, am, the, of, and
File 2: manager of sales
File 3: I am the senior manager of sales and of marketing
Output:
senior
*manager of sales*
marketing
I'm so sorry, but I just don't even know where to start with this.
Here's the script, which right now only removes stop words, nothing else:
    #!/usr/bin/perl
    use warnings;
    use strict;

    my %stops;
    my %terms;

    open (FILE, $ARGV[0]);
    while (<FILE>) {
        chomp;
        $stops{$_} = 1;
    }

    open (FILE, $ARGV[1]);
    while (<FILE>) {
        chomp;
        $terms{$_} = 1;
    }

    open (FILE, $ARGV[2]);
    while (<FILE>) {
        chomp;
        # Starting with the longest term from ARGV[1], then going to the next
        # largest, and so on, if the term also exists in ARGV[2], surround it
        # by *, print the term, and remove the term from further processing.
        # After that, remove the stop words from the remainder of the file
        # that didn't match a string in ARGV[1].
        $_ =~ tr/A-Z/a-z/;
        my @words = split ('[^a-z0-9]', $_);
        for my $word (@words) {
            unless ($stops{$word}++) {
                print "$word\n"
            }
        }
    }
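For what it's worth, here is one possible way to do the part described in the comments, as a rough sketch rather than a definitive answer. The idea is to keep the line as a list of segments, try each term longest first, and split every still-unmatched segment around that term, so a matched term is never looked at again. The sub name `extract_terms` and its interface (one line of text plus two array refs) are made up for illustration and are not part of the script above. It also assumes terms begin and end with word characters, since it relies on `\b`, and it lowercases the stop words and terms so they match the lowercased text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name and interface are assumptions, not from the
# original script). Takes one line of text plus array refs of stop words
# and terms, and returns the output lines in text order.
sub extract_terms {
    my ($line, $stops, $terms) = @_;

    my %stop;
    $stop{ lc($_) } = 1 for @$stops;

    # Start with the whole lowercased line as a single unmatched segment.
    my @segments = ({ text => lc($line), is_term => 0 });

    # Try the terms longest first, as described in the question.
    for my $term (sort { length($b) <=> length($a) } map { lc($_) } @$terms) {
        next unless length $term;
        my $re = qr/\b\Q$term\E\b/;    # assumes the term starts/ends with word chars
        my @next;
        for my $seg (@segments) {
            # Already-matched terms are excluded from further processing.
            if ($seg->{is_term}) { push @next, $seg; next; }
            # split with a capturing group also returns the matches,
            # which land at the odd positions of the result.
            my $i = 0;
            for my $piece (split /($re)/, $seg->{text}) {
                push @next, { text => $piece, is_term => $i++ % 2 };
            }
        }
        @segments = @next;
    }

    my @out;
    for my $seg (@segments) {
        if ($seg->{is_term}) { push @out, "*$seg->{text}*"; next; }
        # Remaining text: split into words and drop the stop words.
        for my $word (grep { length($_) } split /[^a-z0-9]+/, $seg->{text}) {
            push @out, $word unless $stop{$word};
        }
    }
    return @out;
}

# Demo with the sample data from the question (runs only as a script).
unless (caller) {
    my @stops = ('I', 'am', 'the', 'of', 'and');
    my @terms = ('manager of sales');
    print "$_\n"
        for extract_terms('I am the senior manager of sales and of marketing',
                          \@stops, \@terms);
}
```

With the sample data this prints `senior`, `*manager of sales*` and `marketing`, one per line. Unlike the `$stops{$word}++` test in the script above, this sketch does not suppress repeated words; if that de-duplication is wanted, it would need to be added back.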
Re: Match strings in order of character length, and remove the string from further processing (updated)
by haukex (Archbishop) on May 01, 2019 at 05:24 UTC
Re: Match strings in order of character length, and remove the string from further processing
by Athanasius (Archbishop) on May 01, 2019 at 07:37 UTC