
Hi! I hope this question won't seem super lazy. I just have no idea where to start with this.

I have three files. File 1 contains stop words, File 2 contains a list of words and phrases (terms), and File 3 is the file I want to extract information from.

I want to remove the stop words from File 3, and print the remaining words in a list. This part was no problem.

The problem is that before I remove the stop words from File 3, I want to check whether any of the terms from File 2 occur in File 3. The longest phrase from File 2 should be searched for first, then the second longest, and so on. Each matched term should be printed surrounded by *, and then excluded from the next part of the script, which removes stop words.
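The "longest first" ordering on its own could look roughly like the sketch below. The sample terms are made up for illustration and are not the real contents of File 2; ties in length fall back to alphabetical order, which is an assumption since nothing above says how ties should be handled.

```perl
use strict;
use warnings;

# Hypothetical sample terms, not taken from the real File 2.
my @terms = ('sales', 'manager of sales', 'senior manager');

# Sort by descending character length; equal lengths fall back to
# alphabetical order so the result is deterministic.
my @ordered = sort { length($b) <=> length($a) || $a cmp $b } @terms;

print "$_\n" for @ordered;
```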

So let's say the files look like this, with entries separated by commas:

File 1: I, am, the, of, and

File 2: manager of sales

File 3: I am the senior manager of sales and of marketing

Output:

senior

*manager of sales*

marketing

I'm so sorry, but I just don't even know where to start with this.

Here's the script, which right now only removes stop words, nothing else:

#!/usr/bin/perl
use warnings;
use strict;

my %stops;
my %terms;

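# Load the stop words (each line of the file becomes one entry).
open(my $stops_fh, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
while (<$stops_fh>) {
    chomp;
    $stops{$_} = 1;
}
close $stops_fh;

# Load the terms (each line of the file becomes one entry).
open(my $terms_fh, '<', $ARGV[1]) or die "Cannot open $ARGV[1]: $!";
while (<$terms_fh>) {
    chomp;
    $terms{$_} = 1;
}
close $terms_fh;

# Process the text file line by line.
open(my $text_fh, '<', $ARGV[2]) or die "Cannot open $ARGV[2]: $!";
while (<$text_fh>) {
    chomp;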

    # Starting with the longest term from ARGV[1], then going to the next
    # largest, and so on, if the term also exists in ARGV[2], surround it
    # by *, print the term, and remove the term from further processing.

    # After that, remove the stop words from the remainder of the file
    # that didn't match a string in ARGV[1].
    
    tr/A-Z/a-z/;
    my @words = split /[^a-z0-9]/, $_;
    for my $word (@words) {
        # Post-increment marks the word as seen, so each non-stop word prints only once.
        unless ($stops{$word}++) {
            print "$word\n";
        }
    }
}
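For reference, the missing step described in the comments could be sketched along these lines. This is only one possible approach under stated assumptions, not a definitive answer: `extract_terms` is a hypothetical helper name, terms are matched case-insensitively on whole words (bounded by anything other than a-z or 0-9), and the sample data is the File 1 / File 2 / File 3 example from the question. Each term is tried in order of descending length against the parts of the line that no longer term has already claimed, and only the unclaimed parts go through stop-word removal.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): returns the output
# items for one line of text, given a stop-word hash ref and a term array ref.
sub extract_terms {
    my ($line, $stops, $terms) = @_;

    # Each segment is [text, is_term]; the whole line starts out unclaimed.
    my @segments = ([lc $line, 0]);

    # Try the longest term first, then the next longest, and so on.
    my @ordered = sort { length($b) <=> length($a) || $a cmp $b }
                  grep { length($_) } map { lc $_ } @$terms;

    for my $term (@ordered) {
        my $re = quotemeta $term;
        my @next;
        for my $seg (@segments) {
            if ($seg->[1]) { push @next, $seg; next; }   # already claimed
            # The capturing group makes split return the matched term too,
            # so odd-numbered pieces are terms and even-numbered ones are not.
            my @parts = split /(?<![a-z0-9])($re)(?![a-z0-9])/, $seg->[0], -1;
            push @next, map { [$parts[$_], $_ % 2] } 0 .. $#parts;
        }
        @segments = @next;
    }

    my @out;
    for my $seg (@segments) {
        my ($text, $is_term) = @$seg;
        if ($is_term) {
            push @out, "*$text*";    # matched terms skip stop-word removal
            next;
        }
        # Unclaimed text: split into words and drop empties and stop words.
        push @out, grep { length($_) && !$stops->{$_} } split /[^a-z0-9]+/, $text;
    }
    return @out;
}

# Sample data from the question; stop words are lowercased to match the text.
my %stops;
$stops{lc $_} = 1 for qw(I am the of and);
my @terms = ('manager of sales');

print "$_\n" for extract_terms(
    'I am the senior manager of sales and of marketing', \%stops, \@terms);
# prints "senior", "*manager of sales*" and "marketing", one per line
```

Unlike the script above, this sketch does not suppress repeated words; the `$stops{$word}++` trick could be added back in the final `grep` if that behaviour is wanted.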

In reply to Match strings in order of character length, and remove the string from further processing by ScarletRoxanne
