Hi! I hope this question won't seem super lazy. I just have no idea where to start with this.

I have three files. File 1 contains stop words, File 2 contains a list of words and phrases (terms), and File 3 is the file I want to extract information from.

I want to remove the stop words from File 3, and print the remaining words in a list. This part was no problem.

The problem is that before I remove the stop words from File 3, I want to see if any of the terms from File 2 match a string in File 3. I want the longest phrase from File 2 to be searched for in File 3 first, then the second longest, and so on. Each term that matches should be output surrounded by asterisks (*) and excluded from the next part of the script, which removes stop words.

So let's say the files look like this, with the entries separated by commas:

File 1: I, am, the, of, and

File 2: manager of sales

File 3: I am the senior manager of sales and of marketing

Output:

senior

*manager of sales*

marketing

I'm so sorry, but I just don't even know where to start with this.

Here's the script, which right now only removes stop words, nothing else:

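#!/usr/bin/perl
use warnings;
use strict;

my %stops;
my %terms;

# Stop words: each line of the file becomes one key, lowercased so it
# can match the lowercased text below.
open(my $stop_fh, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
while (<$stop_fh>) {
    chomp;
    $stops{lc $_} = 1;
}
close $stop_fh;

# Terms (words and phrases): each line of the file becomes one key.
open(my $term_fh, '<', $ARGV[1]) or die "Cannot open $ARGV[1]: $!";
while (<$term_fh>) {
    chomp;
    $terms{$_} = 1;
}
close $term_fh;

open(my $text_fh, '<', $ARGV[2]) or die "Cannot open $ARGV[2]: $!";
while (<$text_fh>) {
    chomp;

    # Starting with the longest term from ARGV[1], then going to the next
    # largest, and so on, if the term also exists in ARGV[2], surround it
    # by *, print the term, and remove the term from further processing.

    # After that, remove the stop words from the remainder of the file
    # that didn't match a string in ARGV[1].
    $_ =~ tr/A-Z/a-z/;
    my @words = split /[^a-z0-9]+/, $_;
    for my $word (@words) {
        next unless length $word;    # skip the empty leading field split can return
        # The ++ also marks the word as seen, so each word prints only once.
        unless ($stops{$word}++) {
            print "$word\n";
        }
    }
}
close $text_fh;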
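One way to approach the longest-first matching described above is to sort the terms by length, then cut each match out of the text before the stop-word pass runs. This is only a minimal sketch under some assumptions: `mark_terms` is a hypothetical helper name, not part of the script above. It takes one line of text plus array references of stop words and terms, lowercases everything, and only matches a term when it is not glued to other letters or digits. Reading the three files is left to the caller.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the output lines for one line of text.
# Matched terms come back wrapped in *...*; other words come back with
# stop words removed.
sub mark_terms {
    my ($line, $stops, $terms) = @_;

    my %is_stop;
    $is_stop{ lc $_ } = 1 for @$stops;

    # Longest term first, so a long phrase is tried before any shorter one.
    my @sorted = sort { length($b) <=> length($a) }
                 grep { length($_) }
                 map  { lc $_ } @$terms;

    # Each chunk is [text, is_term]. Start with the whole line unmatched.
    my @chunks = ( [ lc $line, 0 ] );

    for my $term (@sorted) {
        # Assumption: a term must not touch other letters or digits.
        my $re = qr/(?<![a-z0-9])\Q$term\E(?![a-z0-9])/;
        my @next;
        for my $chunk (@chunks) {
            # Chunks that are already a matched term are left alone.
            if ( $chunk->[1] ) { push @next, $chunk; next; }
            my $text = $chunk->[0];
            while ( $text =~ $re ) {
                # $-[0] and $+[0] are the start and end offsets of the match.
                push @next, [ substr( $text, 0, $-[0] ), 0 ], [ $term, 1 ];
                $text = substr( $text, $+[0] );
            }
            push @next, [ $text, 0 ];
        }
        @chunks = @next;
    }

    my @out;
    for my $chunk (@chunks) {
        my ( $text, $is_term ) = @$chunk;
        if ($is_term) { push @out, "*$text*"; next; }
        # Only the unmatched text goes through stop-word removal.
        push @out, grep { length($_) && !$is_stop{$_} }
                   split /[^a-z0-9]+/, $text;
    }
    return @out;
}

# Usage with the example data from the post (entries split on commas).
my @stops = split /\s*,\s*/, 'I, am, the, of, and';
my @terms = split /\s*,\s*/, 'manager of sales';
print "$_\n"
    for mark_terms( 'I am the senior manager of sales and of marketing',
                    \@stops, \@terms );
```

With the example data this prints senior, *manager of sales*, and marketing on separate lines. Processing the terms one at a time, rather than joining them into a single alternation regex, keeps the "longest first across the whole line" behaviour: a single alternation would prefer whichever term starts furthest to the left, even when a longer term overlaps it later in the line.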

In reply to Match strings in order of character length, and remove the string from further processing by ScarletRoxanne