Text markup confusion

paurus has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

Please help this Perl novice.

I am attempting to make 30 different matches from the input file to highlight in the output file, so I am searching for a way to automate the process, but I do not know how to create a process for Perl to identify and capitalize the chosen text.

I am wanting (if possible) to read in a reference file with the list of words I want to highlight within the main input file. Then have Perl ~HIGHLIGHT~ the matched text.

My problem is that some of the text (words) are "broken" by punctuation or spaces of an arbitrary length (as if I wanted to match ~A MY~ from the words "saw the panda myself" (using the 'a' from pand~A and the 'my' from MY~self)). It seems as though I can use Reg Exps for the single words, but for these "broken" words and for double words on a line (I want to match ~PANDA~ ~MYSELF~ in "saw the panda myself" on a single line) using reg exps with '/gi' will only select the first given match when using the '|' as a separator (i.e. '=~ /panda|myself/gi'). (Meaning the reg exp only higlights ~PANDA~ and not "myself").

I would show some of my code, but it is embarrasingly inadequate for solving this textual dilemma.

What can I do to have Perl 'intelligently' find (using the reference file) and ~HIGHLIGHT~ the "broken" and "double" words on a line? or will it be that I need to find each word myself and then hard code them into the script?

I am trying to copy the WHOLE file to a NEW file with the ~HIGHLIGHT~s in the new file. (but (I think) I just need help on the text markup part) I apologize for my confusing description of the activity, but I am not sure how to better word it. Thank you in advance for your help and wisdom.

UPDATE:

This is the code I have come up with so far:

# Open puzzle file and print ##
open(IN, "/Users/TMP/Puz.txt") or die "Can't open Puz.txt for reading:
+ $!\n";
@words = <IN>;
foreach $word(@words) {
        chomp($word);
#       print "$word\n";
        push(@words2, $word);
}
close(IN);

# open refernce file for words to find
open REF, "/Users/TMP/Ref_file.txt" or die "Can't open Ref_file.txt fo
+r reading: $!\n";
my @refs = <REF>;
foreach $ref(@refs) {
        chomp($ref);
        push(@refs2, $ref);
}

##############################

# Process array
&find_words(@words2);

# find words subroutine
sub find_words {

        while(@words2) {
                $words = shift(@words2);
                foreach $ref(@refs2) {
                        if($words =~ /$ref/gi) {
                                        print "$`","~\U$&~","$'","\n";
                        }
                }
        }
}
[download]

That's my code in a nutshell.

Thank you for all of the posts so far. They are a GREAT help!

UPDATE (FINAL):

Thank you all for the help. Here is the final code I used for this particular problem:

sub find_words {
print "Here are the answers!\n\n";

        while (@input2) {
                $in = shift(@input2);
                $in =~ s{($regexp)}{ **\U$1\E** }g;
                print "$in\n";
        }
exit;
}
[download]

Comment on Text markup confusion Select or Download Code

Replies are listed 'Best First'.
Re: Text markup confusion by tachyon (Chancellor) on Apr 15, 2004 at 05:05 UTC
You can create a file with one term, phrase of RE per line then just read it into an array and chomp it. You generate a precompiled RE as shown. Then just iterate over the infile lines. Then just print the modified data out to STDOUT or a file. I say lines but you may note the $/ = "\n\n". The $/ is the input record separator (normally \n). By setting it to two newlines we read in whole paragraphs at a time and thus can make matches across line breaks. The RE needs \b boundary conditions to make it match full words. If you want extra chars I suggest \w* for extra word chars and \W+ to represent punctuation or whitespace between words whan looking for phrases. `my @highlights = ( 'foo', 'bar', 'baz', '\wa\W+my\w', # the ...a my... phrase example ); my $re = join '\|', @highlights; $re = qr/\b($re)\b/i; local $/ = "\n\n"; # read in a para at a time while(<DATA>){ s/$re/uc $1/ge; print; } __DATA__ Hello Foo, This is your friend Baz wondering if you would like to go to the bar to talk server foo. Barclays is a foobar bank BTW. So a Perlmonk and a Panda walk into a bar. Know I know you think this is going to be a joke but I saw the panda myself!` [download] cheers tachyon	[reply] [d/l]
Re: Text markup confusion by kvale (Monsignor) on Apr 15, 2004 at 04:44 UTC
You can accomplish this by using the s/// replace function: `my $text = "I saw the panda myself,\nbut Amy did not.\nA var a.my"; my $regexp = qr/(a[ .]*my)/i; $text =~ s/$regexp/'~'.uc $1.'~'/ge; print "$text\n";` [download] gives `I saw the pand~A MY~self, but ~AMY~ did not. A var ~A.MY~` [download] Here, I have tested for optional spaces and periods. Generalization to other punctuation is straightforward. -Mark	[reply] [d/l] [select]
Re: Text markup confusion by leira (Monk) on Apr 15, 2004 at 04:35 UTC
I would show some of my code, but it is embarrasingly inadequate for solving this textual dilemma. I know it can be intimidating to post your code, when you know it's not up to the task and may contain mistakes, but if you give us an idea of what you've done so far, and where you're coming from, it will be easier to help you.	[reply]
Re: Text markup confusion by hv (Prior) on Apr 15, 2004 at 11:12 UTC
If the highlights consist only of letters, then the easiest way to cope with this is probably to construct a regular expression from each that matches (0 or more non-letters) between each pair of letters: `my @words = (qw/ tom dick and harold /); my $pattern = join '\|', map join('[^a-zA-Z]*', split //), @words; my $regexp = qr{$pattern}i; while (<>) { s{($regexp)}{~\U$1\E~}g; print; }` [download] Beware of overlapping matches: the leftmost match will be found, so this would give for example: To c~HAR OLD~ icky pieces of wood you need ~TO M~ake a hot fire ignoring the additional match of 'dick'. Similarly, if one word is a prefix of the other, only the word earlier in the list will be found. If you need different behaviour on such overlaps, there are other approaches depending on what precisely you want it to do. Hugo	[reply] [d/l]