sulfericacid has asked for the wisdom of the Perl Monks concerning the following question:

The code below prints the entire line that the search is found on rather than just what I'm trying to get. It's also printing <p> and </p> to browser while I'm trying to pull back only the text rather than the codes. Anyone have any suggestions?
#!/usr/bin/perl -w
use CGI qw(:all);
use Fcntl qw(:flock);
use HTTP::Request;
print header;
open(F, "pageinfo.htm");
while(<F>) {
    print "$_";
}
if(!param) {
    exit;
}
$fristring = param('urlinfo');
use LWP::UserAgent;
$ua = new LWP::UserAgent;
$ua->agent("Mozilla/5.0");
$ua->timeout('30');
$req = new HTTP::Request GET => $fristring;
$req->header('Accept' => 'text/html');
$result = $ua->request($req);
$_ = $result->content;
$looper = 0;
while($looper < 5000) {
    s/</&lt;/;
    $looper++;
}
@lines = split /\n/, $_;
print "<textarea>";
foreach $line (@lines) {
    if($line =~ /test/) {
        $1 =~ s/<p>//gi;
        print $1;
    }
    if($line =~ /meta/) { print "$line" };
}
# print $_;
print "<\/textarea>";


"Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

sulfericacid

Escaped html in first 'graph - dvergin 2003-03-03

Replies are listed 'Best First'.
Re: =~ and substitutions
by chromatic (Archbishop) on Mar 03, 2003 at 20:24 UTC

    Here is where your problem lies (reformatted for readability):

    foreach $line (@lines) {
        if($line =~ /test/) {
            $1 =~ s/<p>//gi;
            print $1;
        }
        if($line =~ /meta/) { print "$line" };
    }

    You're looping through all of the lines of content. You test each line to see if it contains the sequence 'test'. If so, you perform a substitution on $1, which contains the first capture of the most recent successful regular expression. Unfortunately, there have been no captures. You print the modified variable, which is doubly-odd, as it's a read-only value. If the line contains the sequence 'meta', you print it with no substitutions.

    It's unclear what you're trying to do, unless you just want to strip out opening paragraph tags. That might be:

    if ($line =~ /test/) { $line =~ s/<p>//g; print $line; }

    I think you'll have to give an example of what might be in $line before you scrub it and an example of the scrubbed version, though.
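
For what it's worth, here is a minimal sketch of one possible reading of the question: print only the lines that mention a word, with the paragraph tags stripped from a copy of the line rather than from $1. The helper name scrub_line and the sample lines are made up for illustration, since the real page content was never posted, and stripping the closing </p> as well goes slightly beyond the one-liner above.

```perl
use strict;
use warnings;

# Hypothetical helper: if $line mentions $word, return a copy of the line
# with <p> and </p> tags removed; otherwise return nothing (undef).
sub scrub_line {
    my ($line, $word) = @_;
    return unless $line =~ /\Q$word\E/;
    (my $clean = $line) =~ s{</?p>}{}gi;   # copy first, then edit the copy
    return $clean;
}

# Made-up sample input standing in for the fetched page.
my @lines = (
    '<p>this is a test</p>',
    '<meta name="x">',
    '<P>another test line</P>',
);

for my $line (@lines) {
    my $clean = scrub_line($line, 'test');
    print "$clean\n" if defined $clean;
}
```

The substitution runs on an ordinary variable, so it does not depend on any capture having been made.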

Re: =~ and substitutions
by dws (Chancellor) on Mar 03, 2003 at 20:20 UTC
    It's also printing <p> and </p> to browser while I'm trying to pull back only the text rather than the codes.

    Where, precisely, do you think $1 is getting set? If your answer is anything other than "oops", look again.

    Also, if you're emitting stuff into a <textarea> block, you need to escape certain characters as HTML entities, notably '<', '>', and '&'.
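
A rough sketch of that escaping, using only core Perl. The helper name escape_html and the sample string are assumptions for illustration; modules such as HTML::Entities offer the same service if one is installed.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in HTML.
# '&' is handled first so the entities added afterwards are not re-escaped.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

my $content = '<p>Fish & chips</p>';   # made-up stand-in for the fetched page
print "<textarea>", escape_html($content), "</textarea>\n";
```

The /g modifier replaces every occurrence in one pass, so no counting loop is needed. Escaping is probably best left until just before printing: once '<' has become '&lt;', a pattern such as <p> can no longer match.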

Re: =~ and substitutions
by tall_man (Parson) on Mar 03, 2003 at 19:57 UTC
    I can't see why you didn't get the error "Modification of a read-only value attempted" when you tried to substitute using $1. Variables like that are only used for reading the results of a match, and they only work if capturing parentheses are used. For example:
    if($line =~ /(test)/) { print $1,"\n"; }
    That would print a line saying "test" for each line containing the string "test".

    It's not clear from your post what you mean by "just what I'm trying to get." Would you please clarify what you are trying to do?

      Let's say I am searching for the word "test". When the results come back, I get the entire line of code with "test" inside rather than just the single word I was looking for. And my substitution isn't removing the code tags when all I want is the text.

      "Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

      sulfericacid
        Ok, here's a sample program with some input data and resulting output. Is this what you mean?
        use strict;
        my @lines = <DATA>;
        my $line;
        foreach $line (@lines) {
            if($line =~ /(test)/) {
                print $1,"\n";
                $line =~ s/<p>//gi;
            }
            if($line =~ /meta/) { print "$line" };
        }
        __END__
        test <p> hello meta
        fadfdsa <p> meta
        fadfasd
        test
        test <P> test meta
        <p> fdafasdg
        Every line containing "test" will cause a line with the word "test" in the output. Lines containing "test" and "meta" will have paragraph code tags stripped. Lines containing "meta" alone will be printed unchanged. Other lines will be skipped.
        test
        test  hello meta
        fadfdsa <p> meta
        test
        test
        test  test meta
Re: =~ and substitutions
by jobber (Sexton) on Mar 03, 2003 at 20:21 UTC
    Hello, this is how to get the code between the tags:
    $_ =~ /(<p>)(.*)(<\/p>)/i; print $2;
    The number $2 picks up the value since it is in the second set of ().
    hope this helps

      Careful! There are a few gotchas in that snippet. First, there's a greedy match. If you have a string containing <p>one paragraph</p><p>another paragraph</p>, $2 will contain one paragraph</p><p>another paragraph. Besides that, if the regex fails, $2 won't contain what you expect: the capture variables keep whatever the last successful match put in them. You can also leave off the default variable and certainly don't need to capture the paragraph tags. I'd prefer:

      print $1 if m!<p>(.*?)</p>!;

      Of course, if there's anything more complicated than bare paragraph tags, you're better off using an HTML parser module.
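
The greedy-match and stale-capture gotchas can be seen in a short script. This is only an illustrative sketch: the variable names are made up, and the input is the two-paragraph string from the reply above.

```perl
use strict;
use warnings;

my $html = '<p>one paragraph</p><p>another paragraph</p>';

# Greedy: .* runs on to the last </p> in the string.
my ($greedy) = $html =~ m!<p>(.*)</p>!;

# Non-greedy: .*? stops at the first </p>.
my ($lazy) = $html =~ m!<p>(.*?)</p>!;

# In list context with /g, every paragraph is collected.
my @all = $html =~ m!<p>(.*?)</p>!g;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "all:    ", join(' | ', @all), "\n";

# A failed match leaves the capture variables alone, so an unguarded
# 'print $1' can show text from an earlier, unrelated match.
my $first_ok  = '<p>first</p>' =~ m!<p>(.*?)</p>!;   # succeeds, sets $1
my $second_ok = 'no tags here' =~ m!<p>(.*?)</p>!;   # fails, $1 untouched
my $stale = $1;
print "stale:  $stale\n";
```

Testing the match before using the capture, as in print $1 if m!...!, avoids the second problem.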