in reply to Pattern Search on HTML source.

Although you have correctly made your regex non-greedy, it "can't get a match" because 1nd (in the regex) does not match 1st (in the data):

    my $output = "<!-- 1st table -->What I want 1<!-- /1st table -->more stuff...<!-- 2st movie -->What I want 2<!-- /2st movie -->...more stuff...<!-- 3st movie -->What I want 3<!-- /3st movie -->...more stuff";

    if ($output =~ /<!-- 1st table -->(.*?)<!-- \/1st table -->/g) {
        print $1;
    }
    else {
        print "Nothing Here!";
    }

cheerfully spits out
perl 23.pl
What I want 1
However:
  1. Your /g isn't doing what you think. You've tried to specify a single set of tags. /g will find the content between them if they're repeated, but it won't find "2st [sic] movie".
  2. Your pseudo-html makes no sense: tables without rows or data cells?
  3. Using LWP or similar, if you're not, could save you the trouble of saving the source data as a text file
  4. It's a tad peculiar to name the input FH in your code as "OUTPUT"
  5. And, if you're going to parse HTML, use a module. There are just too many ways to go wrong while rolling your own.
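
To illustrate point 1, here is a minimal sketch (the shortened sample string is my own, modelled on the data above). With literal tags in the pattern, /g in list context returns every match of that one pair and nothing else:

```perl
use strict;
use warnings;

# Shortened sample data, modelled on the $output string in the thread.
my $output = "<!-- 1st table -->What I want 1<!-- /1st table -->more stuff..."
           . "<!-- 2st movie -->What I want 2<!-- /2st movie -->...more stuff";

# In list context, /g returns the capture from every match.
# The tags are literal, though, so only "1st table" pairs can ever match.
my @found = $output =~ /<!-- 1st table -->(.*?)<!-- \/1st table -->/g;

print scalar(@found), " match(es): @found\n";
```

The "2st movie" pair is never considered, because nothing in the pattern can match its tags.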

Re^2: Pattern Search on HTML source.
by Anonymous Monk on Dec 31, 2007 at 19:05 UTC
    The problem is when the tags contain something like:

    my $output = "<!-- 1st table --> What I want 1<!-- /1st table -->more stuff...<!-- 2st movie --> What I want 2<!-- /2st movie -->...more stuff...<!-- 3st movie -->What I want 3<!-- /3st movie -->...more stuff";


    With a carriage return or something like that in there, I can't get it to match.
      am: use the download link beneath the code to capture it rather than copy-pasting, or remove the newlines from what you copy-pasted until you have the $output as a single line in your editor.
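
If the newlines really are part of the data rather than a copy-paste artifact, a different approach from the one suggested above is the /s modifier, which lets "." match a newline as well. A minimal sketch, assuming (my guess at the data, not something confirmed in the thread) that a newline follows the opening tag:

```perl
use strict;
use warnings;

# Hypothetical data: newlines directly inside the tag pair.
my $output = "<!-- 1st table -->\nWhat I want 1\n<!-- /1st table -->more stuff...";

# Without /s, "." never matches "\n", so (.*?) cannot span the line break.
my $without_s = $output =~ /<!-- 1st table -->(.*?)<!-- \/1st table -->/ ? 1 : 0;

# With /s, "." also matches "\n"; the \s* on each side keeps the
# surrounding whitespace out of the capture.
my ($content) = $output =~ /<!-- 1st table -->\s*(.*?)\s*<!-- \/1st table -->/s;

print "without /s: ", ($without_s ? "match" : "no match"), "\n";
print "with /s: $content\n";
```

The first pattern fails and the second captures "What I want 1" without the line breaks around it.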

      And, updating my previous reply: I realized, belatedly, that you appear to want to capture the contents of all the tag pairs, rather than just the first. Sorry; the code I posted captures only the first, and so far I haven't worked out a simple (aka "elegant") and understandable way to do them all with a regex. Cf. the advice to use an HTML parser or (new suggestion) a module designed to deal with matching pairs. Perhaps wiser monks will offer more particular suggestions.
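
For what it's worth, one possible regex sketch for the all-pairs case uses a backreference so that each closing comment must repeat the name from its opening comment. It assumes every tag name is exactly two words, as in the sample data, and that pairs are never nested; the newline in the second pair is added for illustration. Treat it as a starting point, not a substitute for a real parser:

```perl
use strict;
use warnings;

my $output = "<!-- 1st table -->What I want 1<!-- /1st table -->more stuff..."
           . "<!-- 2st movie -->\nWhat I want 2<!-- /2st movie -->...more stuff..."
           . "<!-- 3st movie -->What I want 3<!-- /3st movie -->...more stuff";

my @contents;

# $1 is the tag name (e.g. "1st table"), $2 is the text between the pair.
# \1 forces the closing comment to carry the same name after the "/".
# /g walks through the string one pair at a time; /s lets "." cross newlines.
while ($output =~ /<!-- (\w+ \w+) -->\s*(.*?)\s*<!-- \/\1 -->/gs) {
    push @contents, $2;
}

print "$_\n" for @contents;
```

Each pass of the loop resumes where the previous match ended, so @contents ends up holding the text from all three pairs in order.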