in reply to Re^5: regexp over multiple lines
in thread regexp over multiple lines

Well, I've been able to slurp the files and read the data into arrays and it makes the coding much faster and easier. I no longer have to use anchor points and offset values and other messy stuff. I'm not getting the exact results I want yet but I'm getting there.

I have another question:

Is it possible to limit the scope of the regexp I'm using (i.e. limit the search to an area I define by a regexp which defines the search block of text)? For example:

<p id=paragraph_1>
<a href="http://www.link1.com">Link1</a>
<a href="http://www.link2.com">Link2</a>
<a href="http://www.link3.com">Link3</a>
</p>
<p id=paragraph_2>
<a href="http://www.link4.com">Link4</a>
<a href="http://www.link5.com">Link5</a>
<a href="http://www.link6.com">Link6</a>
</p>

If I use a regexp to parse the names of the above html links, I'm going to get all of them. What if I only want the ones within the paragraph_2 tags, how would I do that?

Here's a dummy example of code I already have:

local($/, *WEB_DATA); # sets $/ to undef for you and when the scope exits it will revert $/ back to its previous value (most likely "\n")
open (WEB_DATA, "<$myFilename.tmp");
my $myData = <WEB_DATA>;
close (WEB_DATA);
my @linkName = $myData =~ m/regexp for linkName/g;

I'm not sure if I've described it correctly, but what I want is to use a regexp like /<p id=paragraph_1>.+?<\/p>/ to define where I want to look, and another regexp to define what data I want to parse within this block. I hope that makes sense.
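
One way the two-step idea above could look on a slurped string. This is only a minimal sketch: the sample data is inlined in $myData instead of being read from a file, $block is an illustrative name, and the link regexp is an assumption about what "regexp for linkName" might be. It also assumes paragraphs are not nested and that the opening tag is written exactly as in the sample.

```perl
use strict;
use warnings;

# Sample data inlined here; in practice $myData would hold the slurped file.
my $myData = <<'HTML';
<p id=paragraph_1>
<a href="http://www.link1.com">Link1</a>
<a href="http://www.link2.com">Link2</a>
<a href="http://www.link3.com">Link3</a>
</p>
<p id=paragraph_2>
<a href="http://www.link4.com">Link4</a>
<a href="http://www.link5.com">Link5</a>
<a href="http://www.link6.com">Link6</a>
</p>
HTML

my @linkName;

# Step 1: capture only the text between the paragraph_2 tags.
# The /s modifier lets "." match newlines, so the block may span lines.
if ( my ($block) = $myData =~ m{<p id=paragraph_2>(.*?)</p>}s ) {

    # Step 2: run the link regexp against that block only.
    # In list context, m//g returns every captured link name.
    @linkName = $block =~ m{<a\s[^>]*>([^<]*)</a>}g;
}

print "$_\n" for @linkName;
```

The non-greedy (.*?) stops at the first </p> after the opening tag, which is what keeps the second regexp from seeing links in other paragraphs.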

Re^7: regexp over multiple lines
by ww (Archbishop) on Aug 04, 2011 at 18:21 UTC

    Here's one way (using basically the method you describe in the last para above (++), short-circuited to skip any <para id="n"> unless "n" is "2"):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use 5.012;

    # 918508 wants linknames from para 2 only

    my @data = (
        '<p id=paragraph_1>',
        '<a href="http://www.link1.com">Link1</a>',
        '<a href="http://www.link2.com">Link2</a>',
        '<a href="http://www.link3.com">Link3</a>',
        '</p>',
        '<p id=paragraph_2>',
        '<a href="http://www.link4.com">Link4</a>',
        '<a href="http://www.link5.com">Link5</a>',
        '<a href="http://www.link6.com">Link6</a>',
        '</p>',
    );

    my (@linkName, $linkName);
    my $flag = 0;

    for my $data (@data) {
        chomp $data;
        if ( $data =~ /<p id=paragraph_2>/ ) {
            # when the above is true, we've found para 2
            $flag = 1;
        }
        if ( $flag && ($data !~ /<p id=paragraph_2>/) ) {
            # now, we want to skip the data -- the para heading --
            # the first time we arrive here, but if it's not the heading,
            # then capture the link title
            if ( $data =~ m/<a href=.+>(Link\d)<.+/ ) {
                $linkName = $1;
                push @linkName, $linkName;
            }
        }
    }

    for my $extracted (@linkName) {
        say $extracted;
    }

    Output:

    Link4
    Link5
    Link6

    Generalization -- to suit specific needs -- is left as an exercise for the OP.
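
    For what it's worth, one possible direction for that generalization is sketched below. It is not part of the code above: links_in_paragraph is a hypothetical helper name, and the sketch assumes each tag sits on its own line, that paragraphs are not nested, and that the id may or may not be quoted. Unlike the flag above, this one is cleared again at the closing </p>, so later paragraphs are skipped as well.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: returns the link names found inside the paragraph
    # whose id is $id, given the document as a list of lines.
    sub links_in_paragraph {
        my ($id, @lines) = @_;
        my (@names, $inside);
        for my $line (@lines) {
            # Opening tag of the wanted paragraph: start collecting.
            if ( $line =~ /<p id="?\Q$id\E"?>/ ) { $inside = 1; next; }
            # Any closing </p>: stop collecting.
            if ( $line =~ m{</p>} )             { $inside = 0; next; }
            # Inside the wanted paragraph: keep the text of each link.
            push @names, $1 if $inside && $line =~ m{<a\s[^>]*>([^<]*)</a>};
        }
        return @names;
    }

    my @data = (
        '<p id=paragraph_1>',
        '<a href="http://www.link1.com">Link1</a>',
        '<a href="http://www.link2.com">Link2</a>',
        '<a href="http://www.link3.com">Link3</a>',
        '</p>',
        '<p id=paragraph_2>',
        '<a href="http://www.link4.com">Link4</a>',
        '<a href="http://www.link5.com">Link5</a>',
        '<a href="http://www.link6.com">Link6</a>',
        '</p>',
    );

    my @linkName = links_in_paragraph('paragraph_2', @data);
    print "$_\n" for @linkName;
    ```

    The \Q...\E around $id quotes any regexp metacharacters in the id, so the helper can be called with ids that contain characters such as "." or "+".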

    And BTW, HTML attribute values should be inside quotes... <p id="para1" class="b">.

    Update: Para 1 extended to note that OP was on the right track with the method in the last para of his last previous post.

      Thanks for the detailed reply :-)

      Although I can follow what you are doing in your code example, I'm a bit confused about one thing. You seem to be processing the @data array line by line, whereas I'm reading my data into a single string variable (slurping, as recommended by the forum). Please correct me if I've misunderstood :-)

      Where you have:

      my @data = (
          '<p id=paragraph_1>',
          '<a href="http://www.link1.com">Link1</a>',
          '<a href="http://www.link2.com">Link2</a>',
          '<a href="http://www.link3.com">Link3</a>',
          '</p>',
          '<p id=paragraph_2>',
          '<a href="http://www.link4.com">Link4</a>',
          '<a href="http://www.link5.com">Link5</a>',
          '<a href="http://www.link6.com">Link6</a>',
          '</p>',
      );

      I seem to have:

      my $myData = <WEB_DATA>;

      While googling, I came across the pos() function, which is used to find the offset or position of the last matched substring. Maybe this could help me when slurping files?
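
      As a rough illustration of how pos() might fit in with a slurped string: after a successful m//g match in scalar context, pos() reports the offset just past the end of that match, and the next m//g on the same string resumes from there. The sketch below is one possible use under stated assumptions, not the only way to do it: the sample data is inlined rather than read from WEB_DATA, $end is an illustrative name, and the link regexp is assumed.

      ```perl
      use strict;
      use warnings;

      # Sample data joined into one string, standing in for the slurped file.
      my $myData = join "\n",
          '<p id=paragraph_1>',
          '<a href="http://www.link1.com">Link1</a>',
          '<a href="http://www.link2.com">Link2</a>',
          '<a href="http://www.link3.com">Link3</a>',
          '</p>',
          '<p id=paragraph_2>',
          '<a href="http://www.link4.com">Link4</a>',
          '<a href="http://www.link5.com">Link5</a>',
          '<a href="http://www.link6.com">Link6</a>',
          '</p>';

      my @linkName;

      # A successful m//g in scalar context leaves pos($myData) just after
      # the opening tag of paragraph_2.
      if ( $myData =~ m{<p id=paragraph_2>}g ) {

          # Offset of the closing tag that follows; fall back to end of string.
          my $end = index( $myData, '</p>', pos($myData) );
          $end = length $myData if $end < 0;

          # This m//g resumes from pos($myData), so earlier links are never seen.
          while ( $myData =~ m{<a\s[^>]*>([^<]*)</a>}g ) {
              last if $-[0] >= $end;    # match starts beyond the paragraph
              push @linkName, $1;
          }

          # Reset the search position so later matches start from the beginning.
          pos($myData) = undef;
      }

      print "$_\n" for @linkName;
      ```

      Here $-[0] is the start offset of the most recent successful match, which is what lets the loop stop once a link lies beyond the closing </p>.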