Re^6: regexp over multiple lines

Well, I've been able to slurp the files and read the data into arrays and it makes the coding much faster and easier. I no longer have to use anchor points and offset values and other messy stuff. I'm not getting the exact results I want yet but I'm getting there.

I have another question:

Is it possible to limit the scope of the regexp I'm using (i.e. limit the search to an area I define with another regexp that marks out the block of text to search)? For example:

<p id=paragraph_1>
  <a href="http://www.link1.com">Link1</a> 
  <a href="http://www.link2.com">Link2</a> 
  <a href="http://www.link3.com">Link3</a> 
</p>
<p id=paragraph_2>
  <a href="http://www.link4.com">Link4</a> 
  <a href="http://www.link5.com">Link5</a> 
  <a href="http://www.link6.com">Link6</a> 
</p>

If I use a regexp to parse the names of the above html links, I'm going to get all of them. What if I only want the ones within the paragraph_2 tags, how would I do that?

Here's a dummy example of code I already have:

local($/, *WEB_DATA);  # sets $/ to undef for you; when the scope exits
                       # it will revert $/ back to its previous value
                       # (most likely "\n")
open(WEB_DATA, '<', "$myFilename.tmp") or die "Cannot open $myFilename.tmp: $!";
  my $myData = <WEB_DATA>;  # with $/ undefined, this reads the whole file
close(WEB_DATA);
my @linkName = $myData =~ m/regexp for linkName/g;

I'm not sure if I've described it correctly, but what I want is to use a regexp like /<p id=paragraph_1>.+?<\/p>/ to define where I want to look, and another regexp to define what data I want to parse within this block. I hope that makes sense.
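The two-step idea described above could be sketched roughly as follows. This is only an illustration, not tested against real pages: the here-doc stands in for the slurped file, the link pattern assumes every href is double-quoted, and the /s modifier is added so that . can match newlines across the block.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data standing in for the slurped file contents (assumption).
my $myData = <<'HTML';
<p id=paragraph_1>
  <a href="http://www.link1.com">Link1</a>
  <a href="http://www.link2.com">Link2</a>
  <a href="http://www.link3.com">Link3</a>
</p>
<p id=paragraph_2>
  <a href="http://www.link4.com">Link4</a>
  <a href="http://www.link5.com">Link5</a>
  <a href="http://www.link6.com">Link6</a>
</p>
HTML

my @linkName;

# Step 1: capture only the text between the paragraph_2 tags.
# /s lets . match newlines, so .+? can span several lines.
if ( my ($block) = $myData =~ m{<p id=paragraph_2>(.+?)</p>}s ) {
    # Step 2: search for link names inside that block only.
    @linkName = $block =~ m{<a href="[^"]*">([^<]+)</a>}g;
}

print "$_\n" for @linkName;
```

With the sample data this prints Link4, Link5 and Link6, one per line. For anything beyond simple, regular markup, an HTML parser module is likely to be more robust than regexps.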


Re^7: regexp over multiple lines by ww (Archbishop) on Aug 04, 2011 at 18:21 UTC
Here's one way (using basically the method you describe in the last para above (++), short-circuited to skip any `<para id="n">` unless "n" is "2"):

#!/usr/bin/perl
use strict;
use warnings;
use 5.012;
# 918508 wants linknames from para 2 only

my @data = ('<p id=paragraph_1>',
            '<a href="http://www.link1.com">Link1</a>',
            '<a href="http://www.link2.com">Link2</a>',
            '<a href="http://www.link3.com">Link3</a>',
            '</p>',
            '<p id=paragraph_2>',
            '<a href="http://www.link4.com">Link4</a>',
            '<a href="http://www.link5.com">Link5</a>',
            '<a href="http://www.link6.com">Link6</a>',
            '</p>',);
my (@linkName, $linkName);
my $flag = 0;

for my $data(@data) {
    chomp $data;
    if ( $data =~ /<p id=paragraph_2>/ ) {
        # when the above is true, we've found para 2
        $flag = 1;
    }
    if ( $flag && ($data !~ /<p id=paragraph_2>/) ) {
        # now, we want to skip the data --the para heading --
        # the first time we arrive here, but if it's not the heading,
        # then capture the link title
        if ( $data =~ m/<a href=.+>(Link\d)<.+/ ) {
            $linkName = $1;
            push @linkName, $linkName;
        }
    }
}

for my $extracted(@linkName) {
    say $extracted;
}

Output:

Link4
Link5
Link6

Generalization -- to suit specific needs -- is left as an exercise for the OP.

And BTW, html attribute values should be inside quotes... `<p id="para1" class="b">`.

Update: Para 1 extended to note that OP was on the right track with the method in the last para of his last previous post.
Re^8: regexp over multiple lines by liverpaul (Acolyte) on Aug 05, 2011 at 10:24 UTC
Thanks for the detailed reply :-) Although I can follow what you are doing in your code example, I'm a bit confused about one thing. You seem to be processing the @data array line by line, whereas I'm reading my data into a single string variable (slurping, as recommended by the forum). Please correct me if I've misunderstood :-)

Where you have:

my @data = ('<p id=paragraph_1>',
            '<a href="http://www.link1.com">Link1</a>',
            '<a href="http://www.link2.com">Link2</a>',
            '<a href="http://www.link3.com">Link3</a>',
            '</p>',
            '<p id=paragraph_2>',
            '<a href="http://www.link4.com">Link4</a>',
            '<a href="http://www.link5.com">Link5</a>',
            '<a href="http://www.link6.com">Link6</a>',
            '</p>',);

I seem to have:

my $myData = <WEB_DATA>;

While googling, I came across the pos() function, which is used to find the offset or position of the last matched substring. Maybe this could help me when slurping files?
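As a rough sketch of how pos() might come into play on a slurped string: a scalar-context m//g match leaves pos() just past the opening tag, and the \G anchor with the /gc modifiers then continues matching from that offset. The here-doc again stands in for the slurped file, and the loop assumes the links inside the paragraph are separated only by whitespace; both are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data standing in for the slurped file contents (assumption).
my $myData = <<'HTML';
<p id=paragraph_1>
  <a href="http://www.link1.com">Link1</a>
  <a href="http://www.link2.com">Link2</a>
  <a href="http://www.link3.com">Link3</a>
</p>
<p id=paragraph_2>
  <a href="http://www.link4.com">Link4</a>
  <a href="http://www.link5.com">Link5</a>
  <a href="http://www.link6.com">Link6</a>
</p>
HTML

my @linkName;

# A scalar-context //g match sets pos($myData) to the offset
# just after the opening tag of paragraph_2.
if ( $myData =~ m{<p id=paragraph_2>}g ) {
    # \G anchors each attempt at the current pos(); /c keeps pos()
    # unchanged when a match finally fails (here, at the </p> tag).
    while ( $myData =~ m{\G\s*<a href="[^"]*">([^<]+)</a>}gc ) {
        push @linkName, $1;
    }
}

print "$_\n" for @linkName;
```

The loop stops at the first thing that is not a link, so it never runs on into a later paragraph. If other markup can appear between the links, extracting the block first and matching inside it is probably the simpler route.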