Re^4: regexp over multiple lines

Re^5: regexp over multiple lines by ww (Archbishop) on Aug 03, 2011 at 22:57 UTC
"I'm going to try to do it the way everyone is recommending..." Good! My arms get terribly tired, beating people over the head. "I'm just not sure I have the ability to do it that way...yet." And when you have a problem -- trying to do it the right way -- why, that's why we're here. If you get stuck on some particular point (and have read the docs, etc.) post some code illustrating where you are, sample data and output, and errors from your code, if any. Helping folk at that point is far more gratifying than beati ^H^H^H^H^H, ~~posting a picket line around their~~ --- uh, applying verbal persuasion.
Re^6: regexp over multiple lines by liverpaul (Acolyte) on Aug 04, 2011 at 11:38 UTC
Well, I've been able to slurp the files and read the data into arrays, and it makes the coding much faster and easier. I no longer have to use anchor points and offset values and other messy stuff. I'm not getting the exact results I want yet, but I'm getting there.

I have another question: Is it possible to limit the scope of the regexp I'm using (i.e. limit the search to an area that I define with another regexp marking out the block of text to search)? For example:

    <p id=paragraph_1>
    <a href="http://www.link1.com">Link1</a>
    <a href="http://www.link2.com">Link2</a>
    <a href="http://www.link3.com">Link3</a>
    </p>
    <p id=paragraph_2>
    <a href="http://www.link4.com">Link4</a>
    <a href="http://www.link5.com">Link5</a>
    <a href="http://www.link6.com">Link6</a>
    </p>

If I use a regexp to parse the names of the above HTML links, I'm going to get all of them. What if I only want the ones within the paragraph_2 tags? How would I do that? Here's a dummy example of code I already have:

    local($/, *WEB_DATA); # sets $/ to undef for you, and when the scope exits it will revert $/ back to its previous value (most likely "\n")
    open (WEB_DATA, "<$myFilename.tmp");
    my $myData = <WEB_DATA>;
    close (WEB_DATA);
    my @linkName = $myData =~ m/regexp for linkName/g;

I'm not sure if I've described it correctly, but what I want is to use a regexp like `/<p id=paragraph_1>.+?<\/p>/` to define where I want to look, and another regexp to define what data I want to parse within this block. I hope that makes sense.
Re^7: regexp over multiple lines by ww (Archbishop) on Aug 04, 2011 at 18:21 UTC
Here's one way (using basically the method you describe in the last para above, short-circuited to skip any `<para id="n">` unless "n" is "2"):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use 5.012;

    # 918508 wants linknames from para 2 only

    my @data = ('<p id=paragraph_1>',
                '<a href="http://www.link1.com">Link1</a>',
                '<a href="http://www.link2.com">Link2</a>',
                '<a href="http://www.link3.com">Link3</a>',
                '</p>',
                '<p id=paragraph_2>',
                '<a href="http://www.link4.com">Link4</a>',
                '<a href="http://www.link5.com">Link5</a>',
                '<a href="http://www.link6.com">Link6</a>',
                '</p>',);

    my (@linkName, $linkName);
    my $flag = 0;

    for my $data(@data) {
        chomp $data;
        if ( $data =~ /<p id=paragraph_2>/ ) {
            # when the above is true, we've found para 2
            $flag = 1;
        }
        if ( $flag && ($data !~ /<p id=paragraph_2>/) ) {
            # now, we want to skip the data --the para heading --
            # the first time we arrive here, but if it's not the heading,
            # then capture the link title
            if ( $data =~ m/<a href=.+>(Link\d)<.+/ ) {
                $linkName = $1;
                push @linkName, $linkName;
            }
        }
    }

    for my $extracted(@linkName) {
        say $extracted;
    }

Output:

    Link4
    Link5
    Link6

Generalization -- to suit specific needs -- is left as an exercise for the OP.

And BTW, HTML attribute values should be inside quotes... `<p id="para1" class="b">`.

Update: Para 1 extended to note that OP was on the right track with the method in the last para of his last previous post.
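For the slurped-string case in the question (the whole file in one scalar rather than an array of lines), the two-regexp idea can be sketched roughly as below. This is only a minimal sketch, not code from the thread: the helper name `links_in_paragraph` and the heredoc standing in for the slurped file are assumptions made for illustration. The first regexp isolates the block, using the `/s` modifier so that `.` can match newlines, and the second runs against the captured block only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the link texts found
# inside the <p> element whose id is $id. Call it in list context.
sub links_in_paragraph {
    my ($html, $id) = @_;

    # Step 1: isolate the block. /s lets "." match newlines, and the
    # non-greedy .*? stops at the first </p> after the opening tag.
    # The optional quotes accept both id=paragraph_2 and id="paragraph_2".
    my ($block) = $html =~ m{<p\s+id="?\Q$id\E"?\s*>(.*?)</p>}s
        or return;

    # Step 2: apply the extraction regexp to the block only.
    # With /g in list context, every captured link text is returned.
    return $block =~ m{<a\s[^>]*>([^<]*)</a>}g;
}

# Stand-in for the slurped file contents (sample data from the question).
my $html = <<'HTML';
<p id=paragraph_1>
<a href="http://www.link1.com">Link1</a>
<a href="http://www.link2.com">Link2</a>
<a href="http://www.link3.com">Link3</a>
</p>
<p id=paragraph_2>
<a href="http://www.link4.com">Link4</a>
<a href="http://www.link5.com">Link5</a>
<a href="http://www.link6.com">Link6</a>
</p>
HTML

my @linkName = links_in_paragraph($html, 'paragraph_2');
print "$_\n" for @linkName;
```

With the sample data this prints Link4, Link5 and Link6, one per line. Because the block match ends at the first `</p>`, links in any later paragraph are left out. The sketch assumes simple, non-nested markup like the sample; messier real-world HTML is usually better handled with an HTML parsing module than with regexps.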
Re^8: regexp over multiple lines by liverpaul (Acolyte) on Aug 05, 2011 at 10:24 UTC