in reply to Re^3: regexp over multiple lines
in thread regexp over multiple lines

The trouble is, I'm not a builder. I'm just teaching myself the building trade as I go along :-)

This is just a one-off project for my website, so the only thing it costs me is time and effort.

I've decided that I'm going to try to do it the way everyone is recommending; I'm just not sure I have the ability to do it that way...yet. You see, the data I need to extract will come from both HTML files and XML files, and I will be trying to design a program that processes both types of input. I'll give it a day or two and see how I get on.

Re^5: regexp over multiple lines
by ww (Archbishop) on Aug 03, 2011 at 22:57 UTC
    "I'm going to try to do it the way everyone is recommending..."
    Good!

    My arms get terribly tired, beating people over the head.

    "I'm just not sure I have the ability to do it that way...yet."

    And when you have a problem -- trying to do it the right way -- why, that's why we're here. If you get stuck on some particular point (and have read the docs, etc.) post some code illustrating where you are, sample data and output, and errors from your code, if any.

    Helping folk at that point is far more gratifying than beati ^H^H^H^H^H, posting a picket line around their --- uh, applying verbal persuasion.

      Well, I've been able to slurp the files and read the data into arrays and it makes the coding much faster and easier. I no longer have to use anchor points and offset values and other messy stuff. I'm not getting the exact results I want yet but I'm getting there.

      I have another question:

      Is it possible to limit the scope of the regexp I'm using (i.e. limit the search to an area I define with another regexp that marks out the block of text to search)? For example:

      <p id=paragraph_1>
      <a href="http://www.link1.com">Link1</a>
      <a href="http://www.link2.com">Link2</a>
      <a href="http://www.link3.com">Link3</a>
      </p>
      <p id=paragraph_2>
      <a href="http://www.link4.com">Link4</a>
      <a href="http://www.link5.com">Link5</a>
      <a href="http://www.link6.com">Link6</a>
      </p>

      If I use a regexp to parse the names of the above html links, I'm going to get all of them. What if I only want the ones within the paragraph_2 tags, how would I do that?

      Here's a dummy example of code I already have:

      local($/, *WEB_DATA); # sets $/ to undef for you and when the scope exits it will
                            # revert $/ back to its previous value (most likely "\n")
      open (WEB_DATA, "<$myFilename.tmp");
      my $myData = <WEB_DATA>;
      close (WEB_DATA);
      my @linkName = $myData =~ m/regexp for linkName/g;

      I'm not sure if I've described it correctly, but what I want is to use a regexp like /<p id=paragraph_1>.+?<\/p>/ to define where I want to look, and another regexp to define what data I want to parse within this block. I hope that makes sense.

        Here's one way (using basically the method you describe in the last para above (++), short-circuited to skip any <para id="n"> unless "n" is "2"):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use 5.012;

        # 918508 wants linknames from para 2 only

        my @data = ('<p id=paragraph_1>',
                    '<a href="http://www.link1.com">Link1</a>',
                    '<a href="http://www.link2.com">Link2</a>',
                    '<a href="http://www.link3.com">Link3</a>',
                    '</p>',
                    '<p id=paragraph_2>',
                    '<a href="http://www.link4.com">Link4</a>',
                    '<a href="http://www.link5.com">Link5</a>',
                    '<a href="http://www.link6.com">Link6</a>',
                    '</p>',);

        my (@linkName, $linkName);
        my $flag = 0;

        for my $data(@data) {
            chomp $data;
            if ( $data =~ /<p id=paragraph_2>/ ) {
                # when the above is true, we've found para 2
                $flag = 1;
            }
            if ( $flag && ($data !~ /<p id=paragraph_2>/) ) {
                # now, we want to skip the data --the para heading --
                # the first time we arrive here, but if it's not the heading,
                # then capture the link title
                if ( $data =~ m/<a href=.+>(Link\d)<.+/ ) {
                    $linkName = $1;
                    push @linkName, $linkName;
                }
            }
        }

        for my $extracted(@linkName) {
            say $extracted;
        }

        Output:

        Link4
        Link5
        Link6
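        For comparison, the two-regexp idea from the question can also be applied directly to the slurped string rather than to an array of lines. The following is only a rough sketch under some assumptions: the here-doc stands in for the file contents, extract_link_names() is a made-up helper name, and the markup is assumed to be as regular as the sample (no nested <p> tags, double-quoted href values). Note the /s modifier on the first match; without it "." will not match a newline, so a pattern like /<p id=paragraph_1>.+?<\/p>/ would fail on multi-line input.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for the slurped file contents ($myData in the question).
my $myData = <<'HTML';
<p id=paragraph_1>
<a href="http://www.link1.com">Link1</a>
<a href="http://www.link2.com">Link2</a>
<a href="http://www.link3.com">Link3</a>
</p>
<p id=paragraph_2>
<a href="http://www.link4.com">Link4</a>
<a href="http://www.link5.com">Link5</a>
<a href="http://www.link6.com">Link6</a>
</p>
HTML

# Return the link texts found inside the paragraph with the given id.
# extract_link_names() is a hypothetical helper, not code from the thread.
sub extract_link_names {
    my ($html, $id) = @_;

    # Stage 1: isolate the block. /s lets "." match newlines, and the
    # non-greedy .*? stops at the first closing </p>.
    my ($block) = $html =~ m{<p\s+id=\Q$id\E>(.*?)</p>}s
        or return;

    # Stage 2: collect the link texts from that block only.
    return $block =~ m{<a\s+href="[^"]*">([^<]*)</a>}g;
}

my @linkName = extract_link_names($myData, 'paragraph_2');
print "$_\n" for @linkName;
```

        With the sample data this should print Link4, Link5 and Link6, one per line. It is still a regexp approach, so it inherits the usual fragility with real-world HTML.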

        Generalization -- to suit specific needs -- is left as an exercise for the OP.

        And BTW, HTML attribute values should be inside quotes... <p id="para1" class="b">.
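        If the input may contain both the quoted and the unquoted form, the pattern that finds the paragraph can be written to accept either. This is a small sketch, not a general attribute parser; the variable name $open_p2 and the test strings are invented for illustration, and named captures need Perl 5.10 or later.

```perl
use strict;
use warnings;

# Accepts <p id=paragraph_2>, <p id="paragraph_2"> and <p id='paragraph_2'>.
# The named backreference \k<q> requires the closing quote to be the same
# character as the opening one (or both to be absent).
my $open_p2 = qr{<p\s+id=(?<q>["']?)paragraph_2\k<q>\s*>};

for my $tag ('<p id=paragraph_2>',
             '<p id="paragraph_2">',
             q{<p id='paragraph_2'>},
             '<p id="paragraph_2>') {    # unbalanced quote: should not match
    printf "%-24s %s\n", $tag, ($tag =~ $open_p2 ? 'matches' : 'no match');
}
```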

        Update: Para 1 extended to note that the OP was on the right track with the method in the last para of his previous post.