danj35 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I've tried asking this before, but have yet to have a reply that works. Have read so much documentation on regular expressions now that I think I'm going crazy! Seems like a simple problem to solve to me, so I'll try and be as clear as possible. Here goes:

I have taken a webpage to a variable and I need to extract various paragraphs of text from it. I have a working line of code that extracts text from the following article:

A webpage.

===Comments===

This webpage contains information bla bla bla

=Section 2=

Some more text here.

===Comments===

Some other comments here.

=Another section=

=Aditional Notes=

More notes here.

The code that I currently have extracts all the info between "===Comments===" and "=Section 2=", like so:

if(Dumper($page) =~ /===Comments===(.*?)=Section 2=/s ) { $lit_comments = $1; }

What I can't seem to do now is extract the next block of text below the second comments box between "===Comments===" and "=Another Section=", as the start tag is already found earlier in the article.

As a secondary point. I also need to extract all the text after the "=Aditional Notes=" section. The problem is I do not know what the end tag for this will be, as it will be the last word used here (i.e. the last character in the webpage).

I hope this is clear. Any help would be great. Cheers!

Replies are listed 'Best First'.
Re: Extracting Text Using Regular Expressions Problem
by kennethk (Abbot) on May 07, 2010 at 15:12 UTC
    As a meta-answer to your question, if you keep posting questions and don't get working answers, it's quite possible you are asking the wrong questions. A read through "I know what I mean. Why don't you?", "XY Problem" and "On Asking Questions of Bears" might do you well. As well, if code provided doesn't work, make sure you are giving us good examples of input and output. It's also possible that regular expressions are not the right tool for the task you are trying to accomplish.

    As well, as you've posted on this issue before, it's generally considered good form to keep it in a thread or at least cite your previous postings on the issue (I'm guessing Search for Second Occurence of Substing and get containing text and Searching string for paragraph..).

    For your actual question, your specification leaves something to be desired. I will read it as "Extract all text following '===Comments===' until the next '=' or end of file and then extract all text following '=Aditional Notes=' until the next '=' or end of file". Note the misspelling of Aditional<sic>. The following will take your posted material and capture the strings in question into arrays. If this does not work for your actual text, post that case so we can have accurate input for test cases.

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $text = do {
            local $/;    #slurp
            <DATA>;
        };

        my @comments   = $text =~ /(?<====Comments===).*?(?==|$)/gs;
        my @additional = $text =~ /(?<==Aditional Notes=).*?(?==|$)/gs;

        1;

        __DATA__
        ===Comments===
        This webpage contains information bla bla bla
        =Section 2=
        Some more text here.
        ===Comments===
        Some other comments here.
        =Another section=
        =Aditional Notes=
        More notes here.
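    As a rough illustration of what the two arrays above would hold, here is a minimal sketch that runs the same two patterns against an inline copy of the posted sample and prints the results. The inline `$text`, the whitespace trimming and the output labels are assumptions added for the demonstration and are not part of the reply's code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Inline copy of the sample from the question (assumed layout: one item per line).
my $text = join "\n",
    'A webpage.',
    '===Comments===',
    'This webpage contains information bla bla bla',
    '=Section 2=',
    'Some more text here.',
    '===Comments===',
    'Some other comments here.',
    '=Another section=',
    '=Aditional Notes=',
    'More notes here.';

# In list context, /g returns every match rather than only the first one.
my @comments   = $text =~ /(?<====Comments===).*?(?==|$)/gs;
my @additional = $text =~ /(?<==Aditional Notes=).*?(?==|$)/gs;

# The matches keep the newlines around each block; trim them (an added step).
s/\A\s+|\s+\z//g for @comments, @additional;

print "Comment $_: $comments[$_ - 1]\n" for 1 .. @comments;
print "Additional notes: $additional[0]\n";
```

    Note that both patterns stop at the first '=' character they meet, so a block whose text itself contains '=' would be cut short.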
Re: Extracting Text Using Regular Expressions Problem
by JavaFan (Canon) on May 07, 2010 at 15:12 UTC
    What I can't seem to do now is extract the next block of text below the second comments box between "===Comments===" and "=Another Section=", as the start tag is already found earlier in the article.
    First of all, your sample text doesn't contain "=Another Section=". It does contain "=Another section=", but for the (non-folding) regexp engine, 's' is as different from 'S' as '!' is.

    Second, I'm not sure whether I spot what your problem is here. Could you elaborate?

    As a secondary point. I also need to extract all the text after the "=Aditional Notes=" section. The problem is I do not know what the end tag for this will be, as it will be the last word used here (i.e. the last character in the webpage).
    Is there an end tag? That is, you need to match everything up to, but not including, the terminating character? /(?s:.)/ matches any character, so you could use /(?s:.)$/ as your "end tag".

    Of course, if you just want to match everything after "=Aditional Notes=", then just use /=Aditional Notes=(?s:.*)/.
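    To make that last suggestion concrete, here is a minimal sketch that wraps the `(?s:.*)` part in a capture group so the trailing text lands in `$1`. The sample `$page` string and the trimming step are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the tail of the fetched page.
my $page = "=Another section=\n\n=Aditional Notes=\n\nMore notes here.\nSecond line of notes.\n";

my $notes;
if ( $page =~ /=Aditional Notes=((?s:.*))/ ) {
    $notes = $1;                   # everything after the heading, newlines included
    $notes =~ s/\A\s+|\s+\z//g;    # trim surrounding whitespace (an added step)
}

print defined $notes ? "Notes: [$notes]\n" : "No notes section found\n";
```

    Because `.*` is greedy and nothing follows it in the pattern, no end tag is needed: the capture simply runs to the end of the string.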

Re: Extracting Text Using Regular Expressions Problem
by Marshall (Canon) on May 07, 2010 at 15:56 UTC
    Update: it appears that I didn't understand all the requirements when I wrote this code, but hopefully it will help you in your endeavor. This shows how to get all of the comment blocks. From what I understand, there is a single =Additional Notes= section at the very end. Make a second regex along the same lines as the one below to get that section; since it is the very last section, a terminator is not needed, e.g.:
        m/[=]+Additional Notes[=]+.*?\n(.*)/s; #this (.*) will get all
                                               #to end of the $page
                                               #see below, ending [=]+ and /g is not
                                               #needed for this job

        #!/usr/bin/perl -w
        use strict;

        open (IN , '<', "awebpage.txt") or die;
        my @page = <IN>;            #this is like a "slurp" into a scalar
        my $page = join('',@page);  #with undef record separator

        my @comments = $page =~ m/[=]+Comments[=]+.*?\n(.*?)[=]+/gs;
        my $count =1;
        foreach (@comments)
        {
            print "COMMENT #$count is:\n$_";
            $count++;
        }

        =file awebpage.txt is:

        A webpage.
        ===Comments===
        This webpage contains information bla bla bla
        =Section 2=
        Some more text here.
        whatever
        ===Comments===
        Some other comments here.
        =Another section=
        =Aditional Notes=
        =Comments=
        some more comments and notes here
        =Notes=
        More notes here.

        =cut

        =pod

        ****prints:****

        COMMENT #1 is:
        This webpage contains information bla bla bla
        COMMENT #2 is:
        Some other comments here.
        COMMENT #3 is:
        some more comments and notes here

        =cut

      Thanks. That works perfectly. Glad to have put this problem to bed!

Re: Extracting Text Using Regular Expressions Problem
by ww (Archbishop) on May 07, 2010 at 20:21 UTC
    And perhaps your questions would benefit from formatting the data (the supposed content of the webpage) as well. See Markup in the Monastery.

    I'm also more than a bit suspicious of your data: Is the webpage from which you have "taken" the data a wiki or somesuch?

    Or, assuming your data is accurately represented above, using...

    =~ /(?:===Comments===(.*?)=Section \d=)|(?:===Comments===(.*?)=Another Section=)/sg

    might work for multiple comment sections. Sorry, I hate to post untested code, but this is, due to press of time.
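    For comparison, here is a minimal sketch of a different technique from the alternation above: a `while` loop with `/g` in scalar context, where a lookahead stops at the next line that begins with '=' so that no heading name has to be spelled out. The inline `$page` copies the sample from the question; the variable names and the assumption that every heading starts at the beginning of a line are mine.

```perl
use strict;
use warnings;

# Sample laid out as in the question (assumed: headings start a line).
my $page = <<'END_PAGE';
A webpage.

===Comments===

This webpage contains information bla bla bla

=Section 2=

Some more text here.

===Comments===

Some other comments here.

=Another section=

=Aditional Notes=

More notes here.
END_PAGE

my @blocks;

# In scalar context, /g resumes where the previous match ended.
# The lookahead ends each block at the next line starting with '='
# or at the end of the string, whichever comes first.
while ( $page =~ /^===Comments===\s*\n(.*?)(?=^=|\z)/msg ) {
    my $block = $1;
    $block =~ s/\s+\z//;    # drop trailing blank lines
    push @blocks, $block;
}

printf "Block %d: %s\n", $_ + 1, $blocks[$_] for 0 .. $#blocks;
```

    This sidesteps the "=Another Section=" versus "=Another section=" capitalization mismatch, at the cost of assuming that no comment text has a line beginning with '='.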

    As to your problem re the end of the page: the last word the server knows about is </html> so the last character will be >, even though that will leave you with some markup tags to remove (possibly "</p></body></html>").
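    If the captured tail does end in closing tags as described, one rough way to remove them is sketched below. The sample `$page` string is hypothetical, and the substitution only handles a plain run of closing tags at the very end; for real HTML a proper parser would likely be more robust than a regex.

```perl
use strict;
use warnings;

# Hypothetical tail of a fetched page; real markup may differ.
my $page = "=Aditional Notes=\nMore notes here.</p></body></html>\n";

# Capture everything after the heading up to the end of the string.
my ($notes) = $page =~ /=Aditional Notes=\s*(.*)\z/s;

# Remove a run of closing tags (and any whitespace) at the very end.
$notes =~ s{(?:\s*</\w+>)+\s*\z}{} if defined $notes;

print "Notes: [$notes]\n" if defined $notes;
```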