ag4ve has asked for the wisdom of the Perl Monks concerning the following question:

I had help creating a regex to get data out of some text documents (please don't comment about parsing html with regex - the resources aren't really html). The regex I have that works for 90% of the documents is:

($match{PAGE}, $match{DESC}) = $text =~ /\[Pages? ([\d-]+).*\n\n(.*?)\n\nAGENCY/s;

However, a few documents don't have an 'AGENCY' section. For instance, http://www.gpo.gov/fdsys/pkg/FR-2012-02-03/html/2012-2363.htm fails this regex match for this reason. What I want from this document is:

Commercial Leasing for Wind Power Development on the Outer Continental Shelf (OCS) Offshore Virginia--Call for Information and Nominations

I can't depend on \[.*\]\n+ - there isn't always a 'Docket' or other number above. I've tried:

/\[Pages? ([\d-]+).*\n\n(.*?)\n\n(?:AGENCY|ACTION)/s

/\[Pages? ([\d-]+).*\n\n(.*?)\n\n[A-Z]:/s

I've tried splitting this up into two regexes and so on. For more examples: http://www.gpo.gov/fdsys/browse/collection.action?collectionCode=FR. Also, I'm not averse to other ways of getting the data I want besides searching for the section that comes after the paragraph I'm interested in.

Re: regex help
by locked_user sundialsvc4 (Abbot) on Jun 28, 2012 at 15:46 UTC

    I’m not sure what your surrounding code looks like, but two thoughts come to mind. First, if the regex that works 90% of the time fails, can you not simply follow it, in an if..elsif structure, with others that might match, until one of them hits? Second, if you are extracting lots of stuff from a well-structured doc, you might be able to use a parser, such as Parse::RecDescent, to describe the surrounding context from which you want to extract information.
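    A minimal sketch of the if..elsif idea, assuming that 'ACTION' is what follows the title when there is no 'AGENCY' section (that fallback is a guess, as are the sub name extract_page_desc and the sample text, which is made up rather than taken from a real document):

```perl
use strict;
use warnings;

# Try the original AGENCY pattern first, then fall back to ACTION.
# The ACTION fallback is an assumption about documents lacking AGENCY.
sub extract_page_desc {
    my ($text) = @_;
    my %match;
    if ( $text =~ /\[Pages? ([\d-]+).*\n\n(.*?)\n\nAGENCY/s ) {
        @match{qw(PAGE DESC)} = ( $1, $2 );
    }
    elsif ( $text =~ /\[Pages? ([\d-]+).*\n\n(.*?)\n\nACTION/s ) {
        @match{qw(PAGE DESC)} = ( $1, $2 );
    }
    return %match;    # empty list if neither pattern matched
}

# Hypothetical sample text with no AGENCY section.
my $sample = "[Pages 100-101]\n[Docket No. X-1]\n\nExample Title\n\nACTION: Notice.\n";
my %m = extract_page_desc($sample);
print "$m{PAGE}: $m{DESC}\n" if %m;
```

    More branches can be appended in the same way; the order matters, since the first pattern that matches wins.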

Re: regex help
by aaron_baugher (Curate) on Jun 28, 2012 at 13:22 UTC

    The first page you linked to doesn't appear to contain the text you want from it, at least when I visited it just now. Could you give us another example?

    Aaron B.
    Available for small or large Perl jobs; see my home node.

      doh, I messed that up. Per the link http://www.gpo.gov/fdsys/pkg/FR-2012-02-03/html/2012-2363.htm the text I want is:

      Information Collection Activities: Oil, Gas, and Sulphur Operations in the Outer Continental Shelf, Subpart A, General; Submitted for Office of Management and Budget (OMB) Review; Comment Request

      I'll go through and find others though...

        To make sure I've presented what I'm trying to capture, the below were the first three rules of this year. The last example is not preceded by a docket.

        http://www.gpo.gov/fdsys/pkg/FR-2012-01-03/html/2011-33692.htm

        Cooperative Conservation Partnership Initiative and Wetlands Reserve Enhancement Program

        http://www.gpo.gov/fdsys/pkg/FR-2012-01-03/html/2011-33662.htm

         Privacy Act of 1974; System of Records

        http://www.gpo.gov/fdsys/pkg/FR-2012-01-03/html/2011-33656.htm

        Revision to the Notice for the Great Lakes and Mississippi River Interbasin Study (GLMRIS) Regarding Public Conference Calls Scheduled for January 10 and February 8, 2012

      ya know, i think you have some good points in that Re^4. i think the main one that i'm taking away is to split() things up into some form that allows me to loop and pick things apart more easily - so i could split /\n{2,}|\[\S+\]/ (or some such), which should give me reasonable sections, and then deal with it from there. i'm sure this is slower, but my bottleneck is the http requests anyway (and i'm sure the gpo doesn't want me to thread this process).

      thanks for the help.

      i'll probably put this on github soon enough. however, what i went with for this was:
      my @split = map { s/^\[|\]$//m; $_ } split /\]\n[\S ]*\[|\n{2,}/, $text;
      my %match;
      for my $i (0 .. $#split) {
          my ($sec, $desc);
          ($match{PAGE}) = $split[$i] =~ /Pages? ([\d-]+)/
              if (!defined($match{PAGE}));
          ($sec, $desc) = $split[$i] =~ m/^([A-Z]+):\s*(.*)/msg;
          $match{$sec} = $desc if ($sec and $desc);
          if (($match{ACTION} or $match{AGENCY}) and !$match{DESC}) {
              $match{DESC} = $split[$i - 1];
          }
      }
      there are other 'pages' so i might set a counter and after the first match do $match{PAGE . $count} = $split[$i] ... if i end up caring - i don't think i do.
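      a rough sketch of that counter idea, in case it ends up mattering. the sections here are made up and stand in for the real @split, and the key naming (PAGE, then PAGE1, PAGE2, ...) is just one assumed way to do it:

```perl
use strict;
use warnings;

# Hypothetical sections standing in for the @split array built from a document.
my @split = (
    "Page 100-101",
    "Example Title",
    "Pages 205-206",
);

my %match;
my $count = 0;
for my $section (@split) {
    next unless $section =~ /Pages? ([\d-]+)/;
    # The first hit keeps the plain PAGE key; later hits become PAGE1, PAGE2, ...
    my $key = $count ? "PAGE$count" : 'PAGE';
    $match{$key} = $1;
    $count++;
}

print "$_ => $match{$_}\n" for sort keys %match;
```

      pushing the page ranges onto an array would probably be simpler than numbered keys, but this keeps everything in %match.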