Re^2: regex help

Re^3: regex help by ag4ve (Monk) on Jun 28, 2012 at 13:46 UTC
To make sure I've presented what I'm trying to capture, the below were the first three rules of this year. The last example is not preceded by a docket.

http://www.gpo.gov/fdsys/pkg/FR-2012-01-03/html/2011-33692.htm
`Cooperative Conservation Partnership Initiative and Wetlands Reserve Enhancement Program`

http://www.gpo.gov/fdsys/pkg/FR-2012-01-03/html/2011-33662.htm
`Privacy Act of 1974; System of Records`

http://www.gpo.gov/fdsys/pkg/FR-2012-01-03/html/2011-33656.htm
`Revision to the Notice for the Great Lakes and Mississippi River Interbasin Study (GLMRIS) Regarding Public Conference Calls Scheduled for January 10 and February 8, 2012`
Re^4: regex help by aaron_baugher (Curate) on Jun 28, 2012 at 14:23 UTC
Well, if your text varies too much, it can be difficult to find a pattern that matches every possibility. But looking at those four examples, here's what I see:

- The stuff you want is always between two lines of hyphens/underscores. There may be other such lines on the page, but these are the first two. I also discovered after some trial and error that these lines may start with a null byte.
- It may be preceded by other paragraphs you don't want, separated by blank lines.
- It may be followed by one or more paragraphs, again separated by blank lines, but always beginning with an ALL-CAPS word and a colon.

Based on that information, I came up with this, after saving those four files in a subdirectory:

    #!/usr/bin/env perl
    use Modern::Perl;

    # A line of hyphens/underscores, possibly led by one stray character.
    my $hr = qr{\n.[-_]+\n};   # close enough for (literally) government work

    sub get_rule {
        my $text = shift;
        my $keep;
        $text =~ s/.*?$hr//s;                    # cut away to first line
        $text =~ s/$hr.+//s;                     # cut away after second line
        for my $p (split /\n\n+/, $text) {       # loop through paragraphs
            $keep = $p unless $p =~ /^[A-Z]+:/;  # keep this one unless it matches
        }
        return $keep;                            # last paragraph that was kept
    }

    for (qw(2011-33656.htm 2011-33662.htm 2011-33692.htm 2012-2363.htm)) {
        say;
        my $page = `cat f/$_`;
        if (my $keep = get_rule($page)) {
            say $keep;
        } else {
            say "Unable to find a match in $_";
        }
        say '------------------------------------------';
    }

A couple notes:

Obviously, you may run into pages that don't fit the pattern I found in these four examples, and have to adjust accordingly. That's the fun of parsing text that isn't produced according to a consistent format. (I wrote a script to parse bulletins from local auctioneers, and it was excruciating, because most seem to use FrontPage or something worse, and every one is laid out a little differently, even from the same auctioneer.) For instance, I matched on any single character for the possible null byte that showed up in one file, which was lazy of me.
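If the any-character shortcut turns out to be too loose, one way to tighten it is to allow only an optional NUL byte before the hyphens/underscores. This is just a sketch: the name `$hr_strict` is made up here, the sample lines are invented, and it assumes NUL is the only stray byte that ever shows up, which the four files alone can't confirm.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical stricter separator: a line of hyphens/underscores,
# optionally preceded by a single NUL byte.
# Assumption: NUL (\x00) is the only stray leading byte in the real files.
my $hr_strict = qr{\n\x00?[-_]+\n};

# Try it on a few made-up lines, showing non-printable bytes as \xNN.
for my $line ("\n-----\n", "\n\x00_____\n", "\nX-----\n") {
    (my $shown = $line) =~ s/([^ -~])/sprintf '\\x%02x', ord $1/ge;
    say "$shown: ", ($line =~ $hr_strict ? 'separator' : 'no match');
}
```

Swapping this in for `$hr` would make a line like `X-----` stop counting as a separator, which may or may not be what the real pages need.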
You may have to make that more or less precise as you test it on other files.

Also, in this case, I chose to trim away what I didn't want, rather than use a regex with capturing parentheses to capture what I did want. You could certainly do it the other way, but sometimes one appeals to me more than the other for some reason. When I'm going to follow up with more parsing, I find that I prefer to trim my sample as soon and as much as possible.

Aaron B.
Available for small or large Perl jobs; see my home node.
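For comparison, the capturing-parentheses route mentioned above might look roughly like this. It is a sketch under the same observations (target text between the first two separator lines, trailing paragraphs starting with an ALL-CAPS word and a colon), not something run against the real pages. The sub name `capture_rule` and the sample text are invented for illustration.

```perl
use strict;
use warnings;
use feature 'say';

# Same separator idea as in the post: a line of hyphens/underscores,
# possibly led by one stray character.
my $hr = qr{\n.[-_]+\n};

# Hypothetical helper: capture everything between the first two separator
# lines, then return the last paragraph that does not start "ALL-CAPS:".
sub capture_rule {
    my ($text) = @_;
    my ($block) = $text =~ /$hr(.*?)$hr/s or return;   # no two separators
    my $keep;
    for my $p (split /\n{2,}/, $block) {
        $p =~ s/^\s+|\s+$//g;                  # trim stray whitespace
        next if $p eq '' or $p =~ /^[A-Z]+:/;  # skip empties and AGENCY:, ACTION:, ...
        $keep = $p;
    }
    return $keep;
}

# Made-up page text shaped like the description above.
my $sample = join "\n",
    'Federal Register', '',
    '-----------', '',
    'DOCKET 123', '',
    'Privacy Act of 1974; System of Records', '',
    'AGENCY: Example Agency.', '',
    'ACTION: Notice.', '',
    '___________', '',
    'footer', '';

my $rule = capture_rule($sample);
say defined $rule ? $rule : 'Unable to find a match';
```

The trade-off is about the same either way: trimming leaves a shrinking `$text` ready for further parsing, while capturing leaves the original page untouched.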