Well, if your text varies too much, it can be difficult to find a pattern that matches every possibility. But looking at those four examples, here's what I see:
Based on that information, I came up with this, after saving those four files in a subdirectory:
#!/usr/bin/env perl
use Modern::Perl;

my $hr = qr{\n.[-_]+\n}; # close enough for (literally) government work

sub get_rule {
    my $text = shift;
    my $keep;
    $text =~ s/.*?$hr//s;                     # cut away to first line
    $text =~ s/$hr.+//s;                      # cut away after second line
    for my $p (split /\n\n+/, $text) {        # loop through paragraphs
        $keep = $p unless $p =~ /^[A-Z]+:/;   # keep this one unless it matches
    }
    return $keep;                             # return the last one kept
}

for (qw(2011-33656.htm 2011-33662.htm 2011-33692.htm 2012-2363.htm)) {
    say;
    my $page = `cat f/$_`;
    if (my $keep = get_rule($page)) {
        say $keep;
    } else {
        say "Unable to find a match in $_";
    }
    say '------------------------------------------';
}
A couple notes: Obviously, you may run into pages that don't fit the pattern I found in these four examples, and have to adjust accordingly. That's the fun of parsing text that isn't produced according to a consistent format. (I wrote a script to parse bulletins from local auctioneers, and it was excruciating, because most seem to use FrontPage or something worse, and every one is laid out a little differently, even from the same auctioneer.) For instance, I matched on any single character for the possible null byte that showed up in one file, which was lazy of me. You may have to make that more or less precise as you test it on other files.
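To show what "more precise" might look like, here is a small sketch comparing the loose separator pattern from the script with a stricter one that only allows an optional null byte before the dashes. The name $hr_strict and the sample strings are made up for illustration; they are not taken from the real files, so treat this as a starting point to test against your own data.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Loose version from the script: any single character before the dashes.
my $hr_loose  = qr{\n.[-_]+\n};

# Hypothetical stricter version: only an optional null byte may precede
# the run of dashes/underscores.
my $hr_strict = qr{\n\x00?[-_]+\n};

# Made-up sample strings, not taken from the real pages.
my %samples = (
    'plain dashes'             => "text\n-----\nmore",
    'null byte then dashes'    => "text\n\x00-----\nmore",
    'stray letter then dashes' => "text\nx-----\nmore",
);

for my $name (sort keys %samples) {
    my $s = $samples{$name};
    printf "%-26s loose=%-3s strict=%s\n",
        $name,
        ($s =~ $hr_loose  ? 'yes' : 'no'),
        ($s =~ $hr_strict ? 'yes' : 'no');
}
```

Both patterns accept the first two samples; only the loose one accepts the line that starts with a stray letter. Whether that strictness helps or hurts depends on what the other files actually contain.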
Also, in this case, I chose to trim away what I didn't want, rather than use a regex with capturing parentheses to capture what I did want. You could certainly do it the other way, but sometimes one appeals to me more than the other for some reason. When I'm going to follow up with more parsing, I find that I prefer to trim my sample as soon and as much as possible.
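For comparison, the capturing approach could look roughly like the sketch below. get_rule_by_capture is a hypothetical name and the sample text is invented; the separator pattern and the paragraph test are the same ones used in get_rule.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Same separator pattern as in the script above.
my $hr = qr{\n.[-_]+\n};

# Hypothetical alternative to get_rule(): capture the text between the
# first two separator lines instead of trimming around it.
sub get_rule_by_capture {
    my $text = shift;
    my ($body) = $text =~ /$hr(.+?)$hr/s or return;   # no two separators: give up
    my $keep;
    for my $p (split /\n\n+/, $body) {                # loop through paragraphs
        $keep = $p unless $p =~ /^[A-Z]+:/;           # skip "LABEL:" paragraphs
    }
    return $keep;                                     # last paragraph kept, if any
}

# Invented sample, only to exercise the function.
my $sample = "Header junk\n ----------\nAGENCY: Some agency\n\n"
           . "Final rule.\n\nSUMMARY: blah\n _________\nFooter junk\n";

print get_rule_by_capture($sample), "\n";
```

One difference to be aware of: this version returns nothing unless both separator lines are present, while the trimming version still works with whatever is left if the second separator is missing.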
Aaron B.
Available for small or large Perl jobs; see my home node.
In reply to Re^4: regex help by aaron_baugher in thread regex help by ag4ve