Well, if your text varies too much, it can be difficult to find a pattern that matches every possibility. But looking at those four examples, here's what I see:

  1. The stuff you want is always between two lines of hyphens/underscores. There may be other such lines on the page, but these are the first two. I also discovered after some trial and error that these lines may start with a null byte.
  2. It may be preceded by other paragraphs you don't want, separated by blank lines.
  3. It may be followed by one or more paragraphs, again separated by blank lines, but always beginning with an ALL-CAPS word and a colon.

Based on that information, I came up with this, after saving those four files in a subdirectory:

    #!/usr/bin/env perl
    use Modern::Perl;

    my $hr = qr{\n.[-_]+\n};   # close enough for (literally) government work

    sub get_rule {
        my $text = shift;
        my $keep;
        $text =~ s/.*?$hr//s;                    # cut away to first line
        $text =~ s/$hr.+//s;                     # cut away after second line
        for my $p (split /\n\n+/, $text) {       # loop through paragraphs
            $keep = $p unless $p =~ /^[A-Z]+:/;  # keep this one unless it matches
        }
        return $keep;                            # last paragraph kept (undef if none)
    }

    for (qw(2011-33656.htm 2011-33662.htm 2011-33692.htm 2012-2363.htm)) {
        say;                                     # print the file name
        my $page = `cat f/$_`;                   # files were saved in subdirectory f/
        if (my $keep = get_rule($page)) {
            say $keep;
        } else {
            say "Unable to find a match in $_";
        }
        say '------------------------------------------';
    }

A couple of notes: Obviously, you may run into pages that don't fit the pattern I found in these four examples, and have to adjust accordingly. That's the fun of parsing text that isn't produced according to a consistent format. (I wrote a script to parse bulletins from local auctioneers, and it was excruciating, because most seem to use FrontPage or something worse, and every one is laid out a little differently, even from the same auctioneer.) For instance, I matched on any single character for the possible null byte that showed up in one file, which was lazy of me. On a line with no null byte, that single character just eats the first hyphen or underscore, so it still matches, but it would also accept any other stray character there. You may have to make that more or less precise as you test it on other files.
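If the lazy "any single character" match turns out to be too loose, here is one way it might be tightened. This is only a sketch: it assumes the stray byte is always a NUL and never anything else, which I have not checked beyond these four files. The names $hr_strict and has_rule_line are made up for the example.

```perl
use strict;
use warnings;

# Assumption: the only stray byte that ever precedes a rule line is a NUL (\0),
# and it is optional. Adjust if other files show something different.
my $hr_strict = qr{\n\0?[-_]+\n};

# Hypothetical helper: true if the text contains a rule line under that assumption.
sub has_rule_line {
    my ($text) = @_;
    return $text =~ $hr_strict ? 1 : 0;
}
```

With this version a line like "x----" no longer counts as a rule line, while plain hyphen/underscore lines and ones with a leading NUL still do.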

Also, in this case, I chose to trim away what I didn't want, rather than use a regex with capturing parentheses to capture what I did want. You could certainly do it the other way, but sometimes one appeals to me more than the other for some reason. When I'm going to follow up with more parsing, I find that I prefer to trim my sample as soon and as much as possible.
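For comparison, the capturing approach might look something like the following. Treat it as a rough sketch rather than a tested replacement: get_rule_by_capture is a name invented here, and the rule-line pattern assumes an optional NUL byte instead of any single character.

```perl
use strict;
use warnings;

# Hypothetical capture-based variant: grab the text between the first two
# rule lines, then return the last paragraph that does not start with an
# ALL-CAPS word and a colon.
sub get_rule_by_capture {
    my ($text) = @_;
    my $hr = qr{\n\0?[-_]+\n};    # assumption: optional NUL, then hyphens/underscores

    # Non-greedy capture between the first rule line and the next one.
    my ($body) = $text =~ /$hr(.*?)$hr/s or return;

    # Drop paragraphs like "NOTE: ..." and keep the last one remaining.
    my @wanted = grep { !/^[A-Z]+:/ } split /\n\n+/, $body;
    return $wanted[-1];           # undef if nothing survived
}
```

Calling get_rule_by_capture($page) in place of get_rule($page) in the loop should behave much the same on well-formed pages. One difference: it only needs the second rule line to exist, not for anything to follow it.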

Aaron B.
Available for small or large Perl jobs; see my home node.


In reply to Re^4: regex help by aaron_baugher
in thread regex help by ag4ve
