in reply to regex help
The first page you linked to doesn't appear to contain the text you want from it, at least when I visited it just now. Could you give us another example?
Aaron B.
Available for small or large Perl jobs; see my home node.
|
Re^2: regex help
by ag4ve (Monk) on Jun 28, 2012 at 13:36 UTC
doh, I messed that up. Per the link http://www.gpo.gov/fdsys/pkg/FR-2012-02-03/html/2012-2363.htm the text I want is:
I'll go through and find others though...
by ag4ve (Monk) on Jun 28, 2012 at 13:46 UTC
To make sure I've presented what I'm trying to capture, below are the first three rules of this year. The last example is not preceded by a docket.
http://www.gpo.gov/fdsys/pkg/FR-2012-01-03/html/2011-33692.htm
http://www.gpo.gov/fdsys/pkg/FR-2012-01-03/html/2011-33662.htm
Privacy Act of 1974; System of Records
http://www.gpo.gov/fdsys/pkg/FR-2012-01-03/html/2011-33656.htm
by aaron_baugher (Curate) on Jun 28, 2012 at 14:23 UTC
Well, if your text varies too much, it can be difficult to find a pattern that matches every possibility. But looking at those four examples, here's what I see:
Based on that information, I came up with this, after saving those four files in a subdirectory:
A couple notes:
Obviously, you may run into pages that don't fit the pattern I found in these four examples, and have to adjust accordingly. That's the fun of parsing text that isn't produced according to a consistent format. (I wrote a script to parse bulletins from local auctioneers, and it was excruciating, because most seem to use FrontPage or something worse, and every one is laid out a little differently, even from the same auctioneer.) For instance, I matched on any single character for the possible null byte that showed up in one file, which was lazy of me. You may have to make that more or less precise as you test it on other files.
Also, in this case, I chose to trim away what I didn't want, rather than use a regex with capturing parentheses to capture what I did want. You could certainly do it the other way, but sometimes one appeals to me more than the other for some reason. When I'm going to follow up with more parsing, I find that I prefer to trim my sample as soon and as much as possible.
Aaron B.
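For anyone comparing the two approaches mentioned above, here is a minimal sketch of trimming versus capturing. It is not the code from this post: the sample text, the `MARKER:` label, and the sub names `by_trimming` and `by_capturing` are all made up for illustration, and the real pages will need different patterns.

```perl
use strict;
use warnings;

# Hypothetical sample; the real pages are laid out differently.
my $page = "HEADER LINE\n\nMARKER: Text we want.\n\nFOOTER LINE\n";

# Approach 1: trim away what we don't want, leaving the wanted text behind.
sub by_trimming {
    my ($text) = @_;
    $text =~ s/\A.*?^MARKER:\s*//ms;   # drop everything up to and including the marker
    $text =~ s/\n{2,}.*\z//s;          # drop everything from the next blank line onward
    return $text;
}

# Approach 2: capture what we do want with parentheses.
sub by_capturing {
    my ($text) = @_;
    return $text =~ /^MARKER:\s*(.*?)\n{2,}/ms ? $1 : undef;
}

print "trimmed:  ", by_trimming($page),  "\n";
print "captured: ", by_capturing($page), "\n";
```

Both subs return the same string for this sample. Trimming leaves the remaining text in one variable ready for further substitutions, while capturing returns undef on a miss, which makes a failed match easier to detect.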
|
Re^2: regex help
by ag4ve (Monk) on Jun 29, 2012 at 13:17 UTC
Ya know, I think you have some good points in that Re^4. The main one I'm taking away is to split() things up into a form that lets me loop and pick things apart more easily, so I could split on /\n{2,}|\\S+\/ (or some such), which should give me reasonable sections, and then deal with it from there. I'm sure this is slower, but my bottleneck is the HTTP requests anyway (and I'm sure the GPO doesn't want me to thread this process). Thanks for the help.
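A rough sketch of that split-then-loop idea, assuming sections are separated by blank lines. It uses only the `\n{2,}` half of the pattern above; the sample text and the sub name `sections` are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical text standing in for a fetched page body.
my $body = "Docket No. X\n\n\nTitle of the rule\n\nAGENCY: Example\n";

# Split on runs of two or more newlines and drop whitespace-only pieces.
sub sections {
    my ($text) = @_;
    return grep { /\S/ } split /\n{2,}/, $text;
}

my @sections = sections($body);

# Loop over the sections and pick each one apart as needed.
for my $i (0 .. $#sections) {
    print "section $i: $sections[$i]\n";
}
```

The grep guards against empty or whitespace-only sections, for example when the text starts with blank lines.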
|
Re^2: regex help
by ag4ve (Monk) on Jul 02, 2012 at 15:44 UTC
There are other 'pages', so I might set a counter and after the first match do $match{"PAGE$count"} = $split[$i] ... if I end up caring. I don't think I do.
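In case it's useful later, a minimal sketch of the counter idea. The sub name `collect_pages`, the sample sections, and the `[[Page N]]` pattern are assumptions for illustration; the real marker may look different.

```perl
use strict;
use warnings;

# Store each section that matches the page pattern under PAGE1, PAGE2, ...
sub collect_pages {
    my ($pattern, @split) = @_;
    my %match;
    my $count = 0;
    for my $i (0 .. $#split) {
        next unless $split[$i] =~ $pattern;
        $count++;
        $match{"PAGE$count"} = $split[$i];
    }
    return %match;
}

# Hypothetical sections, as produced by an earlier split().
my @split = ("[[Page 1]]\nfirst", "other text", "[[Page 2]]\nsecond");
my %match = collect_pages(qr/^\[\[Page \d+\]\]/, @split);

print "$_ => $match{$_}\n" for sort keys %match;
```

Building the key with string interpolation avoids the bareword PAGE, which `use strict` would reject.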