meraxes has asked for the wisdom of the Perl Monks concerning the following question:
This humble one begs wisdom of the monks.
This is a text processing problem that I'm having trouble with. Apologies for being long-winded but I wanted to make sure I was very precise.
I have a flat file coded in Folio 3 format. Folio basically takes a specially coded text file (typesetter's coding crossed with HTML and some index fields) and compiles it into an infobase, which is a hierarchically organised and indexed "book" that also allows hyperlinks to bounce from one section to another. One line in the text file = one record in the infobase = one paragraph. This isn't really important, but if someone knows Folio they may have a little more insight into it. Basically, it's a marked-up text file. Most of my stuff deals with law.
My "book" is composed mostly of legislation. Now, my problem is in the linking of sections with statutes (acts, bills, whatever you call them, I don't know all the legal terms and it depends on what country you live in). I have legislation "linked" like this within the paragraphs:
What I need to do is put a link directly to the section as well as the statute. So, after linking, the above text would look like so:
That in itself isn't hard to do, but there are many variations.
The section number isn't necessarily just a number and is best represented by the regex \d+(\.\d+)?(\([^\)]+\))*. So things like 3, 3.1, 3.2(1), 3(1)(b) and 3.5(1)(a)(iv) are all valid.
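To make the pattern concrete, here is a small sketch that checks the section numbers listed above against it. The pattern is the one from this post, rewritten with non-capturing groups; the helper name is_section and the two rejected strings ('3.' and '(1)') are only illustrations and are not part of the original data.

```perl
use strict;
use warnings;

# Section-number pattern from the post, with non-capturing groups.
my $section = qr/\d+(?:\.\d+)?(?:\([^)]+\))*/;

# Hypothetical helper: true only if the whole string is a section number.
sub is_section {
    my ($candidate) = @_;
    return $candidate =~ /\A$section\z/ ? 1 : 0;
}

for my $candidate ('3', '3.1', '3.2(1)', '3(1)(b)', '3.5(1)(a)(iv)', '3.', '(1)') {
    printf "%-15s %s\n", $candidate, is_section($candidate) ? 'valid' : 'invalid';
}
```

Anchoring with \A and \z matters here: without the anchors, a string such as '3.' would still "match" because the pattern finds the leading 3.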
The section is followed by the word "of" and then possibly "the" which is followed by <JL:ref2,\"[^\"]+\">. Now here's a brute force regex I have to convert it:
my $line = "... can be found in s. 3 of the <JL:ref2,\"Interpretation Act\">Interpretation Act</JL>. As Wilson J. clearly states...";
print "Before: $line\n";
$line =~ s/ s\. (\d+(\.\d+)?(\([^\)]+\))*) of( the)? (<JL:ref2,\"([^\">]+)\">)/ <JL:ref2,\"$6 $1\">s\. $1<\/JL> of$4 $5/g;
print " After: $line\n";
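The same substitution may be easier to maintain when it is written with the /x modifier and wrapped in a sub. This is only a sketch: link_section is a made-up name, and the link target format ("statute name, space, section number") is taken from the regex above rather than from any Folio documentation.

```perl
use strict;
use warnings;

# Section-number pattern (non-capturing groups keep the capture numbering simple).
my $section = qr/\d+(?:\.\d+)?(?:\([^)]+\))*/;

# Hypothetical helper: links a single "s. N of [the] <JL:ref2,...>" reference.
sub link_section {
    my ($line) = @_;
    $line =~ s{
        [ ]s\.[ ]($section)       # capture 1: the section number
        [ ]of([ ]the)?[ ]         # capture 2: optional " the"
        (<JL:ref2,"([^">]+)">)    # capture 3: statute tag, capture 4: statute name
    }{
        my $the = defined $2 ? $2 : '';   # avoid a warning when "the" is absent
        qq{ <JL:ref2,"$4 $1">s. $1</JL> of$the $3};
    }gex;
    return $line;
}

print link_section(
    '... can be found in s. 3 of the <JL:ref2,"Interpretation Act">Interpretation Act</JL>.'
), "\n";
```

With the sample line this should produce a link to "Interpretation Act 3" around "s. 3" while leaving the existing statute link untouched.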
It's trivial to modify this for lines that read like this (areas to link are in bold):
What's not trivial is something like this:
Is there a regex I can use that will allow me to individually link all of the numbers when I don't know how many there will be in the list? I don't believe so. Sure, I can extract the whole text fragment easily enough, but I need to link each number uniquely. I may also need to do it several times within one paragraph in a single pass. I had also thought of something like while ($line =~ /REGEX/g) but I can't think of how to apply it. Extract the whole section of text and break it up with a split on the commas? I'm sure there's a simple way to do this but I'm awfully brain fried on this one. Looooong day.
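One possible way to attack the list case is a two-level substitution: the outer s///e grabs the whole list plus the statute tag, and an inner s///g links each number. This is a sketch under assumptions, not a tested solution for real data: the list format ("ss. 3, 4(1) and 5.1 of the <JL:ref2,...>"), the separator pattern $sep and the name link_section_list are all invented for illustration.

```perl
use strict;
use warnings;

my $section = qr/\d+(?:\.\d+)?(?:\([^)]+\))*/;
# ASSUMED separators between list items: "," or ", and" or " and ".
my $sep = qr/(?:,\s*(?:and\s+)?|\s+and\s+)/;

# Hypothetical helper: links every number in "ss. A, B and C of [the] <JL:ref2,...>".
sub link_section_list {
    my ($line) = @_;
    $line =~ s{
        \bss\.[ ]
        ($section(?:$sep$section)*)   # capture 1: the whole list of numbers
        ([ ]of(?:[ ]the)?[ ])         # capture 2: " of " or " of the "
        (<JL:ref2,"([^">]+)">)        # capture 3: statute tag, capture 4: statute name
    }{
        # Copy the captures first: the inner substitution resets the capture variables.
        my ($list, $of, $tag, $act) = ($1, $2, $3, $4);
        $list =~ s{($section)}{<JL:ref2,"$act $1">$1</JL>}g;
        "ss. $list$of$tag";
    }gex;
    return $line;
}

print link_section_list(
    'see ss. 3, 4(1) and 5.1 of the <JL:ref2,"Interpretation Act">Interpretation Act</JL>.'
), "\n";
```

Because the outer substitution uses /g, several lists in one paragraph are handled in a single pass, and no split on commas is needed since the inner pattern only ever touches the numbers.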
Any help is greatly appreciated.
Re: Text Processing - Constructing Hyperlinks
by gav^ (Curate) on May 01, 2002 at 23:26 UTC
by meraxes (Friar) on May 02, 2002 at 13:04 UTC