This humble one begs wisdom of the monks.

This is a text processing problem that I'm having trouble with. Apologies for being long-winded but I wanted to make sure I was very precise.

The Background

I have a flat file coded in Folio 3 format. Folio basically takes a specially coded text file (typesetter's coding crossed with HTML and some index fields) and compiles it into an infobase, which is a hierarchically organised and indexed "book" that also allows hyperlinks to bounce from one section to another. One line in the text file = one record in the infobase = one paragraph. This isn't really important, but if someone knows Folio they may have a little more insight into it. Basically, it's a marked-up text file. Most of my stuff deals with law.

The Problem

My "book" is composed mostly of legislation. Now, my problem is in the linking of sections with statutes (acts, bills, whatever you call them, I don't know all the legal terms and it depends on what country you live in). I have legislation "linked" like this within the paragraphs:

What I need to do is put a link directly to the section as well as the statute. So, after linking, the above text would look like so:

That in itself isn't hard to do, but there are many variations.

The section number isn't necessarily just a number and is best represented by the regex: \d+(\.\d+)?(\([^\)]+\))*. So things like 3, 3.1, 3.2(1), 3(1)(b) and 3.5(1)(a)(iv) are all valid.
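As a quick sanity check of that pattern, here is a minimal sketch that anchors it and runs it over the examples listed above plus two strings that should fail. The helper name is_section_number and the two negative cases are illustrative assumptions, not part of the original data.

```perl
use strict;
use warnings;

# Section-number pattern from the post, with non-capturing groups.
my $section_re = qr/\d+(?:\.\d+)?(?:\([^\)]+\))*/;

# Hypothetical helper: true when the whole string is a section number.
sub is_section_number {
    my ($candidate) = @_;
    return $candidate =~ /\A$section_re\z/ ? 1 : 0;
}

# The first five come from the post; '3.' and '(1)' are made-up negatives.
for my $candidate ('3', '3.1', '3.2(1)', '3(1)(b)', '3.5(1)(a)(iv)', '3.', '(1)') {
    printf "%-15s %s\n", $candidate,
        is_section_number($candidate) ? 'valid' : 'invalid';
}
```

The \A and \z anchors matter here: without them the pattern would happily match the leading "3" of "3." and report success.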

The section is followed by the word "of" and then possibly "the" which is followed by <JL:ref2,\"[^\"]+\">. Now here's a brute force regex I have to convert it:

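my $line = "... can be found in s. 3 of the <JL:ref2,\"Interpretation +Act\">Interpretation Act</JL>. As Wilson J. clearly states...";
print "Before: $line\n";
# Captures: $1 = section number, $4 = optional " the", $5 = statute tag, $6 = statute name.
# The statute name may contain spaces and "+", so only the quote and ">" are excluded.
$line =~ s/ s\. (\d+(\.\d+)?(\([^\)]+\))*) of((?: the)?) (<JL:ref2,\"([^\">]+)\">)/ <JL:ref2,\"$6 $1\">s. $1<\/JL> of$4 $5/g;
print " After: $line\n";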

It's trivial to modify this for lines that read like this (areas to link are in bold):

What's not trivial is something like this:

Is there a regex I can use that will allow me to individually link all of the numbers when I don't know how many there will be in the list? I don't believe so. Sure, I can extract the whole text fragment easily enough, but I need to link each number uniquely. I may also need to do it several times within one paragraph in a single pass. I had also thought of something like a while($line =~ /REGEX/g) loop, but I can't think of how to apply it. Extract the whole section of text and break it up with a split on the commas? I'm sure there's a simple way to do this but I'm awfully brain fried on this one. Looooong day.
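One way the extract-then-link idea could look is a substitution with the /e modifier: the outer pattern grabs the whole list, and the replacement code runs a second s///g over just that fragment to wrap each number. This is only a sketch under assumptions. The list wording ("ss. 3, 4(1) and 5 of the ..."), the separator pattern $sep_re, and the function name link_section_lists are all hypothetical, since the real list examples are not shown here. The link target format ("statute name, space, section number") simply follows the brute-force regex above.

```perl
use strict;
use warnings;

# Section-number pattern from the post, with non-capturing groups.
my $section_re = qr/\d+(?:\.\d+)?(?:\([^\)]+\))*/;

# Assumed separators between list items: "," optionally followed by
# "and"/"or", or a bare "and"/"or". Adjust to the real data.
my $sep_re = qr/(?:,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+)/;

# Hypothetical helper: links every number in each "s."/"ss." list
# that is followed by "of [the] <JL:ref2,...>".
sub link_section_lists {
    my ($line) = @_;
    $line =~ s{
        \b(ss?\.\s+)                              # group 1: "s. " or "ss. "
        ($section_re (?: $sep_re $section_re )*)  # group 2: the whole list
        (\s+of(?:\s+the)?\s+)                     # group 3: " of " or " of the "
        (<JL:ref2,"([^">]+)">)                    # group 4: tag, group 5: name
    }{
        # Copy the captures first; the inner substitution resets them.
        my ($prefix, $list, $of, $tag, $name) = ($1, $2, $3, $4, $5);
        $list =~ s{($section_re)}{<JL:ref2,"$name $1">$1</JL>}g;
        "$prefix$list$of$tag";
    }gex;
    return $line;
}

my $line = 'See ss. 3, 4(1) and 5 of the '
         . '<JL:ref2,"Interpretation +Act">Interpretation Act</JL>.';
print "Before: $line\n";
print " After: ", link_section_lists($line), "\n";
```

Because the outer substitution uses /g, several lists in one paragraph are handled in a single pass. Note that this sketch wraps only the numbers, not the leading "s." or "ss.", which differs slightly from the single-section regex above.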

Any help is greatly appreciated.


In reply to Text Processing - Constructing Hyperlinks by meraxes
