meraxes has asked for the wisdom of the Perl Monks concerning the following question:

This humble one begs wisdom of the monks.

This is a text processing problem that I'm having trouble with. Apologies for being long-winded but I wanted to make sure I was very precise.

The Background

I have a flatfile which is coded in Folio 3 format. Folio basically takes a specially coded text file (typesetter's coding crossed with HTML and some index fields) and compiles it into an infobase, which is a hierarchically organised and indexed "book" that also allows hyperlinks to bounce from one section to another. One line in the text file = one record in the infobase = one paragraph. This isn't really important, but if someone knows Folio they may have a little more insight into it. Basically, it's a marked-up text file. Most of my stuff deals with law.

The Problem

My "book" is composed mostly of legislation. Now, my problem is in the linking of sections with statutes (acts, bills, whatever you call them, I don't know all the legal terms and it depends on what country you live in). I have legislation "linked" like this within the paragraphs:

What I need to do is put a link directly to the section as well as the statute. So, after linking, the above text would look like so:

That in itself isn't hard to do, but there are many variations.

The section number isn't necessarily just a number and is best represented by the regex \d+(\.\d+)?(\([^\)]+\))*. So things like 3, 3.1, 3.2(1), 3(1)(b) and 3.5(1)(a)(iv) are all valid.
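As a quick sanity check of that pattern, here is a minimal sketch that tests it against the examples above. The names `$section_re` and `is_section_number` are made up for illustration, the groups are written as non-capturing, and the `\A`/`\z` anchors are an added assumption so that the whole string has to be a section number.

```perl
use strict;
use warnings;

# Section-number pattern from the post, with non-capturing groups.
my $section_re = qr/\d+(?:\.\d+)?(?:\([^)]+\))*/;

# Hypothetical helper: true only if the entire string is a section number.
sub is_section_number {
    my ($candidate) = @_;
    return $candidate =~ /\A$section_re\z/ ? 1 : 0;
}

# The first five are the examples from the post; the last two should be rejected.
for my $candidate ('3', '3.1', '3.2(1)', '3(1)(b)', '3.5(1)(a)(iv)', '3.', '(1)') {
    printf "%-15s %s\n", $candidate,
        is_section_number($candidate) ? 'valid' : 'not valid';
}
```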

The section is followed by the word "of" and then possibly "the" which is followed by <JL:ref2,\"[^\"]+\">. Now here's a brute force regex I have to convert it:

my $line = "... can be found in s. 3 of the <JL:ref2,\"Interpretation Act\">Interpretation Act</JL>. As Wilson J. clearly states...";
print "Before: $line\n";
$line =~ s/ s\. (\d+(\.\d+)?(\([^\)]+\))*) of( the)? (<JL:ref2,\"([^\">]+)\">)/ <JL:ref2,\"$6 $1\">s\. $1<\/JL> of$4 $5/g;
print " After: $line\n";

It's trivial to modify this for lines that read like this (areas to link are in bold):

What's not trivial is something like this:

Is there a regex I can use that will allow me to individually link all of the numbers when I don't know how many there will be in the list? I don't believe so. Sure, I can extract the whole text fragment easily enough, but I need to link each number uniquely. I may also need to do it several times within one paragraph in one pass of the regex. I had also thought of something like a while($line =~ /REGEX/g) but I can't think of how to apply it. Extract the whole section of text and break it up with a split on the commas? I'm sure there's a simple way to do this but I'm awfully brain fried on this one. Looooong day.
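For what it's worth, the "extract the list and split it" idea could be sketched roughly as below. This is only one possible shape, not a tested production solution: `link_each_section` and `link_line` are hypothetical names, the separators are assumed to be exactly ", ", " and " and " or ", and the "s. " or "ss. " prefix is assumed to belong inside the first link only.

```perl
use strict;
use warnings;

# Section-number pattern from the post, with non-capturing groups.
my $section_re = qr/\d+(?:\.\d+)?(?:\([^)]+\))*/;

# Hypothetical helper: give every number in the list its own link.
sub link_each_section {
    my ($prefix, $list, $statute) = @_;
    # The capturing group makes split return the separators as well,
    # so the pieces alternate: number, separator, number, ...
    my @pieces = split /(, | and | or )/, $list;
    for my $piece (@pieces) {
        next unless $piece =~ /\A$section_re\z/;
        # Only the first number keeps the prefix in its link text.
        $piece = qq{<JL:ref2,"$statute $piece">$prefix$piece</JL>};
        $prefix = '';
    }
    return join '', @pieces;
}

# Hypothetical driver: find each list in a paragraph and hand it to the helper.
sub link_line {
    my ($line) = @_;
    $line =~ s{
        ( \b ss? \. [ ] )                    # group 1: the s. or ss. prefix
        ( $section_re
          (?: (?: ,[ ] | [ ]and[ ] | [ ]or[ ] ) $section_re )* )   # group 2: the list
        ( [ ]of (?: [ ]the )? [ ] )          # group 3: of, optionally followed by the
        ( <JL:ref2," ( [^">]+ ) "> )         # group 4: statute tag, group 5: statute name
    }{ link_each_section($1, $2, $5) . $3 . $4 }gex;
    return $line;
}

print link_line('Filling up space in ss. 3, 4, 5, 6, 7 or 8 of the '
              . '<JL:ref2,"Interpretation Act">Interpretation Act</JL>.'), "\n";
```

Because the substitution uses /g, several lists in one paragraph are handled in the same pass. A section number whose parentheses contain a separator, such as a comma followed by a space, would not be split correctly by this sketch.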

Any help is greatly appreciated.

Replies are listed 'Best First'.
Re: Text Processing - Constructing Hyperlinks
by gav^ (Curate) on May 01, 2002 at 23:26 UTC
    I think this might be close, though I'm not exactly sure what output you want for those test cases. You might have to tweak the return statement slightly in fix_xref. The usual warnings about parsing apply: is there optional white space, can the link text contain a '>', etc.
    sub fix_xref {
        my ($parts, $text, $ref, $action, $link_text) = @_;
        # Wrap every section number in the list in its own link;
        # the link text is the section number itself.
        $parts =~ s@(\d+(?:\.\d+)?(?:\([\w\d]+\))*)@<JL:ref$ref,"$action $1">s. $1</JL>@g;
        return qq@$parts $text <JL:ref$ref,"$action">$link_text</JL>@;
    }

    while (<DATA>) {
        s@s\. (.+?) (of(?: the)) <JL:ref(\d+),"([^"]+)">([^<]+)</JL>@fix_xref($1,$2,$3,$4,$5)@eg;
        print;
        print "\n";
    }
    __DATA__
    s. 3 and 4 of the <JL:ref2,"Interpretation Act">Interpretation Act</JL>
    s. 3 or 5 of the <JL:ref2,"Interpretation Act">Interpretation Act</JL>
    s. 3, 4, 5, 6, 7 or 8 of the <JL:ref2,"Interpretation Act">Interpretation Act</JL>

    gav^

      That doesn't do exactly what I want it to, but I see now how to work it! Many thanks!

      The "JL:ref2" part of the link is static. The "ref2" is a link style so it doesn't actually need to be changed so I wasn't sure what it was you were doing with it. I also didn't need the "s." in front of each number. Just the first. You have given me the tools I need to figure it out though!

      UPDATE

      Okay, here's what I developed from it:

      sub link_numbers ($$) {
          my $data = $_[0];
          my $link = $_[1];
          $data =~ s/([s]?s\. )?(\d+(?:\.\d+)?(?:\([^\)]+\))*)/<JL:ref2,\"$link $2\">$1$2<\/JL>/g;
          return $data;
      }

      while ($line = <DATA>) {
          $line =~ s/ ((?:(?:[s]?s\. )?\d+(?:\.\d+)?(?:\([^\)]+\))*(?:, | and | or )?)+)((?: of| the|,)+ )(<JL:ref2,\"([^\">]+)\">)/" " . link_numbers($1,$4) . "$2$3"/ge;
          print $line;
          print "\n";
      }
      __DATA__
      This can be found in ss. 3 and 4 of the <JL:ref2,"Interpretation Act">Interpretation Act</JL>.
      As discussed in s. 3 or 5 of the <JL:ref2,"Interpretation Act">Interpretation Act</JL>.
      Filling up space in ss. 3, 4, 5, 6, 7 or 8 of the <JL:ref2,"Interpretation Act">Interpretation Act</JL> .

      Again, many thanks. You saved my bacon.
