http://qs1969.pair.com?node_id=826689

amir_e_a has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I am proofreading the WikiSource edition of Gesenius' Hebrew Grammar, a famous grammar book of the Biblical Hebrew language.

It has many references to the Bible using a MediaWiki template in this form: {{GHGbible-ref|book=Gn|chapter=2|verse=3}}; this creates a link to Genesis chapter 2 verse 3.

This template is already used in many of the places where it is needed, but I want to insert it automatically everywhere else. Basically, there are many pieces of text that go like this:

<p>this preposition appears in {{GHGbible-ref|book=Gn|chapter=2|verse=3}} and also in 4:5.</p>

I want to turn this into:

<p>this preposition appears in {{GHGbible-ref|book=Gn|chapter=2|verse=3}} and also in {{GHGbible-ref|book=Gn|chapter=4|verse=5}}.</p>

I can do it using this regex:

s/(\{\{GHGbible-ref\|book=([^|]+)\|chapter=\d+\|verse=\d+\}\})(.+?)(\d+):(\d+)/$1$3\{\{GHGbible-ref|book=$2|chapter=$4|verse=$5}}/

The problem: sometimes the source text is

<p>this preposition appears in {{GHGbible-ref|book=Gn|chapter=2|verse=3}} and also in {{GHGbible-ref|book=Ru|chapter=4|verse=5}} and 4:5.</p>

I want to turn this into:

<p>this preposition appears in {{GHGbible-ref|book=Gn|chapter=2|verse=3}} and also in {{GHGbible-ref|book=Ru|chapter=4|verse=5}} and {{GHGbible-ref|book=Ru|chapter=4|verse=5}}.</p>

... but the previous regex turns this into

<p>this preposition appears in {{GHGbible-ref|book=Gn|chapter=2|verse=3}} and also in {{GHGbible-ref|book=Ru|chapter=4|verse=5}} and {{GHGbible-ref|book=Gn|chapter=4|verse=5}}.</p>

In case the difference is hard to spot, it is in the book part: I want 'Ru' and not 'Gn', i.e. the book value of the last GHGbible-ref before (\d+):(\d+), not the first one.
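
For anyone who wants to reproduce this, here is a minimal sketch using the sample line above (without the surrounding paragraph tags). The variable names are only illustrative. The one deliberate deviation is that the braces right after $3 in the replacement are backslash-escaped, on the assumption that Perl would otherwise read `$3{` as a hash subscript.

```perl
use strict;
use warnings;

# Sample line from the question: two templates (Gn, then Ru), then a bare "4:5".
my $text = 'this preposition appears in {{GHGbible-ref|book=Gn|chapter=2|verse=3}} '
         . 'and also in {{GHGbible-ref|book=Ru|chapter=4|verse=5}} and 4:5.';

# The regex engine tries start positions from left to right, so the match is
# anchored on the FIRST template. The lazy (.+?) then stretches across the
# second template until it finally reaches "4:5", and $2 is still 'Gn'.
# The braces after $3 are escaped so that "$3{" is not parsed as a hash lookup.
(my $result = $text) =~
    s/(\{\{GHGbible-ref\|book=([^|]+)\|chapter=\d+\|verse=\d+\}\})(.+?)(\d+):(\d+)/$1$3\{\{GHGbible-ref|book=$2|chapter=$4|verse=$5}}/;

print "$result\n";
# prints the line with the last reference as
# {{GHGbible-ref|book=Gn|chapter=4|verse=5}}  (Gn, not the wanted Ru)
```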

I probably have to modify the (.+?) part so it doesn't match GHGbible-ref, or something like that, but I can't think of a way to do it.

Thanks in advance.

Re: matching any string except a regex
by almut (Canon) on Mar 04, 2010 at 12:45 UTC

    Maybe you could simply add a .* at the beginning of your pattern:

    s/(.*\{\{GHGbible-ref...
       ^^

    This would gobble up everything up to the last {{...}} thingy before the 4:5, so you'd capture 'Ru' instead of the 'Gn' from the first {{...}}.

      Thanks a lot! It seems to do what I need.

      If you have friends who study college-level Hebrew, they probably know Gesenius. Now you can tell them that you contributed to the improvement of its online version.
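
The greedy-prefix suggestion above can be sketched roughly as follows. The helper names and the `1 while` loop (to handle several bare references on one line) are assumptions for illustration, not something posted in the thread. The second helper is an alternative that was not proposed here: a negative lookahead that stops the gap between template and reference from running across another template, which is closer to the "match anything except GHGbible-ref" idea in the question. Both would also rewrite unrelated text of the form N:N (a clock time, say) that follows a template, so treat them as a starting point only.

```perl
use strict;
use warnings;

# Hypothetical helper: greedy-prefix variant.
# The leading .* first swallows the whole line and backtracks only as far as
# needed, so the template that ends $1 is the LAST one that still has a bare
# chapter:verse after it. Each pass fixes one reference; loop until none match.
# The braces after $3 are escaped so that "$3{" is not parsed as a hash lookup.
sub expand_refs_greedy {
    my ($line) = @_;
    1 while $line =~
        s/(.*\{\{GHGbible-ref\|book=([^|]+)\|chapter=\d+\|verse=\d+\}\})(.+?)(\d+):(\d+)/$1$3\{\{GHGbible-ref|book=$2|chapter=$4|verse=$5}}/;
    return $line;
}

# Hypothetical helper: lookahead variant.
# (?:(?!\{\{GHGbible-ref).)+? consumes one character at a time, but only while
# no new template starts at that position, so $3 can never span a template.
sub expand_refs_lookahead {
    my ($line) = @_;
    1 while $line =~
        s/(\{\{GHGbible-ref\|book=([^|]+)\|chapter=\d+\|verse=\d+\}\})((?:(?!\{\{GHGbible-ref).)+?)(\d+):(\d+)/$1$3\{\{GHGbible-ref|book=$2|chapter=$4|verse=$5}}/;
    return $line;
}

my $text = 'this preposition appears in {{GHGbible-ref|book=Gn|chapter=2|verse=3}} '
         . 'and also in {{GHGbible-ref|book=Ru|chapter=4|verse=5}} and 4:5.';

print expand_refs_greedy($text), "\n";
print expand_refs_lookahead($text), "\n";
```

On the sample line both variants should give the bare 4:5 the book 'Ru'. Neither touches a bare reference that appears before the first template on a line, since there is no book to borrow.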

Re: matching any string except a regex
by Fletch (Bishop) on Mar 04, 2010 at 14:07 UTC

    This is probably on the borderline of what should be attempted with just regexen. The approach I'd take would be:

    • Use HTML::TreeBuilder to handle parsing the HTML
    • look_down the tree for p elements
    • Get the text contents of each of those elements and tokenize it on whitespace (i.e. using split)
    • Walk over the list of tokens, ignoring simple text, remembering the last mentioned book for wikirefs, and replacing textual references with the corresponding wikirefs
    • join the resulting items back together and replace the element contents

    When the regex going gets tough, the tough get writing simple parsers and it's not so tough any more.
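
The token-walking part of the steps above might look something like this. It is only a sketch on a plain string: the HTML::TreeBuilder parsing and element replacement are left out so that it runs with core Perl alone. The helper name is made up, and it assumes a template never contains whitespace, so each one arrives as a single token.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite the text content of one paragraph.
sub link_bare_refs {
    my ($text) = @_;
    my $book;    # book of the most recent template seen so far

    # Split on whitespace but keep the separators (capturing group),
    # so that join '' restores the original spacing exactly.
    my @tokens = split /(\s+)/, $text;

    for my $tok (@tokens) {    # $tok aliases the array element
        if ($tok =~ /\{\{GHGbible-ref\|book=([^|]+)\|/) {
            $book = $1;        # remember the last mentioned book
        }
        elsif (defined $book) {
            # Bare reference, possibly with trailing punctuation as in "4:5."
            $tok =~ s/\b(\d+):(\d+)\b/{{GHGbible-ref|book=$book|chapter=$1|verse=$2}}/;
        }
        # anything else is simple text and is left alone
    }
    return join '', @tokens;
}

my $text = 'this preposition appears in {{GHGbible-ref|book=Gn|chapter=2|verse=3}} '
         . 'and also in {{GHGbible-ref|book=Ru|chapter=4|verse=5}} and 4:5.';
print link_bare_refs($text), "\n";
```

Because the book is tracked in a variable while walking left to right, any number of bare references is handled in a single pass, without the backtracking the pure-regex versions rely on.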

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.