matching any string except a regex

by amir_e_a (Hermit)
on Mar 04, 2010 at 11:47 UTC

amir_e_a has asked for the wisdom of the Perl Monks concerning the following question:


I am proofreading the WikiSource edition of Gesenius' Hebrew Grammar, a famous grammar book of the Biblical Hebrew language.

It has many references to the Bible using a MediaWiki template in this form: {{GHGbible-ref|book=Gn|chapter=2|verse=3}}; this creates a link to Genesis chapter 2 verse 3.

This template has already been used in many of the places where it is needed, but I want to apply it automatically everywhere else. Basically, there are many pieces of text that go like this:

<p>this preposition appears in {{GHGbible-ref|book=Gn|chapter=2|verse=3}} and also in 4:5.</p>

I want to turn this into:

<p>this preposition appears in {{GHGbible-ref|book=Gn|chapter=2|verse=3}} and also in {{GHGbible-ref|book=Gn|chapter=4|verse=5}}.</p>

I can do it using this regex:


The problem: sometimes the source text is

<p>this preposition appears in {{GHGbible-ref|book=Gn|chapter=2|verse=3}} and also in {{GHGbible-ref|book=Ru|chapter=4|verse=5}} and 4:5.</p>

I want to turn this into:

<p>this preposition appears in {{GHGbible-ref|book=Gn|chapter=2|verse=3}} and also in {{GHGbible-ref|book=Ru|chapter=4|verse=5}} and {{GHGbible-ref|book=Ru|chapter=4|verse=5}}.</p>

... but the previous regex turns this into

<p>this preposition appears in {{GHGbible-ref|book=Gn|chapter=2|verse=3}} and also in {{GHGbible-ref|book=Ru|chapter=4|verse=5}} and {{GHGbible-ref|book=Gn|chapter=4|verse=5}}.</p>

If you don't spot the difference, then it's in the book part: I want 'Ru' and not 'Gn', i.e. the book value of the last GHGbible-ref before (\d+):(\d+), not the first one.

I probably have to modify the (.+) part so that it doesn't match GHGbible-ref, or something like that, but I can't think of a way to do it.

Thanks in advance.

Re: matching any string except a regex
by almut (Canon) on Mar 04, 2010 at 12:45 UTC

    Maybe you could simply add a .* at the beginning of your pattern:

    s/(.*\{\{GHGbible-ref...
       ^^

    This would gobble up everything up to the last {{...}} thingy before the 4:5, so you'd capture 'Ru' instead of the 'Gn' from the first {{...}}.
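
    A minimal sketch of the greedy-prefix idea, for illustration only. The asker's actual pattern is not shown in the thread, so the pattern below is an assumption, and link_bare_refs is a hypothetical helper name. It assumes templates are not nested and that plain text between templates contains no braces. The leading .* runs to the last template, and the brace-free stretch after it keeps a bare reference from borrowing the book of any template further back.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): wrap each bare "chapter:verse"
# in a GHGbible-ref template, taking the book from the nearest template
# that precedes it.
sub link_bare_refs {
    my ($text) = @_;

    # Each pass converts one bare reference; repeat until none is left.
    1 while $text =~ s/
        \A (
            .*                                          # greedy: run to the LAST template ...
            \{\{GHGbible-ref \| book=(\w+) \| [^{}]* \}\}
            [^{}]*?                                     # ... then brace-free text up to
        )
        \b (\d+) : (\d+)                                # the first bare chapter:verse
    /$1 . "{{GHGbible-ref|book=$2|chapter=$3|verse=$4}}"/xse;

    return $text;
}

my $in = 'see {{GHGbible-ref|book=Gn|chapter=2|verse=3}} and also '
       . '{{GHGbible-ref|book=Ru|chapter=4|verse=5}} and 4:5.';
print link_bare_refs($in), "\n";
```

    The 1 while loop is there because a single substitution only handles one bare reference; once that reference has become a template itself, it is the nearest template for whatever follows it.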

      Thanks a lot! It seems to do what I need.

      If you have friends who study college-level Hebrew, they probably know Gesenius. Now you can tell them that you contributed to the improvement of its online version.

Re: matching any string except a regex
by Fletch (Bishop) on Mar 04, 2010 at 14:07 UTC

    This is probably on the borderline of what should be attempted with just regexen. The approach I'd take would be:

    • Use HTML::TreeBuilder to handle parsing the HTML
    • look_down the tree for p elements
    • Get the text contents of each of those elements and tokenize it on whitespace (i.e. using split)
    • Walk over the list of tokens, ignoring simple text, remembering the last mentioned book for wikirefs, and replacing textual references with the corresponding wikirefs
    • join the resulting items back together and replace the element contents
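
    The token walk in the last three steps can be sketched roughly as follows. This is not the poster's code: link_refs_by_tokens is a hypothetical name, the HTML::TreeBuilder part is left out so the sketch runs on core Perl alone, and it is applied to a paragraph's text as a plain string. It assumes that templates contain no whitespace and that a bare reference is separated from any template by whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): walk whitespace-separated
# tokens, remember the last book seen in a template, and rewrite bare
# "chapter:verse" tokens using that book.
sub link_refs_by_tokens {
    my ($text) = @_;
    my $book;                                   # last book mentioned so far
    my @out;

    # The capturing group makes split return the whitespace too,
    # so join('') restores the original spacing.
    for my $tok (split /(\s+)/, $text) {
        if (my @books = $tok =~ /\{\{GHGbible-ref\|book=(\w+)\|/g) {
            $book = $books[-1];                 # template token: remember, keep as is
        }
        elsif (defined $book) {
            # Plain token: convert a bare reference, keep trailing punctuation.
            $tok =~ s/\b(\d+):(\d+)\b/{{GHGbible-ref|book=$book|chapter=$1|verse=$2}}/;
        }
        push @out, $tok;
    }
    return join '', @out;
}

my $in = 'see {{GHGbible-ref|book=Gn|chapter=2|verse=3}} and also '
       . '{{GHGbible-ref|book=Ru|chapter=4|verse=5}} and 4:5.';
print link_refs_by_tokens($in), "\n";
```

    Keeping the state in a plain variable ($book) is what makes this easier to extend than a single regex, for example if a later rule needs to reset the book at a paragraph boundary.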

    When the regex going gets tough, the tough get writing simple parsers and it's not so tough any more.

