Some of the comments in a node about a regex problem got me thinking about the maintainability of regexes versus alternative solutions. The regex in question, after some patching (with heartfelt thanks to Dermot and others for mega-help), looks like the following:

        $data =~ s/
                    (                     # Capture to $1
                        <a\s              #     <a and a space character
                        (?:               #     Non-capturing parens
                            [^>](?!href)  #         All non > not followed by href
                        )*                #         zero or more of them
                        .?
                        href\s*           #     href followed by zero or more space characters
                    )
                    (                     # Capture to $2
                        &\#61;\s*         #     = plus zero or more spaces
                        (                 # Capture to $3
                            &[^;]+;       #     some HTML character code (probably " or ')
                        )?                #     which might not exist
                        (?:               #     Non-capturing parens
                            .(?!\3)       #     any character not followed by $3
                        )+                #     one or more of them
                        .?
                        (?:
                            \3            #     $3
                        )?                #     (which may not exist)
                    )
                    (                     # Capture to $4
                        [^>]+             #     Everything up to final >
                        >                 #     Final >
                    )
                  /$1 . decode_entities($2) . $4/gsexi;

Note that the regex is complicated enough that I've even indented the comments to help some poor programmer behind me maintain it. As it turns out, it still has two very subtle problems (irrelevant to this discussion) that arise only under rare circumstances. How would you even find those problems? Heck, if I were really evil, I could put the regex on one line and make the task virtually impossible for the average programmer:

$data =~ s/(<a\s(?:[^>](?!href))*.?href\s*)(&\#61;\s*(&[^;]+;)?(?:.(?!\3))+.?(?:\3)?)([^>]+>)/$1.decode_entities($2).$4/gsei;
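One way to hunt for problems like these is to wrap the substitution in a sub and feed it a table of sample inputs. The following is only a sketch of that idea: `fix_href`, the sample strings, and the local `decode_entities` stand-in are all hypothetical (the real function lives in HTML::Entities; the stand-in is here so the sketch runs on core Perl and only knows the two entities used in the samples).

```perl
use strict;
use warnings;

# Stand-in for HTML::Entities::decode_entities (assumption: only the two
# entities that appear in the samples below need decoding).
sub decode_entities {
    my ($text) = @_;
    $text =~ s/&\#61;/=/g;
    $text =~ s/&quot;/"/g;
    return $text;
}

# Hypothetical wrapper around the one-line substitution from the post,
# so it can be called on one sample at a time.
sub fix_href {
    my ($data) = @_;
    $data =~ s/(<a\s(?:[^>](?!href))*.?href\s*)(&\#61;\s*(&[^;]+;)?(?:.(?!\3))+.?(?:\3)?)([^>]+>)/$1.decode_entities($2).$4/gsei;
    return $data;
}

# Hypothetical sample inputs; add a row for each new edge case.
my @cases = (
    [ 'href followed by another attribute',
      '<a href&#61;&quot;x.html&quot; target&#61;&quot;_blank&quot;>' ],
    [ 'href as the last attribute',
      '<a href&#61;&quot;x.html&quot;>' ],
);

# Print each result next to its label so oddities are easy to spot.
for my $case (@cases) {
    my ($name, $input) = @$case;
    printf "%-36s %s\n", $name, fix_href($input);
}
```

Traced by hand, the first sample comes back as `<a href="x.html" target&#61;&quot;_blank&quot;>`, while the second comes back as `<a href="x.html&quot;>`, because `[^>]+` in the last group needs at least one character before the final `>` and takes the closing entity. Whether that is one of the two problems mentioned above is not stated in the post.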

When I made the original post, tilly pointed out right away that he wouldn't use a regex to solve the problem (gasp!). That got me thinking: since I love regexes, I tend to use them a lot. They're fast (if properly written), but many programmers don't grok them. Heck, even some of my simpler regexes are complicated:

$number =~ /((?:[\d]{1,6}\.[\d]{0,5})|(?:[\d]{0,5}\.[\d]{1,6})|(?:[\d]{1,7}))/;

That one just guarantees that a user-entered number fits my format. Aack!
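For comparison, here is a sketch of the same number check built from named `qr//x` pieces. The variable names and `is_valid_number` are hypothetical; the digit limits are copied from the one-liner. One assumption is added: the one-liner has no anchors, so as written it matches the first qualifying run anywhere in the string, and the `\A` and `\z` below are a guess at the intent rather than part of the original.

```perl
use strict;
use warnings;

# Each alternative gets a name; limits are taken from the one-line regex.
my $digits_then_point = qr/ \d{1,6} \. \d{0,5} /x;   # "123." or "123.45"
my $point_then_digits = qr/ \d{0,5} \. \d{1,6} /x;   # ".5" or "12.345"
my $plain_integer     = qr/ \d{1,7} /x;              # "1234567"

my $number_format = qr/
    \A                          # assumption: nothing before the number
    (
        $digits_then_point
      | $point_then_digits
      | $plain_integer
    )
    \z                          # assumption: nothing after the number
/x;

# Returns 1 when the whole input fits the format, 0 otherwise.
sub is_valid_number {
    my ($input) = @_;
    return $input =~ $number_format ? 1 : 0;
}

print is_valid_number('123.45'),   "\n";   # 1
print is_valid_number('12345678'), "\n";   # 0 (eight digits, no point)
```

Naming the pieces costs a few lines, but each alternative can then be read, tested, and changed on its own.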

tilly's comment, however, got me to thinking: how do Perlmonks create maintainable regexes, or do they avoid them in favor of more obvious solutions? I pride myself on writing clear, maintainable code with tons of comments. My beloved regexes, however, are the fly in my ointment of clarity. How do YOU deal with this?

Cheers,
Ovid

Join the Perlmonks Setiathome Group or just go to the link and check out our stats.

In reply to Regexes vs. Maintainability by Ovid
