Really Long if/elsif/else blocks

throop has asked for the wisdom of the Perl Monks concerning the following question:

Brethren,

Are there any good tips on breaking up really long if / elsif / else blocks?

Case in point. My code performs natural language processing. There's a routine driveBase. It takes a token (essentially, an English word or some other delimited string.) Given a token it doesn't recognize, it recursively snips off prefixes and suffixes, looking for a recognized token. Once it's found one, it deduces the part-of-speech of the original token. It records this – plus some other bookkeeping taken from the base token.

Thing is, English is real irregular and there is a long list of special cases. So I've got the following code, with 29 clauses in the if / elsif / else.

 
use strict;
# driveBase drives single tokens to baseform.  Its returned value is
# true for success.  It will always inscribe $wordCache{tok}{group} if
# it isn't already done.

sub driveBase{
    my($tok, $Args) =@_;
    return $wordCache{$tok} if $wordCache{$tok}{inscript};
    my $checkPrefix  # Check the prefix unless explicitly told not to.
        = defined($Args->{checkPrefix}) ? $Args->{checkPrefix} : 1;  
    my $skipPrefix = ! $checkPrefix;
    my($base,$atail,$ntail) = &split_token($tok); 
    my($sing, $unprefBase, $savgrp, $savbase);
    if($atail and $ntail ne ''){       # If there's both an atail and 
+an ntail.
        &driveBase("$base,$atail", $Args);   # The base,tail can be a 
+different group than the base.
        &equivWord($tok, "$base,$atail", 
                   {ntail => $ntail})}        # Treat GYROSCOPE,CONTRO
+L MOMENT,3 same as GYROSCOPE,CONTROL MOMENT
   #  Assume that if $wordCache{$base,,$ntail}{group}, it's the same a
+s $wordCache{$base}{group}
    # E.g. if SOFTGOODS LAB is a type of LAB, then it has the same gro
+up as LAB, unless we're told otherwise.
    elsif(!$atail and $ntail ne ''){          # Has a numeric tail but
+ no alpha tail.
        &driveBase($base, $Args);
        &equivWord($tok, $base,
                   {ntail => $ntail})}
    elsif($wordCache{$base} and $wordCache{$base}{group}
          and $wordCache{$base}{group} ne 'UNKNOWN'
          and $base ne $tok){
        unless(member($wordCache{$base}{group}, qw(TIMEPERIOD PARTNO R
+ANGE EXTENSION))
               or  $Args -> {nocomplain}){
            warn_once('driveBase', "WordCache should have been filled 
+in when we asserted '$atail' isa type of '$base' ")};
        &equivWord($tok, $base, {atail => $atail});
        1}
    elsif(not ($atail eq '' and $ntail eq '')){  # There's either an a
+tail or ntail but the form hasn't been seen before.
        &driveBase($base, $Args);
        &equivWord($tok, $base, {atail =>$atail, ntail=>$ntail})}
    elsif($checkPrefix and          # This is the main check for prefi
+xed forms
          &is_prefixed_partspeechAnnotated($base)){  # wordCache{base}
+ gets filled in.
        unless($tok eq $base){      #$prefixLessFm Mostly, base WILL e
+q tok, but catch the odd case.
            &equivWord($tok, $base, {atail => $atail})};
        1}
    # If $checkPrefix is false, then we've already stripped a prefix, 
    #  so it can't be an abbreviation, a number or an uncannical form.
    elsif($checkPrefix and $canon{$tok} and $canon{$tok} ne $tok){ 
        &driveBase($canon{$tok}, {%$Args, checkPrefix=>0});
        &equivWord($tok, $canon{$tok}, {})}
    elsif(($sing) = $tok =~ /(.+)S$/ and $canon{$sing} and $canon{$sin
+g} ne $tok){  # So REQTS => REQT => REQUIREMENT
            &equivWord($tok, $canon{$sing}, {plural=>1})}
    elsif($checkPrefix and $abbrev{$tok}){    # Not normally invoked, 
+as abbrevs have already been expanded usually.
        &equivWord($tok, $abbrev{$tok}, {abbreviation =>1})}
    elsif($checkPrefix and 
          (&part_number_p($base) or $atail eq 'PARTNO')){
        # P/N and PN: <partno> get transformed in immediate_transforms
+; see the improves file
        &baseInscript($tok, {group => 'PARTNO', 
                             root => $tok})}
    elsif($checkPrefix and &power_loc_p($tok)){
        &baseInscript($tok, {group => 'POWERLOC', root => $tok})}
    elsif($checkPrefix and &is_loc_code($tok)){
        &baseInscript($tok, {group =>  'LOC_CODE', root => $tok})}
    elsif($checkPrefix and $savbase = &is_measure_physics_frac($tok)){
        $num_item{$tok} = 1 unless defined($num_item{$tok});
        &baseInscript($tok, {group => $savbase,
                             num_item =>1, 
                             root => $tok})}
    elsif($savbase = &is_gerund($base, 1)){
        warn_once('GroupOf2', "Gerund $tok with base $base and tails '
+$atail', '$ntail'")
            if $atail or $ntail;
        &equivWord($tok, $savbase, {group=> 'GERUND',
                                    atail => $atail})}
    elsif($savbase = &is_pastverb($base, 1)){
        &equivWord($tok, $savbase, {group=> 'VERB',  # Even if the bas
+eform is NVRB.
                                    atail => $atail,
                                    verbtense => 'PAST'})}
    elsif(&is_prefixed_verb($base, 1)){
        warn_once('GroupOf2.b', "Why wasn't $base of $tok found earlie
+r?")}
    elsif($savbase = &is_ize_verb($base, 1)){
        &driveBase($savbase, {checkPrefix => 0});    # usually a no-op
+, but every once in a while...
        &equivWord($tok, $savbase, {group=> 'VERB',  # Even if the bas
+eform is NVRB.
                                    atail => $atail})}
    elsif($sing = &is_3rdpersonverb($base, 1)){     # BOXES Could be e
+ither the plural or the 3rd person sing.
        &equivWord($tok, $sing, {s_ending => 1,
                                 atail => $atail})}
    elsif($savbase = &is_derived_noun($base, 1)){  # '-TION' words etc
+., REFUSAL, REBUTTAL, GOODNESS, BUSINESS
        &equivWord($tok, $savbase, {group => 'NOUN',  # A few NVRBS li
+ke POSITION and CONDITION will be called out explicitly.
                                    atail => $atail},
                   {no_use =>[qw(actorform)]})}
    elsif($savbase = &is_derived_adj($base, 1)){
        &equivWord($tok, $savbase, {group =>         'ADJECTIVE', 
                                    atail => $atail},
                   {no_use =>[qw(actorform)]})}
    elsif($savbase = &is_adverb($base, 1)){
        &equivWord($tok, $savbase, {group => 'ADVB',  
                                    atail => $atail})}
    elsif($types{$base}){                # presumably this token will 
+get gobbled
        warn_once('GroupOf2.c',         # by one of the neighboring to
+kens.
                  "Why does modifier $base have a tail $atail") if $at
+ail;
        &baseInscript($tok, {group =>  'MODIFIER', root => $tok})}    
+     
    elsif($sing = &singularOfNoun($tok, 1)){
        &equivWord($tok, $sing, {plural=>1})}
    elsif($base =~ /^[a-z,A-Z]$/){        
        &baseInscript($tok, {group =>  'LETTER', root => $tok})}
    # if $checkPrefix is false, then we've already stripped a prefix o
+ff the tok, so it can't be a roman numeral
    elsif($checkPrefix and (&number_p($tok) or &roman_number_p($tok)))
+{
        &baseInscript($tok, {group => 'NUMBER', root => $tok})}
    elsif(is_alphnum($base)){
        &baseInscript($tok, {group => 'ALPHNUM', root => $tok})}
    elsif($base =~ /^$punctuation_tokens$/){
        &baseInscript($tok, {group => 'PUNCTUATION', root => $tok})}
    # So COUNTERDOFFABLE gets tested as DOFFABLE and returns ADJECTIVE
    elsif($checkPrefix and 
          $unprefBase and         # Set in an earlier test, above
          $savgrp = &group_of($unprefBase, 1) and  # This time, try to
+ derive a group for the base.
          &member($metagroup{$savgrp}, 'NVRB', 'NOUN','VERB','ADJECTIV
+E','ADVB', 'GERUND')){}
    else{
        &baseInscript($tok, {group => 'UNKNOWN', root => $tok})};
    $wordCache{$tok}}
[download]

(This reads a lot better in emacs with the hilite package.)

The tests are all different. The order in which the tests are done is (sometimes) important. Some of the intermediate results from one test are re-used in a later test. The code runs great.

But the function runs 114 lines, over 7000 characters. I'd like to break it up into smaller functions (or make it smaller in some other way.) I don't expect anybody to refactor my code for me. But can anybody suggest a coding approach that is more maintainable, more compact, and still readable?

throop
==============================

Update: I considered using dispatch tables, but they didn't seem to me to fit this problem, because

I wanted tight control over the order of tests
Many of my tests involve function calls, not just eq tests
Intermediate results from earlier tests are saved in 'my' variables and used by later tests.
Or am I missing something?
throop

Comment on Really Long if/elsif/else blocks Download Code

Replies are listed 'Best First'.
Re: Really Long if/elsif/else blocks by MidLifeXis (Monsignor) on May 30, 2008 at 18:38 UTC
I wanted tight control over the order of tests use an array or arrayref to store your list of tests. Many of my tests involve function calls, not just eq tests a dispatcher can use a coderef (anonymous sub) to determine if the dispatch is to be used. Intermediate results from earlier tests are saved in 'my' variables and used by later tests. use a state variable of some sort that you pass to each dispatch selector. This would then hold your intermediate values. Then again, I tend to oversimplify things. --MidLifeXis	[reply]
Re^2: Really Long if/elsif/else blocks by throop (Chaplain) on May 30, 2008 at 19:32 UTC
Thanks. But I'm still puzzled. The array, the dispatcher, the state variables. I'm having a really hard time imagining how that would all come together, especially in a way that would be as easy to follow as the original code. Here's a simplified, but still tangled, if/elsif/else block. Could anybody take a gander at what the code would look like with the array, the dispatcher and the state variables? Read more... (696 Bytes)	[reply] [d/l]
Re^3: Really Long if/elsif/else blocks by MidLifeXis (Monsignor) on May 30, 2008 at 20:19 UTC
I'll give it a shot.... Read more... (1036 Bytes) What is nice, is that you can take this to the next level, and break some of your related tests into modules, and write the above in a manner that would allow those modules to register new checks (put `$checks` setup into `registerCheck()` and the loop into `runChecks`). `registerCheck(test => sub {...}, action => sub {...});` [download] I have used this method for testing to see if lists of documents and attributes meet certain criteria, saving state for the entire run, and reporting not only on individual items, but the results of the entire run. The logic looks very similar to what I have above. --MidLifeXis	[reply] [d/l] [select]
Re^4: Really Long if/elsif/else blocks by perreal (Monk) on May 30, 2008 at 22:10 UTC
Re: Really Long if/elsif/else blocks by zentara (Cardinal) on May 30, 2008 at 18:10 UTC
Howto avoid large hard-coded else-if function calls with a dispatch table? I'm not really a human, but I play one on earth CandyGram for Mongo	[reply]
Re^2: Really Long if/elsif/else blocks by Fletch (Bishop) on May 30, 2008 at 18:32 UTC
Along the same lines but going the OOP-y route would be to hide the differing behavior behind a method (similar to the idea behind the replacing the conditional with polymorphism refactoring). Update: although after actually reading the code in detail (`Wall of Text crits you for 195202. You die.` :) I see why you've discounted dispatch tables (and for similar reasons my original make-it-a-method wouldn't be a great idea either since there's not a clear hierarchy). I almost want to say that this makes me want to look for something more along the lines of a state machine with separate subs implementing the different branches, but then again I'm drawing a blank on an immediate way to implement it. Not that a state machine would decrease the checks and conditionals, just that it might help break up the flow with more meaningful names (e.g. start with `check_for_atail_and_ntail` which either handles it or punts to `check_for_noatail_with_ntail`). Hrmmm . . . The cake is a lie. The cake is a lie. The cake is a lie.	[reply] [d/l] [select]
Re: Really Long if/elsif/else blocks by apl (Monsignor) on May 30, 2008 at 19:25 UTC
One suggestion I'd make (if you can't use routing tables) would be to change: elsif($checkPrefix and # This is the main check for prefi +xed forms &is_prefixed_partspeechAnnotated($base)){ # wordCache{base} + gets filled in. unless($tok eq $base){ #$prefixLessFm Mostly, base WILL e +q tok, but catch the odd case. &equivWord($tok, $base, {atail => $atail})}; 1} # If $checkPrefix is false, then we've already stripped a prefix, # so it can't be an abbreviation, a number or an uncannical form. elsif($checkPrefix and $canon{$tok} and $canon{$tok} ne $tok){ &driveBase($canon{$tok}, {%$Args, checkPrefix=>0}); &equivWord($tok, $canon{$tok}, {})} elsif(($sing) = $tok =~ /(.+)S$/ and $canon{$sing} and $canon{$sin +g} ne $tok){ # So REQTS => REQT => REQUIREMENT &equivWord($tok, $canon{$sing}, {plural=>1})} elsif($checkPrefix and $abbrev{$tok}){ # Not normally invoked, +as abbrevs have already been expanded usually. &equivWord($tok, $abbrev{$tok}, {abbreviation =>1})} elsif($checkPrefix and (&part_number_p($base) or $atail eq 'PARTNO')){ # P/N and PN: <partno> get transformed in immediate_transforms +; see the improves file &baseInscript($tok, {group => 'PARTNO', root => $tok})} elsif($checkPrefix and &power_loc_p($tok)){ &baseInscript($tok, {group => 'POWERLOC', root => $tok})} elsif($checkPrefix and &is_loc_code($tok)){ &baseInscript($tok, {group => 'LOC_CODE', root => $tok})} elsif($checkPrefix and $savbase = &is_measure_physics_frac($tok)){ $num_item{$tok} = 1 unless defined($num_item{$tok}); &baseInscript($tok, {group => $savbase, num_item =>1, root => $tok})} [download] to `elsif( $checkPrefix ) { CheckPrefaces(); }` [download] Where CheckPrefaces would be a new routine that contained: if(is_prefixed_partspeechAnnotated($base)){ # wordCache{base} gets + filled in. unless($tok eq $base){ #$prefixLessFm Mostly, base WILL e +q tok, but catch the odd case. &equivWord($tok, $base, {atail => $atail})}; 1} # If $checkPrefix is false, then we've already stripped a prefix, # so it can't be an abbreviation, a number or an uncannical form. elsif($canon{$tok} and $canon{$tok} ne $tok){ &driveBase($canon{$tok}, {%$Args, checkPrefix=>0}); &equivWord($tok, $canon{$tok}, {})} elsif(($sing) = $tok =~ /(.+)S$/ and $canon{$sing} and $canon{$sin +g} ne $tok){ # So REQTS => REQT => REQUIREMENT &equivWord($tok, $canon{$sing}, {plural=>1})} elsif($abbrev{$tok}){ # Not normally invoked, as abbrevs have a +lready been expanded usually. &equivWord($tok, $abbrev{$tok}, {abbreviation =>1})} elsif((&part_number_p($base) or $atail eq 'PARTNO')){ # P/N and PN: <partno> get transformed in immediate_transforms +; see the improves file &baseInscript($tok, {group => 'PARTNO', root => $tok})} # etc., etc. etc. [download] That is, remove the constant test from the secondary conditions.	[reply] [d/l] [select]
Re^2: Really Long if/elsif/else blocks by throop (Chaplain) on May 30, 2008 at 20:12 UTC
Well.. it would look cleaner. But I don't think it would fall through correctly. You see, if none of those `elsif($checkPrefix and` whatever`)` tests succeed, it still needs to fall through to the following tests, even if $checkPrefix were true.	[reply]
Re^3: Really Long if/elsif/else blocks by graff (Chancellor) on May 31, 2008 at 04:56 UTC
I tend to agree with apl on this point (and I also agree with the person who suggested changing your indent/line-break style). Apart from that, I was struck by the variety of ways in which you phrase similar conditions, and by the seemingly arbitrary ordering of conditions. If your "split_token()" sub is returning a 3-element list whose members vary significantly among all of: /"undef", "defined but empty string", "non-empty string but evaluates to 0", "non-empty, non-zero value"/, I think you should be making that more clear. When I see conditions like: `if ( $atail and $ntail ne '' ) { ... elsif ( !$atail and $ntail ne '' ) { ... elsif ( not ($atail eq '' and $ntail eq '' )) {` [download] I can't help thinking that it should probably be: `if ( $atail and $ntail ) { ... elsif ( $atail ) { ... elsif ( $ntail ) { ...` [download] If it needs to be different from that, you should be really clear about why it needs to be different. I was puzzled by the last elsif condition: `elsif($checkPrefix and $unprefBase and # Set in an earlier test, above $savgrp = &group_of($unprefBase, 1) and # This time, try to +derive a group for the base. &member($metagroup{$savgrp}, 'NVRB', 'NOUN','VERB','ADJECTIV +E','ADVB', 'GERUND')){}` [download] When I looked for "$unprefBase" above that point, I found it only once, near the top of the sub: `my($sing, $unprefBase, $savgrp, $savbase);` [download] Nothing is ever assigned to $unprefBase, so the following two parts of that last elsif condition will never be evaluated -- and the block itself is empty anyway, so why have that condition at all? (That's the first thing I'd get rid of, if the idea is to shorten the list of elsif blocks.) You might want to set up some test data, such that you believe it should exercise every block, and then see whether it really does get into every block. I think the issue is not so much to shorten the list of elsif conditions, but rather to organize them into something like a coherent order and structure, so that sequential dependencies, and the extra assignment steps that affect outcomes, are more easily (and logically) identifiable as such. Another point: at the top you say "driveBase returns true for success...", but then it contains recursive calls to driveBase where the return value is not checked. In fact, since "wordCache{$tok}" is the last statement in the sub, it will only return false if this hash element is 0, empty string, or undef. And what would that mean, exactly?	[reply] [d/l] [select]
Re^3: Really Long if/elsif/else blocks by apl (Monsignor) on May 31, 2008 at 10:54 UTC
Good catch. The new sub should be a function then that would return a 1 by default, but a 0 if none of the conditions were met (i.e. the final else would set it). Then the conditional in the original code would be `elsif($checkPrefix and CheckPrefaces() ) { } # do nothing; processing was handled in CheckPrefaces` [download] It looks a trifle odd, but I still prefer factoring out a related block of code. Everything works fine in the original body of code, but someday it will have to be modified, and the supporting programmers eyes will glaze over...	[reply] [d/l]
Re: Really Long if/elsif/else blocks by educated_foo (Vicar) on May 31, 2008 at 03:04 UTC
A less repulsive bracketing style would help, i.e. one with a bit more space to it. Instead of `if($foo){ $blah} elsif($bar){ $baz}` [download] something a bit more normal like `if ($foo) { $blah } elsif ($bar) { $baz }` [download] or `if ($foo) { $blah } elsif ($bar) { $baz }` [download] makes it easier to see where blocks begin and end. Also, you should use an editor that lets you move between matching delimiters, e.g. "%" in vi or "C-M-f" and "C-M-b" in Emacs.	[reply] [d/l] [select]
Re^2: Really Long if/elsif/else blocks by mandarin (Hermit) on May 31, 2008 at 20:29 UTC
Perl Best Practices by Damian Conway (O'Reilly, 2005) clearly states: "Don't cuddle an else". The recommended way is `if ( $foo ) { $blah; } elsif ( $bar ) { $baz; } else { $boom; }` [download]	[reply] [d/l]
Re^3: Really Long if/elsif/else blocks by MidLifeXis (Monsignor) on May 31, 2008 at 22:47 UTC
While PBP contains many good recommendations, that is what they are. They are not law, requirements, or to be followed blindly. --MidLifeXis	[reply]
Re^3: Really Long if/elsif/else blocks by educated_foo (Vicar) on Jun 01, 2008 at 03:31 UTC
So what? There are many styles that allow one to easily line up sequences of if/elsif/.../else. "Person X does this, therefore X is the only way" is poor reasoning.	[reply]
Re^3: Really Long if/elsif/else blocks by Anonymous Monk on Jun 01, 2008 at 12:05 UTC
one of the most petty things included in said book, that he gets the most heat for (its like spaces v. tabs )	[reply]
Re^3: Really Long if/elsif/else blocks by parv (Parson) on Jun 01, 2008 at 08:05 UTC
Nope, there is only One True Way ... `if ( ... ) { ... } else { ... }` [download] `;>}` ... but really <what others have already stated>.	[reply] [d/l]
Re: Really Long if/elsif/else blocks by pc88mxer (Vicar) on May 30, 2008 at 22:45 UTC
Case in point. My code performs natural language processing. There's a routine driveBase. It takes a token (essentially, an English word or some other delimited string.) Given a token it doesn't recognize, it recursively snips off prefixes and suffixes, looking for a recognized token. Once it's found one, it deduces the part-of-speech of the original token. It records this � plus some other bookkeeping taken from the base token. 1) This sounds very much like a search task for which a language like Prolog would work well. I'm not saying you need to write it in Prolog, but pondering how you might do it in a logic programming language might be helpful. 2) I think the ultimate way to do this is to just look up the answer (from a database.) Of course, you still have to first generate (or locate an existing) database, and this code can help to build it. I doubt, however, that you'll ever be able to come up with a complete set of rules, and eventually you'll need to "manually specify" what the answer is for certain input words. 3) Do you know about Wordnet? It's a lexical database of English which is advertised as being useful for natural language processing.	[reply]
Re: Really Long if/elsif/else blocks by Zen (Deacon) on May 30, 2008 at 21:46 UTC
If the problem you're trying to solve is readability, dispatch tables are quite valid. You can still iterate over conditions, check return values, and so on, except in a fashion that is a lot clearer than what you have now. What is the magic bullet you are looking for? You have a lot of refactoring to do, no simple way out.	[reply]
Re: Really Long if/elsif/else blocks by doom (Deacon) on May 31, 2008 at 00:04 UTC
Okay, well it looks like you've done the obvious first thing: you've more-or-less broken out every if clause as a subroutine call. So your structure of logical tests may be a mess, but all they do is decide which sub to call. A minor readability hint: drop the leading "&" on sub calls. It's gone out of style, unless you want the old-fashion calling semantics that automatically passes through @_. As for cleaning up the logical tests, I don't know what to suggest except that you should write an extensive series of automated tests (e.g. using Test::More and friends) so then you can try experimental refactoring and know immediately if you've broken something. (By the way, there is a AI::Prolog module, so you can do Prolog type stuff from within perl. Could be fun to play with, though I'm not sure it's ready for "production use".)	[reply]
Re: Really Long if/elsif/else blocks by John M. Dlugosz (Monsignor) on May 31, 2008 at 01:40 UTC
You might look into the given/when statement now available in 5.10. That is not a substitute for refactoring, but might make the existing logic somewhat cleaner, as a first step.	[reply]
Re: Really Long if/elsif/else blocks by Starky (Chaplain) on Jun 05, 2008 at 06:06 UTC
If I interpret what you're trying to do correctly, I can't believe no one has pointed you to Parse::RecDescent, a module used for parsing language, whether a programming language or natural language. You specify a set of rules that define language tokens in what's called a grammar, and then for various components of the grammar, you can identify callbacks (called "actions" in Parse::RecDescent) that specify what should be done when a token is encountered. The module then recursively parses a string based on your grammar and executes any defined actions as tokens are encountered. It's one of the gems of the CPAN. It's incredibly powerful and would clean up your logic and syntax considerably, though as an incredibly powerful and sophisticated tool, it is also complex. There is a learning curve, and you'll find that it will take you at least a few days to perhaps a couple of weeks to become comfortable with it and several weeks to inculcate its more esoteric functionality. But, again, if I interpret what you're trying to do correctly, trust me, the learning curve is well worth it in the long run.	[reply]