bobf has asked for the wisdom of the Perl Monks concerning the following question:

Update: Perhaps a more appropriate title for this node is Minimizing code redundancy vs over-parameterizing

Hi,

I recently refactored a program and eliminated a lot of redundant code. Now I have a series of functions that are structured like this (simplified example):

sub foo {
    my ( $type ) = @_;
    my %data = (
        type1 => { dir => 'a', text1 => 'text-a1', text2 => 'text-a2' },
        type2 => { dir => 'b', text1 => 'text-b1', text2 => 'text-b2' },
        ...
        type9 => { dir => 'i', text1 => 'text-i1', text2 => 'text-i2' }
    );

    # do stuff with $data{$type}{dir|text1|text2}

    my $barresult = bar( $type );
    my $bazresult = baz( $type );
}

sub bar {
    my ( $type ) = @_;
    my %data = (
        type1 => { url => 'a', regex => '\d\d(\d\d)(\d\d)(\d\d)' },
        type2 => { url => 'b', regex => '(\d\d)(\d\d)(\d\d)' },
        ...
        type9 => { url => 'i', regex => '\.(\d{8})\.' }
    );

    # do stuff with $data{$type}{url} and $data{$type}{regex}
}

sub baz {
    # similar to &bar
}

Initially, each function was duplicated for each $type (&type1_foo, &type1_bar, &type1_baz, &type2_foo, etc.), and the data now in the %data hashes was hard-coded. The code is now easier for me to read and maintain, and all the guts are hidden behind a single call to foo with only one parameter: foo( 'type1' ). However, the code portion of each function is only 20 or 30 lines, which is sometimes shorter than the %data hash that is initialized for each sub, and I am starting to feel like the subroutines are dominated by the hashes rather than by actual code.

I considered pulling all of the %data hashes out and combining them, then passing a hashref ($data{$type}). While I think that removing the data elements from the subs that use them will clean up the code, it will also obscure the purpose of each data element (which would make adding new $types more difficult). I also considered converting the subs into class methods and creating instances for each $type, but for some reason that seems like overkill, since all of the data elements would have to be specified as instance attributes anyway. Or should I create separate classes for each $type?
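
To make that first option concrete, here is a rough sketch of what I have in mind (the keys are the ones from the example above; the sub bodies are placeholders):

my %data = (
    type1 => {
        dir   => 'a',
        text1 => 'text-a1',
        text2 => 'text-a2',
        url   => 'a',
        regex => '\d\d(\d\d)(\d\d)(\d\d)',
    },
    # ... type2 through type9 ...
);

sub foo {
    my ( $info ) = @_;    # $info is $data{$type}
    # do stuff with $info->{dir}, $info->{text1}, $info->{text2}
    my $barresult = bar( $info );
    my $bazresult = baz( $info );
}

sub bar {
    my ( $info ) = @_;
    # do stuff with $info->{url} and $info->{regex}
}

foo( $data{type1} );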

How should I structure a series of relatively short functions that each require a handful of parameters, without having to pass all the parameters around? Was I better off before refactoring? Is OO the way to go here?

Thanks!!

Replies are listed 'Best First'.
Re: Structuring code
by gaal (Parson) on Jan 06, 2005 at 08:39 UTC
    It looks like you have one registry that maps from types to dirs and texts, and to urls and regexes. Then you have different functions that operate on particular subkeys for particular types. ("Type" is a name you are using; in more abstract notation it is just another key that happens to be top-level.)

    This suggests a few things.

    1. There is one registry. Make it global.
    2. You have business-logic functions that do particular things on one type at a time. So these functions should get the type (probably by name -- that's easier), and know about the registry. Since we are making the registry global (or if this is a package, at least a lexical with the same scope as the package) you don't need to explicitly pass a reference to the registry.
    3. You need generic functions for manipulating the registry too. You'd mentioned adding a new type; maybe that's all you need right now but this approach will also let you cleanly write code to delete a type, clone it, and so on. Implement these as you need them, but the trick is--
    4. What is referred to by a type always has the same structure. This suggests a class (though you don't have to go OO all the way if you don't want to). You then write a "loader" function to translate from data to a series of objects that will be less bulky (visually, at least) than writing out the code explicitly, inline. The big win here is that you can also pull out the data from the code completely and put it in a configuration file (or a database, or whatever). When you update the structure of a type, you have to update this loader, but now you have an encapsulated object which everybody accesses through a well-defined interface, so they don't have to change their code.
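
    For illustration, a minimal sketch of what the loader in point 4 might look like (the FileType class name and the helper names are invented for this example; only the field names come from your post):

    # a sketch only: a tiny class plus a loader that builds the registry
    package FileType;    # hypothetical class name
    use strict;
    use warnings;

    sub new {
        my ( $class, %args ) = @_;
        my $self = {};
        # fields taken from the original %data hashes
        $self->{$_} = $args{$_} for qw( dir text1 text2 url regex );
        return bless $self, $class;
    }

    # simple read-only accessors
    sub dir   { $_[0]->{dir} }
    sub url   { $_[0]->{url} }
    sub regex { $_[0]->{regex} }

    package main;

    # "loader": translate raw data (which could just as easily come from a
    # configuration file or database) into a registry of objects
    my %raw = (
        type1 => { dir => 'a', text1 => 'text-a1', text2 => 'text-a2',
                   url => 'a', regex => '\d\d(\d\d)(\d\d)(\d\d)' },
        # ... type2 through type9 ...
    );
    my %registry;
    $registry{$_} = FileType->new( %{ $raw{$_} } ) for keys %raw;

    # business-logic code then asks the registry for an object by name
    my $file_type = $registry{type1};
    print $file_type->dir, "\n";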

    Hope this helps!

      Thanks for your comments, gaal. It sounds like you're leaning towards one of my first ideas ("pulling all of the %data hashes out and combining them [into a global hash], then passing a hashref ($data{$type})"), even if it is implemented with an OO approach.

      1. There is one registry. Make it global.
      This was my first inclination, but I was torn between that and keeping the data elements localized to the sub they are used in. I thought it would be easier to understand the code if the data was right there next to it, but maybe I'm better off keeping all the data together.

      2. ...these functions should get the type (probably by name -- that's easier), and know about the registry.
      Currently, the names of the "types" are the top-level keys to all the hashes, and that is what I pass as a parameter. If I combined all the data hashes per comment #1, I could pass $data{$type} instead. This would limit the portion of %data that is accessible by the functions to only the entries under the specified $type key, and it would eliminate the need to pass either $type or a ref to %data (although the latter would not be necessary if it were a global, as you point out).

      3. You need generic functions for manipulating the registry too. You'd mentioned adding a new type;
      I wasn't very clear about this in the OP - sorry. The different "types" (top-level keys) actually represent different data files, and the functions are used to manipulate and search through them. It is possible that new data files could be added to the list, but that would be a very rare occurrence, and I assumed it would be easier just to add some entries to the %data hash(es) rather than develop a set of functions to do it for me. You raise a good point, though - I'll give that some thought.

      4. This suggests a class (though you don't have to go OO all the way if you don't want to).
      I started thinking about using objects (and their advantages), but I'm still very new at OO programming and I haven't totally thought out how I could/would/should structure things. Nonetheless, I do like the idea of having the guts encapsulated behind a defined interface. I think I'll have to spend more time considering this approach.

      Thanks again for the comments. It's nice to have a few different ideas to consider.

        If you go the OO way, you (usually) *wouldn't* write your functions to get a type by name. Rather, your registry would simply become a very simple (conceptually speaking) map from names to objects; if the calling code already has the reference to the object it is working on, it simply invokes what is now a proper method on that object. The registry gives you: (a) a convenient way to get the object if you only have its name, and (b) a reference to the object that lives as long as the program does.

        If separating different parts of each object (/ different aspects of each type) looks like a concern to you, then perhaps you need to deepen your object model (that's what you do in OO in these cases). So you say each "type" is really a disk file? Then you want a relatively thin File class that is composed of further objects: FileLocation (directory, text, what you had in your first function) and FileURL (the things you had in your second function). I don't really know if my names make sense: you're the one who knows the business logic -- why you have regexps and urls treated together, for example.
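
        Very roughly, something like this (a sketch only; the class names are the ones suggested above, and the constructors are deliberately minimal):

        use strict;
        use warnings;

        package FileLocation;    # directory + text fields from the first function
        sub new { my ( $class, %args ) = @_; return bless { %args }, $class }
        sub dir { $_[0]->{dir} }

        package FileURL;         # url + regex from the second function
        sub new { my ( $class, %args ) = @_; return bless { %args }, $class }
        sub url   { $_[0]->{url} }
        sub regex { $_[0]->{regex} }

        package File;            # relatively thin class composed of the two
        sub new {
            my ( $class, %args ) = @_;
            return bless {
                location => FileLocation->new( %{ $args{location} } ),
                url      => FileURL->new( %{ $args{url} } ),
            }, $class;
        }
        sub location { $_[0]->{location} }
        sub url_info { $_[0]->{url} }

        package main;

        my $file = File->new(
            location => { dir => 'a', text1 => 'text-a1', text2 => 'text-a2' },
            url      => { url => 'a', regex => '\.(\d{8})\.' },
        );
        print $file->location->dir, "\n";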

Re: Structuring code
by dimar (Curate) on Jan 06, 2005 at 09:39 UTC
    How should I structure a series of relatively short functions that each require a handful of parameters, without having to pass all the parameters around? Was I better off before refactoring? Is OO the way to go here?

    It would take knowing more specifics to address these, so here are just some basic observations. If there is no requirement that keeps you from using a plain-old, run-of-the-mill table to structure your data and your code, then that will suffice (90%+ of the time). Some of the most over-engineered code out there comes from people doing very exotic and fancy things when all they really needed was a simple table with some operations on rows. Tables are simple, and everyone can understand them just by looking at them. Moreover, nearly any conceivable data structure can be expressed with them; that's why relational databases have been around for so long. (Note: this is not a suggestion to use a database, just a comment on structure.)

    Of course, not everything is amenable to this approach: nested and irregular data structures, a registry, and the other usual caveats apply. The point is, if all you have is a standardized series of name-value pairs that can be arranged into rows, why not keep it as simple as possible?

    ### begin_: init
    use strict;
    use warnings;
    use Data::Dumper;

    my $oDataTable = [
        { 'type' => 'type1', 'text2' => 'text-a2', 'regex' => '[a-z\\d]?',
          'url' => 'http://c', 'text1' => 'text-a1', 'dir' => 'a' },
        { 'type' => 'type2', 'text2' => 'text-b2', 'regex' => '[a-z\\d]{0,2}',
          'url' => 'http://b', 'text1' => 'text-b1', 'dir' => 'b' },
        { 'type' => 'type3', 'text2' => 'text-c2', 'regex' => '[a-z\\d]{0,3}',
          'url' => 'http://c', 'text1' => 'text-c1', 'dir' => 'c' },
    ];

    ### begin_: process the data
    MungeMyData($oDataTable);
    print Data::Dumper->Dump([$oDataTable], [qw(dataroot)]);

    ### begin_: the master subroutine
    sub MungeMyData {
        my $oData = shift || die "missing data";
        for my $row (@{$oData}) {
            if ($row->{type} eq 'type1') {
                $row->{dir}   = uc($row->{dir});
                $row->{text1} = reverse($row->{text1});
                ### munge munge munge ...
            }
            if ($row->{type} eq 'type2') {
                $row->{dir}   = lc($row->{dir});
                $row->{text1} = substr($row->{text1}, 3);
                ### munge munge munge ...
            }
            ### munge munge munge ...
            if ($row->{type} eq 'type9') {
                $row->{dir}   = lc($row->{dir});
                $row->{text1} = substr($row->{text1}, 3);
                ### munge munge munge ...
            }
        }
        @{$oData} = sort { $a->{type} cmp $b->{type} } @{$oData};
        return $oData;
    }    ### end_sub

    ### begin_: the end
    1;
    __END__

      Hi scooterm, thanks for the suggestions (and example code). You are correct - there really isn't any reason why I can't use a simple hash lookup table for this data (which is how I have it structured now, although it is a HoH instead), so perhaps I am trying to over-engineer things a bit.

      The example you gave (based on the OP) essentially follows the same process for each type:

      # determine type
      # process {dir}
      # process {text1}
      # munge munge munge...
      In my case most of the code blocks are the same, so process {dir} is the same code for all types (the only difference between them is the value of {dir}, which can be passed as a parameter). There are some instances where the code itself varies, however, which is why I have regexes as part of the data structure. More specifically, I do stuff like this:
      my @files = get_filenames( $data{$type}{dir}, $data{$type}{regex} );
      which allows me to use the same function to readdir and grep without having a nearly identical copy of it for each $type.
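
      Roughly, get_filenames looks something like this (a simplified sketch, not the exact code):

      # rough sketch of the helper: the same code for every $type, with the
      # directory and filename pattern supplied as parameters
      sub get_filenames {
          my ( $dir, $regex ) = @_;
          opendir my $dh, $dir or die "Cannot open $dir: $!";
          my @files = grep { -f "$dir/$_" && /$regex/ } readdir $dh;
          closedir $dh;
          return @files;
      }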

      I guess what it comes down to is that I'm trying to balance minimizing code redundancy against over-generalizing (and therefore over-parameterizing) things. I found myself going in circles about how to design this, so I decided it was time to consult the Monks. Sometimes simple is better than fancy. Thanks for the reality check.