Re^5: Documenting non-public OO components

by dragonchild (Archbishop)
on Sep 09, 2005 at 01:41 UTC


in reply to Re^4: Documenting non-public OO components
in thread Documenting non-public OO components

Building an inverse index is just complicated, period. Either you write functions so complex that they ought to be refactored (Kinosearch), which Dragonchild presumably disapproves of, or you factor out the functionality so that data passes through several classes and 10 or 20 methods (Lucene), which Dragonchild definitely disapproves of. Catch-22. :)

Many tasks are just plain complicated on their face. Writing an SQL generator that takes into account arbitrary schemas, arbitrary constraints, and selective denormalization, then builds the correct optimized SQL, is hard. It is correctly broken out into functional areas. One of those areas is the use of a directed acyclic graph (DAG) to represent the schema. I certainly didn't write that code (though I ended up rewriting it to suit my needs). But, that was a conceptual black-box interface. Although I know nothing about inverse indices, I'm pretty sure that, like all other CS problems, the problem decomposes quite nicely into areas that are generalizable. Anything that's generalizable is a conceptual interface that can become its own distribution.
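To make that concrete, here's a minimal sketch of what such a black-box DAG interface might look like (the package and method names are hypothetical, not the actual generator's code):

    # Hypothetical black-box DAG interface -- a sketch, not real code.
    package Schema::DAG;
    use strict;
    use warnings;
    use Carp qw(confess);

    sub new { bless { edges => {} }, shift }

    # Record that $from points at $to (e.g. a foreign key reference).
    sub add_edge {
        my ( $self, $from, $to ) = @_;
        push @{ $self->{edges}{$from} }, $to;
        return $self;
    }

    # Return nodes so every node precedes the nodes it points at.
    # A cycle means it's not a DAG, so we die.
    sub topological_sort {
        my $self = shift;
        my ( %state, @order );
        my $visit;
        $visit = sub {
            my $node = shift;
            my $s = $state{$node} || '';
            return if $s eq 'done';
            confess "cycle detected at $node" if $s eq 'active';
            $state{$node} = 'active';
            $visit->($_) for @{ $self->{edges}{$node} || [] };
            $state{$node} = 'done';
            unshift @order, $node;
        };
        $visit->($_) for sort keys %{ $self->{edges} };
        return @order;
    }

    1;

The SQL generator never needs to know the traversal details; it just asks for the tables in order.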

Yes, your data is going to pass through different distros, and that's OK. The big thing to focus on is the idea of separation of concerns. My SQL generator didn't need to know how a DAG worked, just that if you hit the red button while pulling the green lever, the blue light will come on. I suspect there's a lot of stuff on CPAN you can reuse, reducing your coding (and testing) burden.

IMO, developers who work this way are not terrorists who hate our freedom and are out to destroy our way of life. Good software which does useful stuff can be written in many ways.

Absolutely (*grins*) true on both points. However, read my signature. Good software can come from many places, but it has two very basic criteria: it works, and someone else can safely modify it. If it doesn't meet those two criteria, it isn't good software. And, frankly, that is an absolute.

Now, how do we meet these criteria? Well, the most efficient way (thus far) is TDD. How do you do TDD with a complex system? By mocking up your interfaces. You test your intermediate items by mocking up their dependencies. Then, you have some system-level tests which exercise the system as a whole without mocks, and you're 80-90% of the way there.
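As a minimal sketch of that style (every class name here is invented for illustration), testing an intermediate item against a mocked-up dependency can be as simple as:

    use strict;
    use warnings;
    use Test::More tests => 1;

    # Hypothetical unit under test: it only cares that its source
    # honors the fetch() part of the interface's spec.
    package MyApp::Summarizer;
    sub new   { my ( $class, %args ) = @_; bless {%args}, $class }
    sub total { my $t = 0; $t += $_ for $_[0]->{source}->fetch; $t }

    # Hand-rolled mock standing in for the heavyweight real source.
    package Mock::Source;
    sub new   { my ( $class, $rows ) = @_; bless { rows => $rows }, $class }
    sub fetch { @{ $_[0]->{rows} } }

    package main;

    my $mock = Mock::Source->new( [ 1, 2, 3 ] );
    my $sum  = MyApp::Summarizer->new( source => $mock );
    is( $sum->total, 6, "total() sums whatever its source hands back" );

The mock only has to honor the interface's spec, which is exactly the contract the real dependency is held to.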

How do you avoid tests for the intermediate processes which rely on the innards of what ought to be a series of black boxes? The answer is... there's no easy answer. Mock objects help. But there's going to be a fair amount of waste...

Waste? I don't see waste. Yes, you will have to keep your mock objects current with the spec of your intermediate sections. However, while specs may grow, and quickly at times, existing items should change very slowly. Otherwise, it's an incompatible interface change, which should happen either with a new major version or in alpha software. Anything else is churn that screws you up no matter what paradigm(s) you're using.

. . . the recommendations are inappropriate for my current task, which is porting an existing library not designed from the ground up according to the principles you set out.

You're rewriting the library from scratch, period. You're using a different language, and you have access to different libraries and language features. You may be preserving the same API and functionality, but it's still a complete rewrite. Porting the docs is all well and good, but you still need tests written against the spec, tests your code will first fail and then pass as you write the minimum necessary. I have no experience with Lucene, but I am 100% positive that there is cruft in that codebase. By rewriting against the spec, using the existing codebase as a reference, you most certainly can use TDD. In addition, you can probably end up with several new distros to add to CPAN that are useful beyond inverse indexing alone.

I'm doing a very similar project in JavaScript, porting the Prototype library to JSAN. Instead of just throwing it up there, I'm converting the innards to JSAN distributions, cleaning them up and renaming them. I'm also keeping a compatibility layer so that existing users of Prototype (such as Ruby-on-Rails and Catalyst) can convert over to JSAN with little change to their existing codebases while taking advantage of the better code. I suspect you will find that you can do the same.


My criteria for good software:
  1. Does it work?
  2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

Replies are listed 'Best First'.
Re^6: Documenting non-public OO components
by creamygoodness (Curate) on Sep 09, 2005 at 04:49 UTC
    Waste? I don't see waste.

    The thing I found most vexing was that mocking up objects which contained arbitrary binary data was brain-bending and time-consuming. Let's say I want to write a deserialization method. We'll follow TDD and write a failing test first.

    package Foo;
    sub get_data { $_[0]->{data} }

    package main;
    use strict;
    use warnings;
    use Test::More tests => 1;
    use Bar;

    # "\x03" is the BER compressed integer 3, the length prefix for each
    # three-byte field. (The two hex digits matter: "\x3f" would parse as
    # a single '?' byte.) The trailing 16 bytes stand in for the random
    # sentinel that Bar uses as its record separator.
    my $blackbox = bless {
        data => "\x03foo\x03bar\x03bazasdfasdfasdfasdf",
    }, 'Foo';

    my $object       = Bar->new;
    my $deserialized = [ $object->deserialize($blackbox) ];
    is_deeply( $deserialized, [qw(foo bar baz)],
        "Deserializer decodes correctly" );

    FYI, in order to write that, I had to go look up BER compressed integers and see how the byte-level algorithm worked. Let's hope I got it right.

    Here's the actual code, now that I am allowed to type it.

    sub deserialize {
        my ( $self, $blackbox ) = @_;
        # capture up to 16-byte random sentinel.
        $blackbox->get_data =~ s/
            (.*?)
            (?: $self->{record_separator} | $ )
        //xsm
            or confess("no match");
        return unpack( "(w/a)*", $1 );
    }

    Now, even if I risk displeasing the gods of TDD and cheat by typing the code I'm actually going to use before writing my test, it's still a pain to generate this intermediate data. And if I decide to experiment with another algorithm, I wind up throwing away that hard-won mock data, as it's rare that it transmutes easily. Ironically, tests like these are tightly coupled to the code they test, which makes them brittle and difficult to adapt or reuse.

    You're rewriting the library from scratch, period.

    Credit where it's due: Plucene was originally written over a year ago, as a port of Lucene 1.3. The problem is this:

    # time to index 1000 documents:
    Plucene 1.25          276 secs
    Kinosearch 0.021       88 secs
    Kinosearch 0.03_02     35 secs
    Java Lucene            13 secs

    I'm now working on a port of the current version of Lucene (essentially 1.9, not yet officially released), leveraging what I learned by reinventing the wheel with Kinosearch.

    The same problems of dealing with arbitrary binary data arise, though since this is a port and not an alpha, I won't have to continually rewrite tests as I would have had to (if I'd followed TDD) when I was writing Kinosearch. Perhaps you can suggest an alternative technique for creating the mock objects? You can't algorithmically generate this data; even if you could live with large copy and paste ops, too many dependencies are involved to pull it off.

    In addition, you can probably end up with several new distros to add to CPAN that aren't directly usable solely for reverse indexing.

    That's where Sort::External came from.

    Best,

    --
    Marvin Humphrey
    Rectangular Research ― http://www.rectangular.com
      The thing I found most vexing was that mocking up objects which contained arbitrary binary data was brain-bending and time-consuming. . . . FYI, in order to write that, I had to go look up BER compressed integers and see how the byte-level algorithm worked. Let's hope I got it right. . . . it's still a pain to generate this intermediate data. And if I decide to experiment with another algorithm, I wind up throwing away that hard-won mock data, as it's rare that it transmutes easily. Ironically, tests like these are tightly coupled to the code they test, which makes them brittle and difficult to adapt or reuse.

      Yeah, that is true. However, I think you have a decomposition opportunity here that you're not taking advantage of. I would refactor out these nasty detail sections into their own subroutine/object/whatever that provides both serialization and deserialization for the same binary format/algorithm/whatever. The tests are then applied against that specific unit of work. Then, you choose which unit of work you want to use and you know that it simply works. That avoids throwing away that hard-won mock data, as you so eloquently put it. Now, I'm not a TDD Nazi, by any means. I just think that many people throw out the baby with the bathwater when it comes to TDD.
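      For example, once both directions live in one unit, a round-trip test needs no hand-built binary fixtures at all. Here's a sketch, assuming the same "(w/a)*" record format as above (Bar::Codec is a made-up name):

      use strict;
      use warnings;
      use Test::More tests => 1;

      # Hypothetical unit owning both directions of the binary format.
      package Bar::Codec;
      sub new { bless {}, shift }

      sub serialize {
          my $self = shift;
          # 'w' packs a BER compressed integer -- the length prefix
          # that unpack "(w/a)*" expects on the other side.
          return join '', map { pack( 'w', length $_ ) . $_ } @_;
      }

      sub deserialize {
          my ( $self, $data ) = @_;
          return unpack '(w/a)*', $data;
      }

      package main;

      my $codec  = Bar::Codec->new;
      my @fields = qw(foo bar baz);
      is_deeply( [ $codec->deserialize( $codec->serialize(@fields) ) ],
          \@fields, 'round trip preserves the record' );

      Change the wire format and the test still holds; only the codec's innards move.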

      Perhaps you can suggest an alternative technique for creating the mock objects? You can't algorithmically generate this data; even if you could live with large copy and paste ops, too many dependencies are involved to pull it off.

      I think you're lost in a maze of twisty dependencies, all alike. You can't see the forest for the trees. I think you need to start building a toolbox of nasty binary bits that you can assemble without having to worry about how they got built.
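      Concretely, for the record format above, two tiny toolbox pieces mean the bytes get assembled, never hand-computed. A sketch (both helper names are invented; ber_records mirrors the serialize half of the codec above):

      use strict;
      use warnings;

      # Toolbox pieces: build length-prefixed records and a sentinel
      # from plain strings instead of hard-coding "\x03foo..." by hand.
      sub ber_records     { join '', map { pack( 'w', length $_ ) . $_ } @_ }
      sub random_sentinel { join '', map { chr int rand 256 } 1 .. 16 }

      # The hand-typed blackbox payload from the earlier test,
      # assembled instead of hand-computed:
      my $data = ber_records(qw(foo bar baz)) . random_sentinel();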

      Frankly, I think you're worried about performance too much, too early. You're coupling things together that may not need to be coupled, in order to squeeze out performance that may not need squeezing from where you're looking. I'd build the system in its ideal form, then profile it. The slow spots are never where you think they are.
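      And when profiling time comes, the core Benchmark module makes the measurement cheap. A sketch weighing the decomposed unpack against a hand-fused loop (both bodies are illustrative, not taken from either codebase):

      use strict;
      use warnings;
      use Benchmark qw(cmpthese);

      my $data = join '', map { pack( 'w', length $_ ) . $_ } qw(foo bar baz);

      # Run each candidate for at least 2 CPU seconds and compare rates.
      cmpthese( -2, {
          decomposed => sub { my @r = unpack '(w/a)*', $data },
          fused      => sub {
              my @r;
              my $copy = $data;
              while ( length $copy ) {
                  # BER ints under 128 fit in one byte; fine for this sketch.
                  my $len = unpack 'w', $copy;
                  push @r, substr( $copy, 1, $len );
                  substr( $copy, 0, 1 + $len ) = '';
              }
          },
      } );

      If the fused loop doesn't win by a meaningful margin, the coupling was never earning its keep.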


      My criteria for good software:
      1. Does it work?
      2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
