Interactive scripting with debugger

My sampling may be very skewed but I am surprised at how few Perl programmers in my direct acquaintance take advantage of the Perl debugger as a coding (as opposed to debugging) tool. In combination with Emacs's shell mode it enables me to write my scripts interactively. Maybe there are better alternatives to what I illustrate below (and I am eager to learn of them), but if not, I hope at least some of you will find the following technique useful.

NB: the Emacs stuff below is not essential to the technique I want to illustrate here, just convenient; all the interaction with the Perl debugger can be done directly by invoking it from a regular shell instead of an Emacs shell. On the other hand, I have never run the Perl debugger on Windows, so I can't say how much of what I illustrate below applies there.

For example, suppose that I want to write a script to munge a large-ish text file, mongo.tab, whose structure and constraints are not entirely clear to me. I know that its first line contains headers and that the fields on each line are separated by tabs, but I still have questions such as: Are the entries in the first column unique? Does this or that regular expression capture all the rows I am interested in? Are there empty cells, and if so, what fraction of all the cells are they?

So I start by writing the first part of the script:

use strict;
use warnings;

chomp( my @lines = do { local @ARGV = 'mongo.tab'; <> } );
my @headers = split /\t/, shift @lines;
my @records = map [ split /\t/ ], @lines;
1;

This gets me to the point where all the lines have been reduced to records of fields, and I'm ready to do some exploring. (The last line, consisting of only "1;" is a "breakpoint hook" for the debugger, as I'll show in a minute.)
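To make the resulting data structure concrete, here is a minimal, self-contained sketch of the same parsing steps run on an in-memory sample instead of mongo.tab. The sample data and column names are made up for illustration. One deliberate difference from the script above: the sketch passes -1 as the limit to split so that trailing empty fields are kept. Without a limit, split drops them, which matters if you later want to count empty cells.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for mongo.tab: a header line plus two rows.
my $sample = "id\tname\tnote\nA1\tfoo\t\nA2\t\tbar\n";

# Read from an in-memory filehandle instead of a real file.
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
chomp( my @lines = <$fh> );
close $fh;

my @headers = split /\t/, shift @lines;

# A limit of -1 keeps trailing empty fields (the default would drop them).
my @records = map [ split /\t/, $_, -1 ], @lines;

printf "%d columns, %d rows\n", scalar @headers, scalar @records;
printf "fields in first row: %d\n", scalar @{ $records[0] };
```

Each element of @records is an array reference, so a single cell is addressed as $records[ROW][COLUMN].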

Then, right from within my editor, Emacs, I split the window into top and bottom halves (C-x 2), switch to the lower one (C-x o), and start a shell interaction buffer (M-x shell); the short script listed above remains in the top half. Finally I fire up the Perl debugger, right in the Emacs shell buffer, giving it my newborn script as fodder:

% perl -d munge.pl
Loading DB routines from perl5db.pl version 1.23
Editor support available.

Enter h or `h h' for help, or `man perldebug' for more help.

main::(munge.pl:4):    chomp( my @lines = do { local @ARGV = 'mongo.tab'; <> } );
  DB<1>

The debugger (affectionately known as DB) shows the first executable line of my script and waits for my instructions. In this case I am not interested in debugging my code (I know it's flawless :-) ); I just want to get to the point where I can use Perl interactively to explore the nature of the data I'm dealing with. Therefore, I use the command "c 7" (c being short for "continue", here followed by a line number) to tell DB to go ahead and let the script execute, but stop it at line 7, where I had previously placed the "breakpoint hook" I mentioned earlier. This is basically a "no-op" executable line where DB can stop my script after all the lines I am interested in have executed.

After a few seconds of digestion, DB tells me where the script has been stopped, and gives me another prompt:

  DB<1> c 7
main::(munge.pl:7):    1;
  DB<2>
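
If counting line numbers becomes tedious as the script grows, one alternative is to set the debugger variable $DB::single inside the script itself. As described in perldebug, setting it to 1 makes the debugger stop before the next statement, so a plain c with no line number should stop there. The sketch below uses made-up records in place of the parsed file; when the script runs without -d, the assignment has no effect.

```perl
use strict;
use warnings;

# Made-up records standing in for the parsed file.
my @records = ( [ 'A1', 'foo' ], [ 'A2', 'bar' ] );

{
    # Avoid a "used only once" warning when the script runs without -d.
    no warnings 'once';

    # Under perl -d, the debugger stops before the next statement.
    $DB::single = 1;
}

my $rows = scalar @records;    # execution pauses here under perl -d
print "rows: $rows\n";
```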

OK, time to find out what we got. First, how many rows and columns do we have? To do this I use the p command (short for print I suppose) to print out the sizes of @records and @headers, for the numbers of rows and columns, respectively:

  DB<2> p scalar @records
16215
  DB<3> p scalar @headers
118
  DB<4>

OK, about 16K rows and about 100 columns. Let's see if the entries in the first field are unique. To do this I use the entries in this field as keys for a hash, %h, and check whether the number of keys in this hash is equal to the number of records:

  DB<4> $h{ $_ }++ for map $_->[ 0 ], @records
  DB<5> p scalar keys %h
16215
  DB<6>

The number matches the number of rows we got earlier, meaning that the entries in the first field are indeed unique, otherwise the number of keys in %h would have been smaller than the number of rows (or equivalently the number of records in @records).
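
Once a check like this has proved useful at the DB prompt, it can be folded back into the script. Here is a minimal sketch of the same hash-counting idea as a subroutine; the name distinct_count and the sample records are my own inventions for illustration, not part of the original script.

```perl
use strict;
use warnings;

# Return the number of distinct values in column $col (0-based) of @$records.
sub distinct_count {
    my ( $records, $col ) = @_;
    my %seen;

    # Treat a missing field as the empty string.
    $seen{ defined $_->[$col] ? $_->[$col] : '' }++ for @$records;
    return scalar keys %seen;
}

# Made-up records: the first column is unique, the second is not.
my @records = ( [ 'A1', 'x' ], [ 'A2', 'x' ], [ 'A3', 'y' ] );

for my $col ( 0, 1 ) {
    my $n = distinct_count( \@records, $col );
    printf "column %d: %d distinct of %d rows (%s)\n",
        $col, $n, scalar @records,
        $n == @records ? 'unique' : 'not unique';
}
```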

Now let's see if the entries in the second field are unique (I happen to know that it is supposed to be a "near synonym" of the first field). We repeat the same trick, which is made easier by the fact that my DB has readline and history enabled: I can step back through my interaction history to the next-to-last line and edit it just as I would edit any other line inside an Emacs buffer (with readline and history enabled, this is possible from any shell, not just Emacs). To step back through the history I use M-p. (Had I started the DB session from a regular Unix shell such as bash, tcsh, or zsh, I would use C-p instead, but that key combination has a different meaning inside an Emacs buffer, which is the context of the current interaction.) OK, so I make some minor changes to my previous line to test the uniqueness of entries in the second field:

  DB<6> $h2{ $_ }++ for map $_->[ 1 ], @records
  DB<7> p scalar keys %h2
9027

Aha! The entries in the second field are not unique. Let's find out which entry appears most often in the second column. I'll sort the keys of %h2 descendingly by the corresponding values (which record the numbers of times each key was encountered when the hash was initialized):

  DB<8> p ( sort { $h2{ $b } <=> $h2{ $a } } keys %h2 )[ 0 ]

  DB<9>

Huh? Nothing? It looks like the most common key may be the empty string; let's check:

  DB<10> p $h2{''}
2904
  DB<11>

OK, so we have almost 3K empty entries in the second column. That's not terribly interesting; what about the second most common entry in the second column? Same trick: I sort descendingly, but this time I pick out the item that comes up in second place:

  DB<12> p ( sort { $h2{ $b } <=> $h2{ $a } } keys %h2 )[ 1 ]
"BBX "

Hey, waitaminnit! What's that space doing there, after 'BBX'? There is supposed to be no leading or trailing whitespace in any of these entries. It looks like someone goofed at the time of generating the file. No matter, we have to deal with it.
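
Both surprises so far (an invisible empty string and a stray trailing space) are easier to spot when every value is printed inside quotes. Here is a minimal sketch of the frequency ranking used in the sort commands above, wrapped in a subroutine that returns value/count pairs. The name ranked_values, the alphabetical tie-break, and the sample records are assumptions of mine for illustration.

```perl
use strict;
use warnings;

# Return [value, count] pairs for column $col, most frequent first.
# Ties are broken alphabetically so that the order is deterministic.
sub ranked_values {
    my ( $records, $col ) = @_;
    my %count;
    $count{ defined $_->[$col] ? $_->[$col] : '' }++ for @$records;
    return map { [ $_, $count{$_} ] }
        sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;
}

# Made-up records; note the empty strings and the trailing space in 'BBX '.
my @records = map { [ 'id', $_ ] } ( '', '', '', 'BBX ', 'BBX ', 'QRS' );

# Quote each value so that '' and trailing whitespace are visible.
printf "'%s' => %d\n", @$_ for ranked_values( \@records, 1 );
```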

So I switch back to the top half of my editor window and fix the regexp used for splitting the records into fields:

my $re = qr/ *\t */;
my @headers = split /$re/, shift @lines;
my @records = map [ split /$re/ ], @lines;

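I define a regexp object $re that I can use in both splits. Note that I don't define it as /\s*\t\s*/, because \s also matches a tab: a run of adjacent tabs (that is, one or more empty fields) would be swallowed as a single separator, giving me incorrect splitting whenever a line contained empty fields.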
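
The difference between the two patterns is easy to see on a single line. The sketch below uses a made-up line with a space-padded first field and an empty second field, and splits it both ways.

```perl
use strict;
use warnings;

# Made-up line: padded first field, empty second field, then a third field.
my $line = "BBX \t\tfoo";

my @narrow = split / *\t */,   $line;    # only spaces around each tab
my @greedy = split /\s*\t\s*/, $line;    # \s also matches the tab itself

print scalar(@narrow), " fields: ", join( '|', @narrow ), "\n";
print scalar(@greedy), " fields: ", join( '|', @greedy ), "\n";
```

The first split yields three fields (BBX, an empty field, foo); the second yields only two, because both tabs are consumed as one separator.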

With this change in place, I re-parse the file by restarting the script with the R command (short for Restart):

  DB<13> R
Warning: some settings and command-line options may be lost!

Loading DB routines from perl5db.pl version 1.23
Editor support available.

Enter h or `h h' for help, or `man perldebug' for more help.

main::(munge.pl:4):    chomp( my @lines = do { local @ARGV = 'mongo.tab'; <> } );
  DB<12>

...and I am ready for some more exploring.
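
One of the questions I started with, what fraction of all the cells are empty, can be explored in the same style. Here is a minimal sketch, assuming records shaped like @records above; the subroutine name empty_fraction and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Return the fraction of empty cells among all cells in @$records.
sub empty_fraction {
    my ($records) = @_;
    my ( $empty, $total ) = ( 0, 0 );
    for my $row (@$records) {
        $total += @$row;
        $empty += grep { !defined($_) || $_ eq '' } @$row;
    }
    return $total ? $empty / $total : 0;
}

# Made-up records: 2 empty cells out of 6.
my @records = ( [ 'A1', '', 'x' ], [ 'A2', 'y', '' ] );
printf "%.1f%% of cells are empty\n", 100 * empty_fraction( \@records );
```

Keep in mind that split without a limit drops trailing empty fields, so for rows parsed that way the result would be a lower bound.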

I hope the above example gives you an idea of the power of using DB as what amounts to a "Perl shell". I've only scratched the surface, having illustrated just three commands (c, p, and R), but this meditation is already getting a bit too long, so I'd better stop. For more info on DB see perldebug.

the lowliest monk

Janitored by holli - retitled from Interactive scripting with DB (1/13/0)
