Creating Minimal Subgroups in a List of Characters

periapt has asked for the wisdom of the Perl Monks concerning the following question:

OK, so here is a problem I am working on. I have an ordered list of statute numbers, along with charge level, charge degree and criminal category. The list is ordered by statute. I am looking to find the smallest grouping of a set of charges that will uniquely identify the set. For example

__DATA__
0024.118.3A F T TFF     
0024.118.3B F T TFF
0024.118.3C F T TFF
0024.118.3D F T TFF
0024.118.4  F F TFF

should (and does) produce the following groups

0024FT containing 0024.118.3A,B,C,D  TFF
0024FF containing 0024.118.4         TFF (coincidently)
the degree causes the change in grouping
[download]

The problem I'm having is in grouping when the statute has different degrees and/or categories. For example

Update: cleaned up code to make it more readable and configured it to run as a complete example

Update: Thanks to all for your comments. I have a working solution now. Go Monks ...

__DATA__
0039.04    N N OTH
0039.205.2 F T OTH
0039.205.6 F T TFF

should produce

0039NN       containing 0039.04    OTH
0039.205.2FT containing 0039.205.2 OTH
0039.205.6FT containing 0039.205.6 TFF
where the change in casetype changes the grouping

but instead the code produces

0039NN       containing 0039.04    OTH
0039.205.2FT containing 0039.205.2 OTH
0039FT       containing 0039.205.6 TFF
[download]

This is no good as it implies an overlap between the two statute/casetype pairs when there isn't.

I need to somehow refer back to previous groups but only within a specific subset of the statute groups.

Update:
The grouping criteria is fairly straightforward. Statutes are not always reported exactly as they are written in the books. Since I am only interested in the criminal case types and not the statutes themselves, I would like to find the smalles portion of the statute that I can use to determine the case type. For example, all statutes that begin with 0024 and are third degree felonies can be placed in the TFF case type. There is a lower limit on the group size in that it must contain at least the first four characters (the chapter) of the statute being classified and may contain all of the characters. So, for my first example, 0024 is the smallest group I can find. On the other hand, for statutes in chapter 0039, there is no way to categorize the statute without using the complete statute value ie 0039.205.2 => OTH or 0039.205.6 => TFF.

Here is the code I have so far. Any suggestions will be most appreciated.

note: code rewritten so that it should run a complete example (and be clearer to read)

#! usr/bin/perl
# Compiler directives and Includes
use strict;
use warnings;
use diagnostics;
use Data::Dumper;

# Global declarations;

# Constants and pragmas
use constant MINSTATLEN03 => scalar 4;

#main()
{
    my $db01    = '';
    my %statgrp = ();
    my $table   = 'pgmsvcs.statgrp_ld';
    my $rptfh   = '';
    my $rtncd   = 0;
    
    $rtncd = FindStatGrps(\%statgrp, $table);
    exit;
} #end main()

sub FindStatGrps{
my $statgrp = shift;
my $table   = shift;

my $rtncd     = 0;
my @statlst01 = ();

push @statlst01, [ split /\s+/ ] while(<DATA>);

my %grplst = ();
my %placedstats = ();

STATUTE: foreach my $idx (0..$#statlst01){

  my $statute = $statlst01[$idx][0].$statlst01[$idx][1];
  
  next STATUTE if(exists($placedstats{$statute})); # skip if statute
                                                   # already classifie
+d

  GROUP: for my $i (MINSTATLEN03..length($statlst01[$idx][0])){
  
    next if substr($statute,$i-1,1) eq '.'; # skip if subgrp ends on
                                            # a subfield seperator (.)
    my $grpformatch = substr($statute,0,$i);
    my $levdeg = $statlst01[$idx][1];

                                            # initialize category bins
    my %srsgrp = (CM=>[],NCM=>[] ,SO=>[],ROB=>[],OTH=>[],BURG=>[],
                  TFF=>[],WC=>[],PROP=>[],DRG=>[],MISD=>[],OTH=>[],
                  DNC=>[]);

    my $j = $idx;
    do{                     # load indiv stats into respective bins
                            # does indiv stat belong to group?
      if($statlst01[$j]->[0] =~ /^$grpformatch/){
        if($statlst01[$j]->[1] eq $levdeg){
          push @{$srsgrp{$statlst01[$j]->[2]}}, $statlst01[$j];
        }
      }else{                # if statute does not match, stop looking
          $j = @statlst01;  # since list is ordered
      }
    }while(++$j < @statlst01);
                                                        
                            # check if more than one bin is occupied
    my $nrgrps = 0;
    (scalar @{$srsgrp{$_}} > 0) && $nrgrps++ foreach (keys %srsgrp);

    if($nrgrps > 1){        # 2 or more bins are occupied
      next GROUP;           # try inclrease group and try again
    }
    elsif($nrgrps == 1){    # only 1 bin occupied -- good
      my $srscat = undef;                    # determin bin name
      ($srscat || ((scalar @{$srsgrp{$_}} > 0) && ($srscat = $_))) 
                                               foreach (keys %srsgrp);
+ 

      my $grp = $grpformatch.$levdeg;        # define group key
      $statgrp->{$grp} = [];

      foreach (@{$srsgrp{$srscat}}){       # save indiv statute data
        push @{$statgrp->{$grp}}, $_;      # & updt list of already
        $placedstats{$_->[0].$_->[1]} = 1; # classified statutes
      }
      next STATUTE;         # group found, go to next statute
    }
    else{                   # zerp bins are occupied -- error
      die "error!!";        # at least one bin should be occupied
    }
  }
}
print Dumper($statgrp);
return $rtncd;
} #end FindStatGrps()

__DATA__
0024.118.3A FT TFF
0024.118.3B FT TFF
0024.118.3C FT TFF
0024.118.3D FT TFF
0024.118.4  FF TFF
0039.04     NN OTH
0039.205.2  FT OTH
0039.205.6  FT TFF
409.176.12A FT TFF
409.176.12B FT TFF
409.176.12C FT TFF
409.176.12D FT OTH
409.176.12E FT OTH
[download]

PJ
use strict; use warnings; use diagnostics; (if needed)

Comment on Creating Minimal Subgroups in a List of Characters Select or Download Code

Replies are listed 'Best First'.
Re: Creating Minimal Subgroups in a List of Characters by johnnywang (Priest) on Sep 01, 2004 at 20:33 UTC
This is very difficult to read, I'd suggest you strip the part that is not relevant (e.g., database query, etc.) and add those that is not complete (e.g., the function CvrtStatuteToDot), and make a script that people can run, which uses your example data as input, and produces your example output.	[reply]
Re: Creating Minimal Subgroups in a List of Characters by Roy Johnson (Monsignor) on Sep 02, 2004 at 18:52 UTC
Does this do what you want? #!perl use strict; # First, group them by the first part of the numeric field # and the other two columns # then find how much of the rest of the numeric field is needed # to distinguish the variations in the third column my %grouping; while(<DATA>) { my ($n, $n2, $c2, $c3) = /(\d+)\.(\S+)\s+(\w+)\s+(\w+)/; $grouping{$n}{$c2}{$c3}{$n2} = $_; } for my $n (sort {$a <=> $b} keys %grouping) { my $N = $grouping{$n}; for my $c2 (sort keys %$N) { my $C2 = $N->{$c2}; if (keys %$C2 > 1) { # print "$n$c2 is a complex group, containing:\n"; for my $c3 (sort keys %$C2) { my $C3 = $C2->{$c3}; # print "..(for c3=$c3):\n"; for my $n (sort keys %$C3) { print "$C3->{$n}"; } } } else { print "$n$c2 contains:\n"; my $C3 = $C2->{(keys %$C2)[0]}; for my $n (sort keys %$C3) { print " $C3->{$n}"; } } } } __DATA__ 0024.118.3A FT TFF 0024.118.3B FT TFF 0024.118.3C FT TFF 0024.118.3D FT TFF 0024.118.4 FF TFF 0039.04 NN OTH 0039.205.2 FT OTH 0039.205.6 FT TFF 409.176.12A FT TFF 409.176.12B FT TFF 409.176.12C FT TFF 409.176.12D FT OTH 409.176.12E FT OTH [download] Caution: Contents may have been coded under pressure.	[reply] [d/l]
Re^2: Creating Minimal Subgroups in a List of Characters by periapt (Hermit) on Sep 03, 2004 at 12:29 UTC
This is great. It certainly seems to work. I'll have to expand it out to handle the full problem but it looks like a good solution. I actually coded up a different solution using hashes that works well. If your interested, you can look at the production code on my scratchpad here. Thanks for your help PJ use strict; use warnings; use diagnostics; (if needed)	[reply]
Re: Creating Minimal Subgroups in a List of Characters by davidj (Priest) on Sep 02, 2004 at 01:31 UTC
In addition to what johnnywang said, you also need to explain the grouping criteria. From your sample, it is not readily discernable. davidj	[reply]
Re: Creating Minimal Subgroups in a List of Characters by johnnywang (Priest) on Sep 02, 2004 at 17:58 UTC
That is better, but: The program itself still doesn't run, I removed the main() section, and just opened a file with filehandle DATA, it runs with some errors. So some more clean up is needed. Your explanation helps, can you give the desired output for that input data? and what output you're getting from the $statgrp (i.e, how you want to extract the output from $stargrp?) I'm a little unsure what you expect for the 409... case (Is the fifth row's 'FF' supposed to be 'FT'?) Just a remark, while it's interesting to see a real problem in criminal justice, please don't assume we know the terminologies, which is usually the blocking issue for the rest of us. For example, I assume the first column is "statue", the second column is "degree", and third column is "case type". You want a shortest statue and degree combination that determines a case type. Am I right?	[reply]
Re^2: Creating Minimal Subgroups in a List of Characters by periapt (Hermit) on Sep 03, 2004 at 12:39 UTC
Thanks for your comments. Sometimes when you've been working a problem closely, it is hard to see what isn't obvious to others. It is tough walking a line between being too technical and too vaugue. How do you simplify the issues without losing some essential element particularly on the edge cases. You're statement of the problem is a good one, however, "How to find the shortest statute and degree combination that determines the case type" is short, pithy and to the point. A lot better than my minimal subgroups ... title eh ;o) It was actually while cleaning and simplifying the code that I hit upon the algorithm I utimately implemented. Thanks for helping me get focused. PJ use strict; use warnings; use diagnostics; (if needed)	[reply]
Re: Creating Minimal Subgroups in a List of Characters by periapt (Hermit) on Sep 02, 2004 at 13:19 UTC
Thanks for the suggestions. You're right it was quite hard to read. Hopefully, I've updated the code to make it clearer. It should run as a standalone now. I tried to simplify it further but the results started to change so I guess this is as reduced as I can make it. Thanks again PJ use strict; use warnings; use diagnostics; (if needed)	[reply]