periapt has asked for the wisdom of the Perl Monks concerning the following question:

OK, so here is a problem I am working on. I have an ordered list of statute numbers, along with charge level, charge degree and criminal category. The list is ordered by statute. I am looking to find the smallest grouping of a set of charges that will uniquely identify the set. For example
__DATA__ 0024.118.3A F T TFF 0024.118.3B F T TFF 0024.118.3C F T TFF 0024.118.3D F T TFF 0024.118.4 F F TFF should (and does) produce the following groups 0024FT containing 0024.118.3A,B,C,D TFF 0024FF containing 0024.118.4 TFF (coincidently) the degree causes the change in grouping
The problem I'm having is in grouping when the statute has different degrees and/or categories. For example

Update: cleaned up code to make it more readable and configured it to run as a complete example

Update: Thanks to all for your comments. I have a working solution now. Go Monks ...

__DATA__ 0039.04 N N OTH 0039.205.2 F T OTH 0039.205.6 F T TFF should produce 0039NN containing 0039.04 OTH 0039.205.2FT containing 0039.205.2 OTH 0039.205.6FT containing 0039.205.6 TFF where the change in casetype changes the grouping but instead the code produces 0039NN containing 0039.04 OTH 0039.205.2FT containing 0039.205.2 OTH 0039FT containing 0039.205.6 TFF
This is no good as it implies an overlap between the two statute/casetype pairs when there isn't.

I need to somehow refer back to previous groups but only within a specific subset of the statute groups.

Update:
The grouping criteria is fairly straightforward. Statutes are not always reported exactly as they are written in the books. Since I am only interested in the criminal case types and not the statutes themselves, I would like to find the smalles portion of the statute that I can use to determine the case type. For example, all statutes that begin with 0024 and are third degree felonies can be placed in the TFF case type. There is a lower limit on the group size in that it must contain at least the first four characters (the chapter) of the statute being classified and may contain all of the characters. So, for my first example, 0024 is the smallest group I can find. On the other hand, for statutes in chapter 0039, there is no way to categorize the statute without using the complete statute value ie 0039.205.2 => OTH or 0039.205.6 => TFF.

Here is the code I have so far. Any suggestions will be most appreciated.

note: code rewritten so that it should run a complete example (and be clearer to read)
#! usr/bin/perl # Compiler directives and Includes use strict; use warnings; use diagnostics; use Data::Dumper; # Global declarations; # Constants and pragmas use constant MINSTATLEN03 => scalar 4; #main() { my $db01 = ''; my %statgrp = (); my $table = 'pgmsvcs.statgrp_ld'; my $rptfh = ''; my $rtncd = 0; $rtncd = FindStatGrps(\%statgrp, $table); exit; } #end main() sub FindStatGrps{ my $statgrp = shift; my $table = shift; my $rtncd = 0; my @statlst01 = (); push @statlst01, [ split /\s+/ ] while(<DATA>); my %grplst = (); my %placedstats = (); STATUTE: foreach my $idx (0..$#statlst01){ my $statute = $statlst01[$idx][0].$statlst01[$idx][1]; next STATUTE if(exists($placedstats{$statute})); # skip if statute # already classifie +d GROUP: for my $i (MINSTATLEN03..length($statlst01[$idx][0])){ next if substr($statute,$i-1,1) eq '.'; # skip if subgrp ends on # a subfield seperator (.) my $grpformatch = substr($statute,0,$i); my $levdeg = $statlst01[$idx][1]; # initialize category bins my %srsgrp = (CM=>[],NCM=>[] ,SO=>[],ROB=>[],OTH=>[],BURG=>[], TFF=>[],WC=>[],PROP=>[],DRG=>[],MISD=>[],OTH=>[], DNC=>[]); my $j = $idx; do{ # load indiv stats into respective bins # does indiv stat belong to group? if($statlst01[$j]->[0] =~ /^$grpformatch/){ if($statlst01[$j]->[1] eq $levdeg){ push @{$srsgrp{$statlst01[$j]->[2]}}, $statlst01[$j]; } }else{ # if statute does not match, stop looking $j = @statlst01; # since list is ordered } }while(++$j < @statlst01); # check if more than one bin is occupied my $nrgrps = 0; (scalar @{$srsgrp{$_}} > 0) && $nrgrps++ foreach (keys %srsgrp); if($nrgrps > 1){ # 2 or more bins are occupied next GROUP; # try inclrease group and try again } elsif($nrgrps == 1){ # only 1 bin occupied -- good my $srscat = undef; # determin bin name ($srscat || ((scalar @{$srsgrp{$_}} > 0) && ($srscat = $_))) foreach (keys %srsgrp); + my $grp = $grpformatch.$levdeg; # define group key $statgrp->{$grp} = []; foreach (@{$srsgrp{$srscat}}){ # save indiv statute data push @{$statgrp->{$grp}}, $_; # & updt list of already $placedstats{$_->[0].$_->[1]} = 1; # classified statutes } next STATUTE; # group found, go to next statute } else{ # zerp bins are occupied -- error die "error!!"; # at least one bin should be occupied } } } print Dumper($statgrp); return $rtncd; } #end FindStatGrps() __DATA__ 0024.118.3A FT TFF 0024.118.3B FT TFF 0024.118.3C FT TFF 0024.118.3D FT TFF 0024.118.4 FF TFF 0039.04 NN OTH 0039.205.2 FT OTH 0039.205.6 FT TFF 409.176.12A FT TFF 409.176.12B FT TFF 409.176.12C FT TFF 409.176.12D FT OTH 409.176.12E FT OTH

PJ
use strict; use warnings; use diagnostics; (if needed)

Replies are listed 'Best First'.
Re: Creating Minimal Subgroups in a List of Characters
by johnnywang (Priest) on Sep 01, 2004 at 20:33 UTC
    This is very difficult to read, I'd suggest you strip the part that is not relevant (e.g., database query, etc.) and add those that is not complete (e.g., the function CvrtStatuteToDot), and make a script that people can run, which uses your example data as input, and produces your example output.
Re: Creating Minimal Subgroups in a List of Characters
by Roy Johnson (Monsignor) on Sep 02, 2004 at 18:52 UTC
    Does this do what you want?
    #!perl use strict; # First, group them by the first part of the numeric field # and the other two columns # then find how much of the rest of the numeric field is needed # to distinguish the variations in the third column my %grouping; while(<DATA>) { my ($n, $n2, $c2, $c3) = /(\d+)\.(\S+)\s+(\w+)\s+(\w+)/; $grouping{$n}{$c2}{$c3}{$n2} = $_; } for my $n (sort {$a <=> $b} keys %grouping) { my $N = $grouping{$n}; for my $c2 (sort keys %$N) { my $C2 = $N->{$c2}; if (keys %$C2 > 1) { # print "$n$c2 is a complex group, containing:\n"; for my $c3 (sort keys %$C2) { my $C3 = $C2->{$c3}; # print "..(for c3=$c3):\n"; for my $n (sort keys %$C3) { print "$C3->{$n}"; } } } else { print "$n$c2 contains:\n"; my $C3 = $C2->{(keys %$C2)[0]}; for my $n (sort keys %$C3) { print " $C3->{$n}"; } } } } __DATA__ 0024.118.3A FT TFF 0024.118.3B FT TFF 0024.118.3C FT TFF 0024.118.3D FT TFF 0024.118.4 FF TFF 0039.04 NN OTH 0039.205.2 FT OTH 0039.205.6 FT TFF 409.176.12A FT TFF 409.176.12B FT TFF 409.176.12C FT TFF 409.176.12D FT OTH 409.176.12E FT OTH

    Caution: Contents may have been coded under pressure.

      This is great. It certainly seems to work. I'll have to expand it out to handle the full problem but it looks like a good solution.

      I actually coded up a different solution using hashes that works well. If your interested, you can look at the production code on my scratchpad here.

      Thanks for your help

      PJ
      use strict; use warnings; use diagnostics; (if needed)
Re: Creating Minimal Subgroups in a List of Characters
by davidj (Priest) on Sep 02, 2004 at 01:31 UTC
    In addition to what johnnywang said, you also need to explain the grouping criteria. From your sample, it is not readily discernable.

    davidj

Re: Creating Minimal Subgroups in a List of Characters
by johnnywang (Priest) on Sep 02, 2004 at 17:58 UTC
    That is better, but:
    • The program itself still doesn't run, I removed the main() section, and just opened a file with filehandle DATA, it runs with some errors. So some more clean up is needed.
    • Your explanation helps, can you give the desired output for that input data? and what output you're getting from the $statgrp (i.e, how you want to extract the output from $stargrp?) I'm a little unsure what you expect for the 409... case (Is the fifth row's 'FF' supposed to be 'FT'?)
    • Just a remark, while it's interesting to see a real problem in criminal justice, please don't assume we know the terminologies, which is usually the blocking issue for the rest of us. For example, I assume the first column is "statue", the second column is "degree", and third column is "case type". You want a shortest statue and degree combination that determines a case type. Am I right?
      Thanks for your comments. Sometimes when you've been working a problem closely, it is hard to see what isn't obvious to others. It is tough walking a line between being too technical and too vaugue. How do you simplify the issues without losing some essential element particularly on the edge cases. You're statement of the problem is a good one, however, "How to find the shortest statute and degree combination that determines the case type" is short, pithy and to the point. A lot better than my minimal subgroups ... title eh ;o)

      It was actually while cleaning and simplifying the code that I hit upon the algorithm I utimately implemented. Thanks for helping me get focused.

      PJ
      use strict; use warnings; use diagnostics; (if needed)
Re: Creating Minimal Subgroups in a List of Characters
by periapt (Hermit) on Sep 02, 2004 at 13:19 UTC
    Thanks for the suggestions. You're right it was quite hard to read. Hopefully, I've updated the code to make it clearer. It should run as a standalone now. I tried to simplify it further but the results started to change so I guess this is as reduced as I can make it.

    Thanks again

    PJ
    use strict; use warnings; use diagnostics; (if needed)