grouping lines of data together for later use

jonnyfolk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a data file in which holds information on different items accorded to individual users. A user might have more than one item (each one being unique), in which case he is accorded a new line for each item.

Users with multiple items are treated differently to those with just the one and so I am trying to group together the lines for each user and then trat them on a case by case basis.

I have written the following code which does seem to do the trick, but I would like advice and comments from my fellow monks on how I could do it better (and to ensure that I'm not fluking the results and thus saving up problems for the future.

I have commented the code to try to explain my thought process (just in case you thought there wasn't one!! )

Any help given would be much appreciated.

open PAGE, "$website" or die "Cant open $website: $!";
flock (PAGE, 1) or die "Can't lock website file for reading";
    
while (my $line = (<PAGE>))  {
($Prop,$colour,$txtcol,$url360,$user,$your_name,$address,$town,$zip_co
+de,$country,$email,$telephone_no,$telephone_no2,$theme,$web_address,$
+ppemail,undef,undef,undef) = split "\t", $line;
#create array of users and hash of data
    $onprop{$Prop} = [$Prop,$colour,$txtcol,$url360,$user,$your_name,$
+address,$town,$zip_code,$country,$email,$telephone_no,$telephone_no2,
+$theme,$web_address,$ppemail];
    push (@dataarray, $user);

}

#iterate over users
while (<@dataarray>) {
my @microdata;
    #remove examine first user
    my $item = shift (@dataarray);
    foreach my $scalar (keys %onprop) {
        unless (! $item) {
            #compare lines of hash looking for lines with same user
            if ( @{ $onprop{ $scalar } }[4]  eq $item) {
                #if found, remove from array
                my $a = shift(@dataarray);
                push (@microdata, @{ $onprop{ $scalar } });
            }
        }
    }
    #store all results for this owner in single reference
    my $refmicro = \@microdata;
    push (@group, $refmicro);
}

foreach (@group) {
    print "@{ $_ }<br><br>";
}
[download]

Update: I enclose a rewrite of the code including data to make it easier to view the results. I knew before I'd posted that what I'd done was 'clunky' but I wanted to get something working before I came looking for help...

#!/usr/bin/perl -w
use strict;
use CGI::Carp qw(fatalsToBrowser warningsToBrowser);
use CGI ':standard';
#definitions:
#=========

my  (
    $Prop,
    $colour,
    $txtcol,
    $url360,
    $user,
    $your_name,
    $address,
    $town,
    $zip_code,
    $country,
    $email,
    $telephone_no,
    $telephone_no2,
    $theme,
    $web_address,
    $ppemail
    );
my %onprop;
my @dataarray;
my @group;


print "Content-type: text/html\n\n";
    
while (my $line = (<DATA>))  {
($Prop,$colour,$txtcol,$url360,$user,$your_name,$address,$town,$zip_co
+de,$country,$email,$telephone_no,$telephone_no2,$theme,$web_address,$
+ppemail,undef,undef,undef) = split "\t", $line;
#create array of users and hash of data
    $onprop{$Prop} = [$Prop,$colour,$txtcol,$url360,$user,$your_name,$
+address,$town,$zip_code,$country,$email,$telephone_no,$telephone_no2,
+$theme,$web_address,$ppemail];
    push (@dataarray, $user);

}

#iterate over users
while (<@dataarray>) {
my @microdata;
    #remove examine first user
    my $item = shift (@dataarray);
    foreach my $scalar (keys %onprop) {
        unless (! $item) {
            #compare lines of hash looking for lines with same user
            if ( @{ $onprop{ $scalar } }[4]  eq $item) {
                #if found, remove from array
                my $a = shift(@dataarray);
                push (@microdata, @{ $onprop{ $scalar } });
            }
        }
    }
    #store all results for this owner in single reference
    my $refmicro = \@microdata;
    push (@group, $refmicro);
}

foreach (@group) {
    print "@{ $_ }<br><br>";
}

__DATA__

012    undef    undef    undef    graham    undef                    u
+ndef    undef    undef    undef        undef            
037    undef    undef        graham    undef    undef    undef    unde
+f    undef    undef    undef        undef        undef            
028    red    tdblk        johnandmark    undef    undef     undef    
+undef    undef    undef    undef        undef                    
108    yellow    tdblk    undef    johnandmark    undef    undef     u
+ndef    undef    UK    undef    undef    undef    undef              
+      
013    blue    tdblk    undef    jon    undef    undef     undef    un
+def    undef    undef    undef    undef    undef    undef    undef   
+                 
008    blue    tdblk        malcolm    undef    undef    undef    unde
+f    undef    undef    undef        undef                    
133    green    tdblk        sharon    undef    undef    undef    unde
+f    uk    undef    undef        undef    undef    undef            
047    blue    tdblk        gill    undef    undef     undef    undef 
+   undef    undef    undef        undef    undef
[download]

Further Update: This site never ceases to amaze - so many people give freely of their time and advice. I'm grateful for the knowledge that has been imparted - I feel much better armed to go away and write a more robust and incisive script for the task I have in mind. Thanks to all.

Comment on grouping lines of data together for later use Select or Download Code

Replies are listed 'Best First'.
Re: grouping lines of data together for later use by reasonablekeith (Deacon) on Sep 27, 2005 at 13:11 UTC
I'm with blazer on this one, I don't like all those variables names. Having to take "username" by slicing the 5th element from your array is creating depencies on the order of your data. I'd have read the data into hashes in like this. use Data::Dumper; my %on_prop; my %on_user; while (<DATA>) { my @fields = split "\t"; my %column_data = map {$_, shift(@fields)} qw(PROP COLOUR TXTCOL U +RL360 USER YOUR_NAME ETC); $on_prop{$column_data{PROP}} = \%column_data; push @{$on_user{$column_data{USER}}}, \%column_data; } print Dumper(\%on_prop); print Dumper(\%on_user); print scalar @{$on_user{'rob'}}; # prints 2 (more than one item for t +his user) __DATA__ 123 Red blue whatever rob blah blah blah 124 Red blue whatever rob blah blah blah 125 Red blue whatever jon blah blah blah [download] --- my name's not Keith, and I'm not reasonable.	[reply] [d/l]
Re: grouping lines of data together for later use by Util (Priest) on Sep 27, 2005 at 14:25 UTC
Problems I see in your code: Your algorithm is fragile; it depends on the number of `$user` fields (scattered in `%onprop`) to match the same number of `$user` fields (clustered in `@dataarray`). It took me quite a bit of reading to understand this dependency, so I would say it will be 'non-obvious' to your maintenance programmers. If the lines in PAGE should change their sort order in the future (un-clustering the `$user`s in `@dataarray`), or code is added to pre-filter the `$user`s in only one of the two data structures, then your code will fail badly and invisibly, shifting away data with no warning that anything is amiss. The grouping problem is usually solved with a hash of arrays (HoA); see my code below. Your algorithm is wasteful; you must loop through `%onprop`, with hash lookups inside the loop, (`keys(%onprop)` * number_of_distinct_users) times. Also, your `unless (! $item)` could be moved outside the `foreach my $scalar` loop. Two problems with `my $a = shift(@dataarray);` The use of `$a` or `$b` outside of `sort` blocks is frowned-upon, due to their semi-magic. Since you are throwing the data away, it is better written without the receiving variable (void context), as in `shift @dataarray;` As others (++ to all) have pointed out, huge lists of scalar vars is non-optimal. @field_names and hash slices allow you to get rid of mystery numbers like 4 in `@{ $onprop{ $scalar } }[4]`. When you said `while (<@dataarray>)`, I shuddered. It would probably never fail in this particular program, but you should use `while (@dataarray)` instead. This should illustrated the difference: `$ perl -MO=Deparse -e 'while (@d) {}' while (@d) { (); } $ perl -MO=Deparse -e 'while (<@d>) {}' use File::Glob (); while (defined($_ = glob(join($", @d)))) { (); }` [download] Here is my re-write of your core code, using HoA grouping and hash slices. Working, tested code: #!/usr/bin/perl -W use strict; use warnings 'all'; use Data::Dumper; $Data::Dumper::Useqq = 1; $\| = 1; my @field_names = qw( PROP COLOUR TXTCOL URL360 USER YOUR_NAME ADDRESS TOWN ZIP_CODE COUNTRY EMAIL TELEPHONE_NO TELEPHONE_NO2 THEME WEB_ADDRESS PPEMAIL JUNK1 JUNK2 JUNK3 ); # Simplified @field_names for this example: @field_names = qw( PROP COLOUR USER YOUR_NAME ); # 'Prop' is a unique key in file PAGE and in hash %onprop. my %onprop; # HoH # 'user' is not unique, so each hash entry will be an array of 'Prop's +. my %user_props; # HoA while (<DATA>) { chomp; next unless /\S/; my @fields = split "\t"; warn unless @fields == @field_names; my %h; @h{@field_names} = @fields; my $prop = $h{PROP} or warn; my $user = $h{USER} or warn; # Create array of users and hash of data warn "Overwriting '$prop'" if exists $onprop{$prop}; $onprop{$prop} = \%h; push @{ $user_props{$user} }, $prop; } print Data::Dumper->Dump( [ \%onprop, \%user_props ], [ qw( onprop user_props ) ], ); print join("\t", @field_names), "\n\n"; foreach my $user ( sort keys %user_props ) { my @prop_list = @{ $user_props{$user} }; my $number_of_props = @prop_list; print "Grouped data ($number_of_props lines) for user '$user'\n"; foreach my $prop (@prop_list) { my %h = %{ $onprop{$prop} }; my @fields = @h{@field_names}; print join("\t", @fields), "\n"; } print "\n"; } __DATA__ foo red jonnyfolk Saint J bar white Util Me baz blue jonnyfolk Saint J qux black Util Me2 [download]	[reply] [d/l] [select]
Re^2: grouping lines of data together for later use by blazar (Canon) on Sep 27, 2005 at 15:29 UTC
When you said while `(<@dataarray>)`, I shuddered. Indeed, when I saw it, I thought the code had not been really tested as claimed, for I could have bet it would have yielded an error. But this doesn't seem to be the case: `$ perl -MO=Deparse -e '@a=qw/a b c/; print while <@a>' @a = ('a', 'b', 'c'); use File::Glob (); print $_ while defined($_ = glob(join($", @a))); -e syntax OK` [download] and actually IMHO by virtue of an (un?)fortunate chance: `$ perl -le '@a=qw/a b c/; print while <@a>' a b c` [download] It seems that `while (<whatever>)`'s magic applies even if those angular parens are not to be interpreted as the IO-diamond operator but as the glob one. Now I wonder if this is a (perhaps unavoidable) side-effect or if it is intentional, although I doubt about the latter possibility. Well, this may be a good subject for another question, or for a meditation... Update: done!	[reply] [d/l] [select]
Re: grouping lines of data together for later use by blazar (Canon) on Sep 27, 2005 at 12:09 UTC
Huh?!? I don't have time to read your code carefully now, but I'm not really sure you're under strict, although you do use my. OTOH there are long, too long IMHO, lists of "non my'ed" vars and all in all I find it hardly readable. So, if you're not under strict (and warnings), please do!! Said this, judging from your subject & description, as of the queek peek I gave into it, the obvious, mandatory quick answer would be: hashes. Since I don't seem to see any in your code chances are that it could be the correct one. Check perldata for more info. Update(taking into account your {reply,remark}): `open PAGE, "$website" or die "Cant open $website: $!"; flock (PAGE, 1) or die "Can't lock website file for reading";` [download] The usual recommendation about three arga form of open and lexical FHs apply. Also, always include `$!` in {error,warning} messages about failed system calls. `while (my $line = (<PAGE>)) { ($Prop,$colour,$txtcol,$url360,$user,$your_name,$address,$town,$zip_co +de,$country,$email,$telephone_no,$telephone_no2,$theme,$web_address,$ +ppemail,undef,undef,undef) = split "\t", $line; #create array of users and hash of data $onprop{$Prop} = [$Prop,$colour,$txtcol,$url360,$user,$your_name,$ +address,$town,$zip_code,$country,$email,$telephone_no,$telephone_no2, +$theme,$web_address,$ppemail]; push (@dataarray, $user); }` [download] Commented this here. `while (<@dataarray>) {` [download] Huh?!? This works, if it does work, by accident, only because probably glob doesn't do anything on those items... `unless (! $item) {` [download] How 'bout `if ($item) { ...` [download] And it's largely a matter of personal preferences, but how 'bout `next unless $item;` [download] instead? And so on...	[reply] [d/l] [select]
Re^2: grouping lines of data together for later use by jonnyfolk (Vicar) on Sep 27, 2005 at 12:24 UTC
Thanks for the quick answer - I can see you neither had time to read the code or the question (you'll see the quick answer doesn't really fit :) ) I should probably have mentioned (but didn't think to) that this is an excerpt from a larger script which does work under strict, is properly declared and if you take a closer look you'll see a hash staring right back out at you!! As I also mentioned it does do the job I require, but I'd like an opinion from fellow monks as I'm not well skilled at this sort of thing. Cheers Update: I appreciate your further comments and will try to implement and learn for future use. Thanks very much.	[reply]
Re^3: grouping lines of data together for later use by blazar (Canon) on Sep 27, 2005 at 12:50 UTC
I should probably have mentioned (but didn't think to) that this is an excerpt from a larger script which does work under strict, is properly declared and if you take a closer look you'll see a hash staring right back out at you!! Sorry. Now I see. Indeed that long list of variables is mostly confusing (to me). To increase readability you may either use a list slice, and optionally put it into a temporary array: since you know which are the fields you're interested in in any case, you can access the (few) ones you want to use as "keys" by numeric subscript and the (many) other ones like aggregates or turn those variables into hash(ref) keys (thus possibly adopting a more complex, but easier to deal with data structure) hopefully specified with more whitespace to increase readability instead. As an example of the second point I mean something like (but modify at will): `while (<$page>) { my %localprop; @localprop{ qw(Prop colour txtcol url360 user your_name address town zip_code country email telephone_no telephone_no2 theme web_address ppemail) } = split /\t/; $onprop{ $localprop{Prop} } = \%localprop; push @dataarray, $localprop{$user}; }` [download] As I also mentioned it does do the job I require, but I'd like an opinion from fellow monks as I'm not well skilled at this sort of thing. Well, a suggestion is above, and the proposed code -although it may not be exactly what you need- is IMHO much less of a mess than the original one...	[reply] [d/l]