jonnyfolk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a data file in which holds information on different items accorded to individual users. A user might have more than one item (each one being unique), in which case he is accorded a new line for each item.

Users with multiple items are treated differently to those with just the one and so I am trying to group together the lines for each user and then trat them on a case by case basis.

I have written the following code which does seem to do the trick, but I would like advice and comments from my fellow monks on how I could do it better (and to ensure that I'm not fluking the results and thus saving up problems for the future.

I have commented the code to try to explain my thought process (just in case you thought there wasn't one!! )

Any help given would be much appreciated.

open PAGE, "$website" or die "Cant open $website: $!"; flock (PAGE, 1) or die "Can't lock website file for reading"; while (my $line = (<PAGE>)) { ($Prop,$colour,$txtcol,$url360,$user,$your_name,$address,$town,$zip_co +de,$country,$email,$telephone_no,$telephone_no2,$theme,$web_address,$ +ppemail,undef,undef,undef) = split "\t", $line; #create array of users and hash of data $onprop{$Prop} = [$Prop,$colour,$txtcol,$url360,$user,$your_name,$ +address,$town,$zip_code,$country,$email,$telephone_no,$telephone_no2, +$theme,$web_address,$ppemail]; push (@dataarray, $user); } #iterate over users while (<@dataarray>) { my @microdata; #remove examine first user my $item = shift (@dataarray); foreach my $scalar (keys %onprop) { unless (! $item) { #compare lines of hash looking for lines with same user if ( @{ $onprop{ $scalar } }[4] eq $item) { #if found, remove from array my $a = shift(@dataarray); push (@microdata, @{ $onprop{ $scalar } }); } } } #store all results for this owner in single reference my $refmicro = \@microdata; push (@group, $refmicro); } foreach (@group) { print "@{ $_ }<br><br>"; }
Update: I enclose a rewrite of the code including data to make it easier to view the results. I knew before I'd posted that what I'd done was 'clunky' but I wanted to get something working before I came looking for help...
#!/usr/bin/perl -w use strict; use CGI::Carp qw(fatalsToBrowser warningsToBrowser); use CGI ':standard'; #definitions: #========= my ( $Prop, $colour, $txtcol, $url360, $user, $your_name, $address, $town, $zip_code, $country, $email, $telephone_no, $telephone_no2, $theme, $web_address, $ppemail ); my %onprop; my @dataarray; my @group; print "Content-type: text/html\n\n"; while (my $line = (<DATA>)) { ($Prop,$colour,$txtcol,$url360,$user,$your_name,$address,$town,$zip_co +de,$country,$email,$telephone_no,$telephone_no2,$theme,$web_address,$ +ppemail,undef,undef,undef) = split "\t", $line; #create array of users and hash of data $onprop{$Prop} = [$Prop,$colour,$txtcol,$url360,$user,$your_name,$ +address,$town,$zip_code,$country,$email,$telephone_no,$telephone_no2, +$theme,$web_address,$ppemail]; push (@dataarray, $user); } #iterate over users while (<@dataarray>) { my @microdata; #remove examine first user my $item = shift (@dataarray); foreach my $scalar (keys %onprop) { unless (! $item) { #compare lines of hash looking for lines with same user if ( @{ $onprop{ $scalar } }[4] eq $item) { #if found, remove from array my $a = shift(@dataarray); push (@microdata, @{ $onprop{ $scalar } }); } } } #store all results for this owner in single reference my $refmicro = \@microdata; push (@group, $refmicro); } foreach (@group) { print "@{ $_ }<br><br>"; } __DATA__ 012 undef undef undef graham undef u +ndef undef undef undef undef 037 undef undef graham undef undef undef unde +f undef undef undef undef undef 028 red tdblk johnandmark undef undef undef +undef undef undef undef undef 108 yellow tdblk undef johnandmark undef undef u +ndef undef UK undef undef undef undef + 013 blue tdblk undef jon undef undef undef un +def undef undef undef undef undef undef undef + 008 blue tdblk malcolm undef undef undef unde +f undef undef undef undef 133 green tdblk sharon undef undef undef unde +f uk undef undef undef undef undef 047 blue tdblk gill undef undef undef undef + undef undef undef undef undef
Further Update: This site never ceases to amaze - so many people give freely of their time and advice. I'm grateful for the knowledge that has been imparted - I feel much better armed to go away and write a more robust and incisive script for the task I have in mind. Thanks to all.

Replies are listed 'Best First'.
Re: grouping lines of data together for later use
by reasonablekeith (Deacon) on Sep 27, 2005 at 13:11 UTC
    I'm with blazer on this one, I don't like all those variables names. Having to take "username" by slicing the 5th element from your array is creating depencies on the order of your data. I'd have read the data into hashes in like this.

    use Data::Dumper; my %on_prop; my %on_user; while (<DATA>) { my @fields = split "\t"; my %column_data = map {$_, shift(@fields)} qw(PROP COLOUR TXTCOL U +RL360 USER YOUR_NAME ETC); $on_prop{$column_data{PROP}} = \%column_data; push @{$on_user{$column_data{USER}}}, \%column_data; } print Dumper(\%on_prop); print Dumper(\%on_user); print scalar @{$on_user{'rob'}}; # prints 2 (more than one item for t +his user) __DATA__ 123 Red blue whatever rob blah blah blah 124 Red blue whatever rob blah blah blah 125 Red blue whatever jon blah blah blah
    ---
    my name's not Keith, and I'm not reasonable.
Re: grouping lines of data together for later use
by Util (Priest) on Sep 27, 2005 at 14:25 UTC

    Problems I see in your code:

    • Your algorithm is fragile; it depends on the number of $user fields (scattered in %onprop) to match the same number of $user fields (clustered in @dataarray). It took me quite a bit of reading to understand this dependency, so I would say it will be 'non-obvious' to your maintenance programmers. If the lines in PAGE should change their sort order in the future (un-clustering the $users in @dataarray), or code is added to pre-filter the $users in only one of the two data structures, then your code will fail badly and *invisibly*, shifting away data with no warning that anything is amiss.
      The grouping problem is usually solved with a hash of arrays (HoA); see my code below.
    • Your algorithm is wasteful; you must loop through %onprop, with hash lookups inside the loop, (keys(%onprop) * number_of_distinct_users) times. Also, your unless (! $item) could be moved outside the foreach my $scalar loop.
    • Two problems with my $a = shift(@dataarray);
      1. The use of $a or $b outside of sort blocks is frowned-upon, due to their semi-magic.
      2. Since you are throwing the data away, it is better written without the receiving variable (void context), as in shift @dataarray;
    • As others (++ to all) have pointed out, huge lists of scalar vars is non-optimal. @field_names and hash slices allow you to get rid of mystery numbers like 4 in @{ $onprop{ $scalar } }[4].
    • When you said while (<@dataarray>), I shuddered. It would probably never fail in this particular program, but you should use while (@dataarray) instead. This should illustrated the difference:
      $ perl -MO=Deparse -e 'while (@d) {}' while (@d) { (); } $ perl -MO=Deparse -e 'while (<@d>) {}' use File::Glob (); while (defined($_ = glob(join($", @d)))) { (); }

    Here is my re-write of your core code, using HoA grouping and hash slices. Working, tested code:

    #!/usr/bin/perl -W use strict; use warnings 'all'; use Data::Dumper; $Data::Dumper::Useqq = 1; $| = 1; my @field_names = qw( PROP COLOUR TXTCOL URL360 USER YOUR_NAME ADDRESS TOWN ZIP_CODE COUNTRY EMAIL TELEPHONE_NO TELEPHONE_NO2 THEME WEB_ADDRESS PPEMAIL JUNK1 JUNK2 JUNK3 ); # Simplified @field_names for this example: @field_names = qw( PROP COLOUR USER YOUR_NAME ); # 'Prop' is a unique key in file PAGE and in hash %onprop. my %onprop; # HoH # 'user' is not unique, so each hash entry will be an array of 'Prop's +. my %user_props; # HoA while (<DATA>) { chomp; next unless /\S/; my @fields = split "\t"; warn unless @fields == @field_names; my %h; @h{@field_names} = @fields; my $prop = $h{PROP} or warn; my $user = $h{USER} or warn; # Create array of users and hash of data warn "Overwriting '$prop'" if exists $onprop{$prop}; $onprop{$prop} = \%h; push @{ $user_props{$user} }, $prop; } print Data::Dumper->Dump( [ \%onprop, \%user_props ], [ qw( *onprop *user_props ) ], ); print join("\t", @field_names), "\n\n"; foreach my $user ( sort keys %user_props ) { my @prop_list = @{ $user_props{$user} }; my $number_of_props = @prop_list; print "Grouped data ($number_of_props lines) for user '$user'\n"; foreach my $prop (@prop_list) { my %h = %{ $onprop{$prop} }; my @fields = @h{@field_names}; print join("\t", @fields), "\n"; } print "\n"; } __DATA__ foo red jonnyfolk Saint J bar white Util Me baz blue jonnyfolk Saint J qux black Util Me2

      When you said while (<@dataarray>), I shuddered.
      Indeed, when I saw it, I thought the code had not been really tested as claimed, for I could have bet it would have yielded an error. But this doesn't seem to be the case:
      $ perl -MO=Deparse -e '@a=qw/a b c/; print while <@a>' @a = ('a', 'b', 'c'); use File::Glob (); print $_ while defined($_ = glob(join($", @a))); -e syntax OK
      and actually IMHO by virtue of an (un?)fortunate chance:
      $ perl -le '@a=qw/a b c/; print while <@a>' a b c
      It seems that while (<whatever>)'s magic applies even if those angular parens are not to be interpreted as the IO-diamond operator but as the glob one. Now I wonder if this is a (perhaps unavoidable) side-effect or if it is intentional, although I doubt about the latter possibility.

      Well, this may be a good subject for another question, or for a meditation... Update: done!

Re: grouping lines of data together for later use
by blazar (Canon) on Sep 27, 2005 at 12:09 UTC
    Huh?!? I don't have time to read your code carefully now, but I'm not really sure you're under strict, although you do use my. OTOH there are long, too long IMHO, lists of "non my'ed" vars and all in all I find it hardly readable.

    So, if you're not under strict (and warnings), please do!!

    Said this, judging from your subject & description, as of the queek peek I gave into it, the obvious, mandatory quick answer would be: hashes. Since I don't seem to see any in your code chances are that it could be the correct one.

    Check perldata for more info.

    Update(taking into account your {reply,remark}):

    open PAGE, "$website" or die "Cant open $website: $!"; flock (PAGE, 1) or die "Can't lock website file for reading";
    The usual recommendation about three arga form of open and lexical FHs apply. Also, always include $! in {error,warning} messages about failed system calls.
    while (my $line = (<PAGE>)) { ($Prop,$colour,$txtcol,$url360,$user,$your_name,$address,$town,$zip_co +de,$country,$email,$telephone_no,$telephone_no2,$theme,$web_address,$ +ppemail,undef,undef,undef) = split "\t", $line; #create array of users and hash of data $onprop{$Prop} = [$Prop,$colour,$txtcol,$url360,$user,$your_name,$ +address,$town,$zip_code,$country,$email,$telephone_no,$telephone_no2, +$theme,$web_address,$ppemail]; push (@dataarray, $user); }
    Commented this here.
    while (<@dataarray>) {
    Huh?!? This works, if it does work, by accident, only because probably glob doesn't do anything on those items...
    unless (! $item) {
    How 'bout
    if ($item) { ...
    And it's largely a matter of personal preferences, but how 'bout
    next unless $item;
    instead?

    And so on...

      Thanks for the quick answer - I can see you neither had time to read the code or the question (you'll see the quick answer doesn't really fit :) )

      I should probably have mentioned (but didn't think to) that this is an excerpt from a larger script which does work under strict, is properly declared and if you take a closer look you'll see a hash staring right back out at you!!

      As I also mentioned it does do the job I require, but I'd like an opinion from fellow monks as I'm not well skilled at this sort of thing.

      Cheers

      Update: I appreciate your further comments and will try to implement and learn for future use. Thanks very much.