I posted a meditation a couple of weeks ago about using += in array context. The replies indicated that I was on the right track, in that there does not exist a simple way of saying

my @foo = qw/1 2 3/;
my @bar = qw/4 6 8/;
@foo += @bar;   # @foo would now be equivalent to qw/5 8 11/

Somewhere along the line you're going to have a map or a foreach trundling over the array elements. I now have a different problem to solve, and suddenly... I was enlightened.
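For completeness, the foreach version of the array addition above might look like this (the index-loop spelling is mine, not from the original thread):

```perl
# Element-wise @foo += @bar via an explicit index loop
my @foo = qw/1 2 3/;
my @bar = qw/4 6 8/;
$foo[$_] += $bar[$_] for 0 .. $#bar;
# @foo is now (5, 8, 11)
```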

I have a large directory tree (several tens of thousands of files), and I suspect that all the files belong to the same user and group. A simple File::Find script will be able to sort that out for me. All I have to do is stat each file and increment two hashes, keyed on the uid and the gid of the file. At the end, I print out the hashes and I'll know whether my theory holds.

While I have yet to write a single line of code, I can already envision pretty well what it is going to look like. And I pondered on how best to increment the hashes. I figured that I wanted to take a slice of the list returned from stat (4 for the uid, 5 for the gid). And I got stuck on how to transfer that two-element list over to increment those two hash keys.

And I realised that I was being haunted once again by a different facet of my array context demon. So at least I know that the following code is about the best I can do (insofar as I can't get rid of the temporary lexicals):

my( $uid, $gid ) = (stat $file)[4,5];
++$user{$uid};
++$group{$gid};

I.e. you can't do that in one line. And then it suddenly hit me: what I am really trying to do is to point out to the compiler that what I want to do is parallelisable. I suppose this is because I've just finished an article on IA-64 (a.k.a. Itanium) optimisation. This latest chip from Intel allows all sorts of parallel execution to take place; indeed, exploiting that parallelism is the only way to get the maximum performance out of it.

It turns out that I don't care what order the hash values are incremented. If the CPU decides to schedule both at once, I'll be thrilled to bits. Harking back to my array assignment, on a sufficiently CPU-rich machine, the operation should be able to be executed in more or less constant time. Now I haven't studied parallelism in computing since the Transputer and Occam, so I'm way out of date. I don't know what the state of the art is in terms of compilers being able to spot possibilities for parallel execution in Von Neuman-style procedural languages, but maybe with a bit of syntactic sugar in Perl 6 we'll be able to come closer to it.

update: A number of people have posted interesting techniques to both of these meditations to help me out with the problem at hand, for which I am immensely grateful. But I'm taking the long view. It still raises the question of whether Perl is to be instructed in minute detail about what is to be done, or whether it is possible to detect parallel semantics directly from the source code (without either spelling it out in detail or using silly Oracle-style hints to tell the compiler what you want).

If this is possible, my feeling is that the source code is going to look short, sharp and sweet. If parallel processing is going to come to the masses, it's going to be via a mainstream language, be it Perl, Java, Python or C$, err, C#. With its rich linguistic background, Perl has a good chance of pulling it off.

Specifically, in response to DrZaius, I'm thinking of intrathread parallelism. Thread-level parallelism is already too explicit (read: lots of fussy code) for me to want to bother dealing with, and it's cracking nuts with a sledgehammer in this particular context. I only want to keep that CPU pipeline stuffed to the gills. The less explicit syntax I have to use, the better chance there is of pulling it off.

update 2: by the way, for those who stumbled across this node via a search and are actually looking for some code to do what I just described, here is what I wound up writing:

#! /usr/bin/perl -w
# scan-uid-gid copyright (c) David Landgren 2001
use strict;
use File::Find;

my $root = shift;
$root = '.' unless defined $root;

my( %user, %group );
find sub {
    return if $_ eq '.' or $_ eq '..';
    my( $uid, $gid ) = (stat)[4,5];
    $user{$uid}++;
    $group{$gid}++;
}, $root;

print "uids\n";
print "\t$_: $user{$_}\n" foreach( sort keys %user );
print "gids\n";
print "\t$_: $group{$_}\n" foreach( sort keys %group );

It turns out my hypothesis was correct :-). I received the following output:

# scan-uid-gid
uids
	2000: 18622
gids
	200: 18622

--
g r i n d e r

Replies are listed 'Best First'.
(tye)Re: Revisiting array context
by tye (Sage) on Jun 18, 2001 at 10:22 UTC

    Yes, Perl is missing some handy idioms. You can use mapcar in this situation, though it isn't a perfect fit:

    use mapcar;
    mapcar { $_[0]{$_[1]}++ } [ \%user, \%group ], [ (stat $file)[4,5] ];
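mapcar is tye's own module (posted on PerlMonks, not part of core Perl or CPAN); a minimal local sketch of the idea, as I understand it, might look like this:

```perl
# Minimal sketch of the mapcar idea: call the block once per index,
# passing the elements at that index from each list as @_.
use strict;
use warnings;
use List::Util 'max';

sub mapcar (&@) {
    my ($code, @lists) = @_;
    my $last = max map { $#$_ } @lists;
    return map {
        my $i = $_;
        $code->( map { $_->[$i] } @lists );
    } 0 .. $last;
}

my (%user, %group);
mapcar { $_[0]{$_[1]}++ } [ \%user, \%group ], [ 1000, 100 ];
# %user is now (1000 => 1), %group is (100 => 1)
```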

            - tye (but my friends call me "Tye")
Re: Revisiting array context
by bwana147 (Pilgrim) on Jun 18, 2001 at 16:07 UTC
    my( $uid, $gid ) = (stat $file)[4,5];
    ++$user{$uid};
    ++$group{$gid};

    Actually, you can do that in one line:

    sub { ++$user{shift()}; ++$group{shift()} }->( (stat $file)[4,5] );

    Of course, just because you can doesn't mean you should.

    --bwana147

Re: Revisiting array context
by John M. Dlugosz (Monsignor) on Jun 18, 2001 at 02:29 UTC
    If += takes two arguments (as does +) it should be possible to overload it now, with Perl 5, if you bless one of the two arguments. That is, don't add two lists, but add two list references, at least one of which is a trivial class (you don't need to define any members or anything, just bless it into a package so overloading kicks in).

    Now, if you are going to bless things anyway, why not develop it a little into a "numeric vector" class, that supports operator + overloading etc. and members like inner product, absolute value, and other vector things?
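A minimal sketch of that suggestion (the class name NumVec is my own invention; note that by default Perl autogenerates += from an overloaded +, so overloading + alone is enough):

```perl
package NumVec;
use strict;
use warnings;
use overload '+' => \&add;    # '+=' is autogenerated from '+'

sub new { my ($class, @n) = @_; bless [@n], $class }

sub add {
    my ($self, $other) = @_;
    # Element-wise sum; assumes both vectors have the same length
    NumVec->new( map { $self->[$_] + $other->[$_] } 0 .. $#$self );
}

package main;
my $foo = NumVec->new(1, 2, 3);
my $bar = NumVec->new(4, 6, 8);
$foo += $bar;                 # dispatches to NumVec::add
print "@$foo\n";              # prints "5 8 11"
```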

    Also today you can use mapcar to sum your vectors.

    Now back to the incrementing of hash slices: In order to make it worthwhile, you need both a generalized hash slice and a way to apply the ++ to all the members in it. I think functional programming constructs can do a lot of that in Perl 5, and I hope it is fairly simple and natural in Perl 6.
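For what it's worth, Perl 5's aliasing foreach can already apply ++ across a hash slice; a small sketch:

```perl
# foreach aliases $_ to each slice element, so ++ updates the hash in place
use strict;
use warnings;

my %count;
my @keys = qw/alice bob alice/;
++$_ for @count{@keys};
# alice appears twice in the slice, so $count{alice} is 2, $count{bob} is 1
```

The two-hash case from the root node collapses the same way: `++$_ for $user{$uid}, $group{$gid};`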

    — John

Re: Revisiting array context
by DrZaius (Monk) on Jun 18, 2001 at 04:33 UTC
    Your algorithm isn't going to be limited by the CPU when incrementing integers. Most of the time you are going to be blocking on your hard drive.

    Why not use threads instead? You could have one thread reading off the file system and caching the stat info while the other thread is popping off your cache.

    If you did this, I'm going to guess that most of the time your incrementing thread is going to be waiting anyway.
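A sketch of the producer/consumer split described above. It assumes a threads-enabled perl, and uses the modern threads/Thread::Queue interface rather than whatever was current when this thread was written:

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use File::Find;

my $q = Thread::Queue->new;

# Consumer thread: pop (uid, gid) pairs off the queue and count them
my $counter = threads->create(sub {
    my (%user, %group);
    while (defined(my $pair = $q->dequeue)) {
        my ($uid, $gid) = @$pair;
        $user{$uid}++;
        $group{$gid}++;
    }
    return [ \%user, \%group ];
});

# Producer: walk the tree, stat each file, push the ids onto the queue
my $root = shift;
$root = '.' unless defined $root;
find(sub { $q->enqueue([ (stat)[4, 5] ]) }, $root);
$q->enqueue(undef);                     # end-of-stream marker

my ($user, $group) = @{ $counter->join };
print "uids: ", join(' ', sort keys %$user), "\n";
```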

    If you are running this on a linux system, you will get better performance (not really, but your script will run faster) if you do this:

    find .; ./count.pl
    Linux will most likely cache most of the info into memory :) Some other OSes will do this too ;)
Re: Revisiting array context
by mattr (Curate) on Jun 18, 2001 at 14:38 UTC
    Didn't realize for loops were right out. Of course, Math::Matrix has a for loop inside it in the code I suggested. But I/O and process nice levels are going to be the slowdowns for this job.

    If you want to parallelize adding, you might be able to take advantage of parallelism within a single processor (perhaps it adds four bytes at a time if you use bit vectors) or grab more processors with threads or processes, the latter of which might be able to squeeze some more niced CPU time out. Perhaps Parallel::ForkManager, whose documentation suggests using it when downloading thousands of files, would be useful.

    I suppose you would try to break the problem into pieces big enough to be worth starting a new process to get them, and adding up their results, or rather, just concatenating the bit vectors they return. The adding itself doesn't strike me as being very time consuming.
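A hypothetical sketch of that fork-per-subtree idea using Parallel::ForkManager (a CPAN module); the one-subdirectory-per-child split and the limit of 4 workers are my own assumptions:

```perl
use strict;
use warnings;
use File::Find;
use Parallel::ForkManager;

my %user;
my $pm = Parallel::ForkManager->new(4);

# Merge each child's counts back into the parent as it exits
$pm->run_on_finish(sub {
    my ($pid, $exit, $ident, $signal, $core, $data) = @_;
    $user{$_} += $data->{$_} for keys %$data;
});

for my $dir (grep { -d } glob '*') {
    $pm->start and next;                # parent continues the loop
    my %local;                          # child counts its own subtree
    find(sub { $local{ (stat)[4] }++ }, $dir);
    $pm->finish(0, \%local);            # ship counts back to the parent
}
$pm->wait_all_children;
```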