in reply to Re^4: regular expessions question: (replacing words)
in thread regular expessions question: (replacing words)

Does your data fit into memory? If not, it gets more complicated (or you just have to wait a long time while the data file gets read dozens of times). You would either have to store it in a database or compress it (e.g. 'z' is 1, non-'z' is 0, so that every element uses just one bit).
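To make the one-bit-per-element idea concrete, here is a minimal sketch using Perl's built-in vec. The helper names pack_row and has_z are made up for this example, and it assumes each row is already split into a list of cells:

```perl
use strict;
use warnings;

# Pack one row into a bit string: bit i is 1 when column i holds 'z'.
# (pack_row and has_z are hypothetical helper names, not from the thread.)
sub pack_row {
    my @cells = @_;
    my $bits  = '';
    for my $i (0 .. $#cells) {
        vec($bits, $i, 1) = $cells[$i] eq 'z' ? 1 : 0;
    }
    return $bits;
}

# Return 1 if the packed row has a 'z' in the given column, else 0.
sub has_z {
    my ($bits, $col) = @_;
    return vec($bits, $col, 1);
}

my $row = pack_row(qw(z 4 z 4 z));
print length($row), " byte(s) used for 5 columns\n";
print join(' ', map { has_z($row, $_) } 0 .. 4), "\n";   # prints 1 0 1 0 1
```

Stored this way, eight columns fit into one byte, at the cost of losing the original non-'z' values.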

If yes, read the file into an Array of Arrays:

my @data;
while (my $organized = <DATA2>) {
    chomp $organized;                  # a bare chomp would act on $_, not $organized
    $organized =~ s/(\s)\w+/$1z/g;
    push @data, [ split /\s+/, $organized ];   # one array reference per line
}

Now accessing column 5 of line 2 is just a simple $data[2][5] (both indices start counting at 0).

To make this easier, split your problem into smaller parts. Create a subroutine that takes an arbitrary number of column indices as parameters. This subroutine just counts all rows that have a 'z' in all of these columns. You can do that with a loop (over the selected columns) inside a loop (over all rows).

Once you have that working (test it with some simple data), create another array and add a random column number to it. Then repeatedly add another random number (one that is not already in the array) and call the subroutine with the array each time. Do that 18 times.

Re^6: regular expessions question: (replacing words)
by $new_guy (Acolyte) on Sep 28, 2010 at 07:28 UTC
    Hi Jethro,

    Thanks for the explanation!

    Yes the data fits in memory! And yes it would be appropriate to say every z is 1 and non-z is 0.

    I still don't understand! How do I select two columns at random, then count only the z's that are common to both columns in the same row? By count I mean: if a z occurs in column 1 at row 6 and in column 2 at row 6, then my count of z's would be 1. Notice my count will increase as I go down comparing the rows.

    Thanks

      The 0 and 1 encoding would only have been necessary if you needed to compress the data, i.e. save memory, which you say isn't the case.

      Ok, here is the subroutine that counts rows with all 'z' in specific columns:

      sub countrows {
          # First parameter is a pointer/reference to the data
          # Second parameter is an array of column numbers
          my ($data, @f) = @_;
          my $count = 0;
          foreach my $row (@$data) {
              my $success = 1;
              foreach (@f) {
                  if ($row->[$_] ne 'z') {
                      $success = 0;
                      last;
                  }
              }
              $count += $success;
          }
          return $count;
      }

      my @data = ( ['z',4,'z',4,'z'], ['z',4,'z',4,4], ['z','z','z',4,'z'] );
      print countrows(\@data,0,2), ' ', countrows(\@data,1,3,2), ' ', countrows(\@data,4), "\n";
      # prints 3 0 2

      Now to get an array of random numbers. To make sure I don't get any number twice, I generate an array of all column numbers (0 up to the number of columns minus 1) and pick (i.e. extract and delete) random numbers from that array:

      sub randomarray {
          my ($columns, $count) = @_;
          my @all = (0 .. $columns - 1);
          my @randarray;
          while ($count-- > 0) {
              # splice removes the picked element, so it cannot be picked again
              push @randarray, splice(@all, int(rand(@all)), 1);
          }
          return @randarray;
      }

      print join(' ', randomarray( scalar @{$data[0]}, 2 )), "\n";
      print join(' ', randomarray( scalar @{$data[0]}, 3 )), "\n";
      print join(' ', randomarray( scalar @{$data[0]}, 4 )), "\n";
      # might print
      # 3 4
      # 3 0 1
      # 0 1 4 2

      You see how I cut the problem into smaller pieces that are easier to tackle? OK, the subroutines are still not trivial, but you should be able to connect them in a sensible way to solve your problem.
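
      One possible way to connect the pieces, only as a sketch: grow the selection by one random unused column per step and count after each step. The name growing_counts is made up for this example, and the counting logic is repeated (in a slightly shorter form) so that the block runs on its own. It assumes every row has as many columns as the first one.

```perl
use strict;
use warnings;

# Count the rows that have a 'z' in every listed column.
sub countrows {
    my ($data, @cols) = @_;
    my $count = 0;
    ROW: foreach my $row (@$data) {
        foreach my $col (@cols) {
            next ROW if $row->[$col] ne 'z';
        }
        $count++;
    }
    return $count;
}

# Hypothetical driver: add one random, not yet used column per step
# and record the selected columns together with the resulting count.
sub growing_counts {
    my ($data, $steps) = @_;
    my @unused = (0 .. $#{ $data->[0] });
    my (@chosen, @result);
    while ($steps-- > 0 && @unused) {
        push @chosen, splice(@unused, int(rand(@unused)), 1);
        push @result, [ [@chosen], countrows($data, @chosen) ];
    }
    return @result;
}

my @data = (
    ['z', 4,   'z', 4, 'z'],
    ['z', 4,   'z', 4, 4  ],
    ['z', 'z', 'z', 4, 'z'],
);

# 18 steps are requested, but this toy data has only 5 columns,
# so the loop stops once every column has been used.
foreach my $step (growing_counts(\@data, 18)) {
    my ($cols, $count) = @$step;
    print "columns @$cols: $count\n";
}
```

      The count can only stay the same or drop as columns are added, because every extra column is one more condition a row has to meet.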