
OK, I almost get it! (Update: I had a mistake here before.)

The reduction was easy in Stata: I sorted all the data by the characteristic columns (as if they were one big binary number), then eliminated duplicates. That brought it down to 400,000 lines of data.

Now, I've tried adapting your code and can't seem to get it to work. The XOR appears to be working, but somewhere between packing and unpacking something goes wrong.

Here is a simplification I've written with two lines of data and 8 characteristics:

#!/usr/bin/perl
use strict;

my (@patNos, @data) = 0;

my $line1 = "44444,1,1,0,0,0,1,1,1";
my @bits1 = split ',',$line1;
$patNos[0] = shift @bits1;
$data[0] = pack 'b8', @bits1;

my $line2 = "55555,0,1,1,0,0,1,0,1";
my @bits2 = split ',',$line2;
$patNos[1] = shift @bits2;
$data[1] = pack 'b8', @bits2;

print "$patNos[0] - @bits1\n";
print "$patNos[1] - @bits2\n";

my $line1 = unpack 'b8', $data[0];
my $line2 = unpack 'b8', $data[1];
my $variance = unpack '%32b*', ($data[0] ^ $data[1]);

print "\nline 1: $line1\n";
print "line 2: $line2\n"; 
print "\nvariance: $variance\n";

Which returns:

44444 - 1 1 0 0 0 1 1 1 
55555 - 0 1 1 0 0 1 0 1

line 1: 10000000
line 2: 00000000

variance: 1

I would expect $line1 and $line2 to contain the original '0' and '1' strings as characters, but they just return all '0's except for the lone '1' in $line1. I've been reading about pack and unpack all day, but can't figure out what I'm doing wrong. I also realized that if I do pack 's', the variance is calculated correctly, but I would expect this to stop working once there were more than 16 characteristics.
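A minimal sketch of what may be going wrong, assuming the problem is in how the list is handed to pack: the 'b8' template consumes a single list item and reads it as a string of '0' and '1' characters, so passing @bits1 directly uses only its first element. Joining the list into one string first seems to give the expected round trip. The variable names here are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same characteristics as line 1 of the example above.
my @bits = (1, 1, 0, 0, 0, 1, 1, 1);

# 'b8' takes ONE list item and reads it as a string of '0'/'1' characters.
# Given a list, only the first element ("1") is used; the rest are ignored.
my $from_list = pack 'b8', @bits;

# Joining first hands pack the whole 8-character string "11000111".
my $from_string = pack 'b8', join '', @bits;

print "from list:   ", unpack('b8', $from_list),   "\n";   # 10000000
print "from string: ", unpack('b8', $from_string), "\n";   # 11000111
```

This would also explain why the variance came out as 1: only the first characteristic of each line made it into the packed data.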

I love the idea of making one huge bitstring and then using substr for the comparisons. What is the best way to concatenate the bitstrings? I'm assuming then that substr can work with a bitstring the same as it would a regular string?
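A rough sketch of the concatenation idea, assuming every record packs to the same number of bytes: a packed bitstring is an ordinary Perl byte string, so the `.` operator concatenates and substr slices it, but substr offsets count bytes rather than bits. `$bytes_per_row` and the other names are hypothetical.

```perl
use strict;
use warnings;

# The two example records, 8 characteristics each, already joined into strings.
my @rows = ('11000111', '01100101');
my $bytes_per_row = 1;    # ceil(8 bits / 8); hypothetical name

# Packed records are plain byte strings, so .= appends them.
my $all = '';
$all .= pack 'b8', $_ for @rows;

# substr addresses BYTES: record $i starts at offset $i * $bytes_per_row.
my $first  = substr $all, 0 * $bytes_per_row, $bytes_per_row;
my $second = substr $all, 1 * $bytes_per_row, $bytes_per_row;

# String XOR, then count the set bits.
my $variance = unpack '%32b*', $first ^ $second;

print "records:  ", length($all) / $bytes_per_row, "\n";   # 2
print "variance: $variance\n";                             # 3
```

With more characteristics the only change should be the bit count in the pack template and the matching byte width per record.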

Also, a few questions:

unpack '%32b*', ( $data[ $first ] ^ $data[ $second ] );

What is the significance of 32 here? If I'm using a 64-bit machine, should I change it to 64? I'm writing this on a 32-bit machine, but it will run on a 64-bit one.
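As far as the unpack documentation describes it, the `%<number>` prefix asks for a `<number>`-bit checksum of the unpacked items instead of the items themselves, so 32 is the width of the running total (the count wraps modulo 2**32) rather than the machine word size or the length of the bitstring. A small sketch under that reading, with made-up variable names:

```perl
use strict;
use warnings;

my $x = pack 'b8', '11000111';
my $y = pack 'b8', '01100101';
my $diff = $x ^ $y;    # string XOR: a 1 bit wherever the records differ

print unpack('b*', $diff),    "\n";   # 10100010
print unpack('%32b*', $diff), "\n";   # 3 (sum of the bits above)

# The count is not limited to 32 bits of input: 70 characteristics work too.
my $wide = pack 'b70', '1' x 70;
print unpack('%32b*', $wide), "\n";   # 70
```

On that basis the template should not need changing between a 32-bit and a 64-bit machine.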

Why use 5.010?

What is the advantage of setting the length of the @patNos and @data arrays at the start?
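One possible reading, assuming the arrays were pre-extended by assigning to `$#array`: Perl then allocates the slots once instead of growing the array repeatedly as it fills, which perldata describes as a small efficiency gain for arrays that are going to get big. A sketch with a hypothetical `$expected_rows`:

```perl
use strict;
use warnings;

my $expected_rows = 400_000;    # row count quoted earlier in this post

my @data;
$#data = $expected_rows - 1;    # pre-extend: the last index becomes 399_999

print scalar(@data), "\n";      # 400000 slots, all undef so far

# Fill by index; push would append AFTER the pre-extended slots.
$data[$_] = pack 'b8', '11000111' for 0 .. 9;
```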

print "\r$.\t" unless $. % 1000;

Is this just printing a status update every 1000 lines?
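A small sketch of how that line appears to behave, using an in-memory filehandle as a stand-in for the real data file (an assumption for illustration): `$.` holds the line number of the last filehandle read, and `$. % 1000` is zero, hence false, only on every 1000th line.

```perl
use strict;
use warnings;

# In-memory "file" of 3500 rows standing in for the real data (assumption).
my $fake_file = join '', map { "row $_\n" } 1 .. 3500;
open my $fh, '<', \$fake_file or die "Cannot open in-memory file: $!";

my @reported;
while (my $line = <$fh>) {
    # The remainder is 0 (false) only on lines 1000, 2000, 3000, ...
    unless ($. % 1000) {
        push @reported, $.;
        print "\r$.\t";    # \r returns to column 0, so the counter overwrites itself
    }
}
close $fh;

print "\nreported at: @reported\n";   # reported at: 1000 2000 3000
```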

say "\n", time;

What is this doing?
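A sketch of what this line seems to do, which may also bear on the 5.010 question: `say` prints its arguments followed by a newline and is only available once the feature is enabled, for example by `use 5.010`, while `time` returns the current time as whole seconds since the epoch. Printing it before and after a long loop gives a crude elapsed-time measurement. The `$start` and `$elapsed` names are hypothetical.

```perl
use strict;
use warnings;
use 5.010;    # enables the `say` feature (among others)

my $start = time;     # whole seconds since the epoch
say "\n", $start;     # blank line, the timestamp, then say's own newline

# ... the long-running comparison loop would go here ...

my $elapsed = time - $start;
say "elapsed: $elapsed s";
```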

THANK YOU!!!

In reply to Re^4: Huge data file and looping best practices by carillonator
in thread Huge data file and looping best practices by carillonator
