
Thank you, BrowserUk! I am sorry if I was not clear earlier. My first file has 880 columns of 1/0 values and many rows, as many as 50 million. My second file has a single row of 880 columns. I want to compare every bit in file 2 to the corresponding bit in a row of file 1, calculate a value for that row based on the comparison, and repeat the process for all rows.

I have both files in CSV format; they are huge, and I want them converted to binary such that every column in my file is a bit, not a byte, in the binary file. Won't pack convert each column into bytes? As each column holds only 1 or 0, I want it stored in one bit. I want the comparison made and the values calculated on the converted binary files.

I hope this makes sense.

Re^3: Bit handling in Perl
by BrowserUk (Patriarch) on Oct 11, 2014 at 07:00 UTC
    Won't pack convert each column into bytes?

    No. When pack is used with the 'b' template it converts (packs!) each 0 or 1 in the input string to a single bit.

    So for your 880-field CSV, the output is a 110-byte string (880 bits / 8 bits per byte).
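
    A minimal sketch of that conversion, assuming the fields are separated by commas and hold nothing but 0 or 1. The sample row and the variable names are made up for illustration. The separators have to go first, because pack 'b*' would otherwise treat each comma as one more bit:

    ```perl
    use strict;
    use warnings;

    # Hypothetical CSV row: 880 comma-separated fields, each 0 or 1.
    my $csv_row = join ',', map { $_ % 3 ? 0 : 1 } 1 .. 880;

    # Delete every character that is not '0' or '1' (commas, newline, ...).
    ( my $bits = $csv_row ) =~ tr/01//cd;

    # 'b*' packs each '0'/'1' character into a single bit.
    my $packed = pack 'b*', $bits;

    printf "%d fields -> %d bytes\n", length $bits, length $packed;
    # prints "880 fields -> 110 bytes"
    ```

    unpack 'b*', $packed returns the original string of 880 characters, so the conversion loses nothing.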

    Once you've converted both files to binary format, you can compare two records (count the number of bits set in both strings) using:

    my $bitsInCommon = unpack '%32b*', ( $record1 & $record2 );
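
    A small worked example of that line, using two made-up 8-bit records. The XOR variant is an assumption about what "compare" might mean for you: it counts the positions where the two records differ, which is not the same as counting the bits set in both.

    ```perl
    use strict;
    use warnings;

    # Two hypothetical one-byte records.
    my $record1 = pack 'b*', '11001010';
    my $record2 = pack 'b*', '10011000';

    # String bitwise AND keeps only the bits set in both records;
    # '%32b*' returns a 32-bit checksum of the bits, i.e. the number of 1s.
    my $bitsInCommon = unpack '%32b*', ( $record1 & $record2 );

    # String bitwise XOR sets a bit wherever the two records differ.
    my $bitsDiffering = unpack '%32b*', ( $record1 ^ $record2 );

    print "$bitsInCommon $bitsDiffering\n";    # prints "2 3"
    ```

    Both operands must be strings for & and ^ to work bytewise; if either has been used as a number, Perl performs a numeric operation instead.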

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      I tried to find the size of a variable after storing 0 in it, and then the size of its packed equivalent, and the size doubled! Please find my code below. When I assigned $a=0 I got 24 bytes, and after packing it I got 48 bytes.

      #! perl -slw
      use strict;
      use Devel::Size qw(size);

      $a = pack 'b*', 0;
      print "Size of scalar is " . size($a) . " bytes\n";

        That's because the packing process has caused the IV (integer value) that held the 0 to be converted to a PV (string; literally "pointer value"), which carries extra, behind-the-scenes overhead:

        [0] Perl> $a = 0; print size $a;;
        24
        [0] Perl> $b = pack 'b*', 0; print size $b;;
        56
        [0] Perl> print Dump $a;;
        SV = IV(0x3e74270) at 0x3e74278
          REFCNT = 1
          FLAGS = (IOK,pIOK)
          IV = 0
        [0] Perl> print Dump $b;;
        SV = PV(0x11c110) at 0x3e6c6d8
          REFCNT = 1
          FLAGS = (POK,pPOK)
          PV = 0x3ecd828 "\0"\0
          CUR = 1
          LEN = 8

        As you can see, the PV (string) has a couple of extra internal control fields (CUR and LEN) plus a pointer to the actual string, which itself consists of one 0 byte to hold the (single) bit -- you cannot pack less than 8 bits; it's the way computers work! -- and a second 0 byte to ensure that the string is "null terminated", as all strings must be for many C language library routines to work; internally, Perl uses the C runtime library.
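
        A quick way to see the rounding to whole bytes, as a sketch that looks only at the length of the packed string and ignores the scalar's fixed overhead:

        ```perl
        use strict;
        use warnings;

        # pack 'b*' rounds its output up to a whole number of bytes.
        my %bytes_for;
        for my $nbits ( 1, 8, 9, 880 ) {
            my $packed = pack 'b*', '1' x $nbits;
            $bytes_for{$nbits} = length $packed;
            printf "%3d bits -> %3d bytes\n", $nbits, $bytes_for{$nbits};
        }
        ```

        One bit and eight bits both occupy one byte, nine bits spill into a second byte, and 880 bits fit exactly into 110.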

        But, you are being deceived by the simplicity of your test. Let's try something a little more representative of your application:

        [0] Perl> $x = join '', map{ rand() < 0.5 ? 0 : 1 } 1 .. 880;;
        [0] Perl> print size $x;;
        936
        [0] Perl> print Dump $x;;
        SV = PV(0x3e964e0) at 0x337708
          REFCNT = 1
          FLAGS = (POK,pPOK)
          PV = 0x3e8d8c8 "0011011011100111000000101011100011111101100011111101
        +101111001100000111111011010111101100110101101001011000010100111001110
        +110000000100010001111011100010111111111010101001010011110111100110001
        +111100100110000101100001001011100110100011000011011001100101110111100
        +000010100011110010101111110101011111100100100000111110110001101111110
        +111000100000111010101000010001000001110100111110010100100101011011100
        +1010110000110001001011101010010000010001011011
          CUR = 880
          LEN = 888
        [0] Perl> $y = pack 'b*', $x;;
        [0] Perl> print size $y;;
        160
        [0] Perl> print Dump $y;;
        SV = PV(0x3e967b0) at 0x3e6c3f0
          REFCNT = 1
          FLAGS = (POK,pPOK)
          PV = 0x3dff038 "l\347@\35\277\361\3333\370\2557ki(\347\6D\274\243\37
        +7*\345=\343\223\241!\235\305\260\231\356\201\342\251_\375$\370\306~Gp
        +\25\"\270|Jj\247\206\221\256\4\321\16\262?\334\355\22\304S\370nh\325%
        +\232K\376\235\34\35L\377\330\4a6\314_\314\222\336\373\375\371m[\246\2
        +46g\326e\37\36\23j5\2\346\324O\v|\272s\236\3"\0
          CUR = 110
          LEN = 112

        And there you have it. The fixed internal overhead hasn't changed, but the length of the string (CUR) has reduced from 880 to 110, and the actual memory use has reduced from 936 bytes to 160 bytes.
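
        Because every packed row is exactly 110 bytes, the converted file can be read back one fixed-length record at a time. The following is only a sketch under assumptions: the data lives in an in-memory filehandle so the example is self-contained, the rows and the reference record are invented, and the per-row "value" is taken to be the count of bits set in both records. A real file would be opened with the '<:raw' mode instead.

        ```perl
        use strict;
        use warnings;

        my $RECLEN = 110;    # 880 bits / 8 bits per byte

        # Hypothetical reference record (the single row of file 2): all bits set.
        my $reference = pack 'b*', '1' x 880;

        # Hypothetical file 1: three rows with 10, 440 and 880 leading 1 bits.
        my $data = join '',
            map { pack 'b*', ( '1' x $_ ) . ( '0' x ( 880 - $_ ) ) } 10, 440, 880;

        # In-memory filehandle standing in for the converted binary file.
        open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

        my @scores;
        # read() returns the number of bytes read; a short or empty read ends the loop.
        while ( read( $fh, my $record, $RECLEN ) == $RECLEN ) {
            push @scores, unpack '%32b*', ( $record & $reference );
        }
        close $fh;

        print "@scores\n";    # prints "10 440 880"
        ```

        Only one 110-byte record is held in memory at a time, so the number of rows does not affect memory use.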

        And if that all still doesn't make any sense to you; download and spend a week reading this and then come back with any remaining questions.

