Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have a CSV file whose rows each have 880 columns of 1s and 0s. Each column denotes one of 880 properties, and a 1 marks the occurrence of that property for the entity described by the row. I want to write the contents of the CSV file to a binary file.

Each column of a row can easily be represented by a single bit, so I want a binary file that is smaller than the original CSV. I also want to compare every bit of one row in the binary file with the respective bits of every row in a different file and calculate a property value for each row. The second file can have as many as 50 million rows, so I want the property calculation to be fast.

Please tell me whether this can be accomplished in Perl, and which functions I need to explore. Thank you!

Re: Bit handling in Perl
by BrowserUk (Patriarch) on Oct 10, 2014 at 07:30 UTC

    The first part can be done this way:

    #! perl -slw
    use strict;

    # Write raw bytes; -l makes print append a newline to each record.
    binmode STDOUT;

    while( <DATA> ) {
        tr[,\n][]d;             # strip commas and the newline, leaving only 0s and 1s
        print pack 'b*', $_;    # pack each 0/1 character into a single bit
    }
    __DATA__
    1,0,0,1,0,0,1,1,0,0,0,0,1,1,1,0,1,0,1,0,1,0,0,0,1,1,1,0,0,0,0,1,0,0,1,1,0,1,0,0,0,1,1,1,1,1,0,1,1,1,1,1,0,0,0,0,1,1,1,1,1,0,0,1,0,1,0,0,0,1,1,1,0,1,0,0,0,0,0,0,1,0,1,1,0,0,0,0,0,1,0,1,0,0,0,1,0,1,0,0,1,0,0,1,0,0,1,1,0,1,0,1,1,0,1,0,0,1,0,0,1,1,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,0,1,0,0,0,1,0,0,1,1,1,1,1,1,1,1,0,1,0,1,1,0,1,1,1,0,0,1,0,1,0,1,0,1,1,1,1,0,1,1,0,1,0,0,1,0,1,1,0,1,1,0,1,0,1,1,0,1,1,0,1,0,0,0,0,0,1,1,1,0,1,1,0,1,1,0,1,0,0,1,0,1,0,1,0,0,0,1,1,1,1,1,0,1,0,1,0,1,0,0,1,1,0,0,1,0,0,1,0,1,1,1,1,0,1,1,0,1,0,1,1,0,1,1,1,0,1,1,0,0,1,0,0,0,1,0,1,1,1,1,0,0,1,0,0,0,0,0,1,0,0,1,1,1,1,1,1,1,0,0,0,1,1,1,1,0,0,0,0,0,1,0,1,0,0,0,1,1,0,1,0,1,0,1,0,1,1,1,0,0,0,0,0,1,0,1,1,0,0,0,0,0,1,1,0,0,0,1,0,1,1,1,1,1,0,0,1,0,0,0,1,1,1,1,0,0,1,1,0,1,0,1,0,0,1,1,1,0,1,0,0,1,1,0,0,1,0,0,0,0,1,1,1,1,1,1,1,0,0,0,1,1,1,1,1,0,0,0,1,1,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,1,1,1,1,1,0,1,0,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,1,0,0,1,1,1,0,1,1,0,0,1,0,1,0,1,1,1,1,1,0,0,0,1,0,0,1,0,0,0,1,1,0,0,0,0,0,1,1,0,0,1,1,0,1,1,0,1,0,0,0,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,1,0,0,1,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,1,0,0,1,0,0,1,0,1,1,0,0,0,1,1,0,0,1,1,1,0,0,0,0,0,0,1,0,0,1,0,0,1,0,1,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,1,0,0,0,1,1,0,1,1,1,0,0,0,0,0,1,1,1,0,1,1,0,1,1,0,0,0,0,0,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,1,1,0,0,1,0,0,1,1,0,1,1,1,1,0,1,0,0,1,0,0,1,0,0,0,1,0,1,0,1,0,1,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,1,0,1,0,0,0,0,1,0,1,0,1,1,1,0,0,0,0,1,1,1,0,1,0,1,1,1,0,0,1,0,0,1,1,1,1,0,0,0,1,1,1,1,0,1,1,0,0,1,1,1,1,1,1,1,1,0,0,1,1,1,0,0,0,0,0,1,0,0,1,1,0,1,1,0,0,0,1,0,1,1,1,0,0,1,0,1,0,0,1,0,1,0,0,0,1,0,1,0,1,1,1,1,1,0,0,0,0,0,1,0,0,1,0,0,1,1,0,1,1,0,0,0,1,1,0,0,0,1,1,1,0,1,1,0,1,1,1,1,0,0,0,0,1,1,0,1,0,1,0,0,0,1,0,1,1,0,0,1,1,1,0,1,0,0,1,1,1,1,0,1,0,1,0,0,1,0,0,1,1,1,1
    0,1,0,1,1,1,1,1,0,0,0,0,1,1,1,1,0,0,1,0,1,1,1,1,1,1,1,0,1,0,0,1,1,1,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,1,1,0,0,1,0,0,0,0,1,0,1,1,1,1,0,0,1,0,0,1,1,1,1,1,0,1,0,1,0,1,1,0,1,0,0,0,1,0,0,0,1,1,1,0,0,0,1,1,0,1,0,1,0,0,0,1,1,0,1,0,1,0,1,1,1,1,0,0,0,1,0,1,0,1,1,1,0,1,1,1,0,0,1,0,1,0,0,1,0,0,1,1,0,0,0,0,1,1,0,0,1,0,1,1,0,1,1,1,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,1,1,1,0,1,0,0,1,0,1,1,1,1,0,1,1,1,0,1,1,1,1,0,1,1,0,0,0,1,1,0,0,0,0,1,0,0,1,1,0,0,1,0,1,1,0,0,1,0,1,1,0,0,1,1,1,1,0,1,0,1,0,0,0,1,1,1,1,0,1,1,1,0,0,1,0,0,0,0,0,1,1,1,0,0,1,0,1,1,0,0,0,1,1,1,0,1,0,0,1,1,1,1,0,0,1,0,1,1,1,1,1,1,0,1,0,1,0,1,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0,1,1,1,0,0,1,1,0,0,0,0,1,0,1,0,0,1,1,1,1,1,0,1,1,0,1,0,1,0,1,1,1,1,0,1,0,0,0,1,0,0,0,1,1,1,1,0,1,0,0,1,1,0,1,0,0,1,1,0,0,1,0,1,1,1,1,0,1,1,0,1,1,1,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,1,1,0,0,0,0,0,0,1,0,0,0,1,0,1,1,0,1,1,1,0,1,0,0,0,0,0,0,1,1,1,0,1,0,1,0,0,0,0,1,1,0,1,0,0,1,0,1,0,1,1,0,1,1,1,0,0,0,0,0,1,0,0,1,1,0,0,0,0,1,1,0,0,0,1,0,0,1,0,1,0,0,0,1,1,0,1,1,1,0,1,0,1,1,1,1,0,1,0,1,0,1,1,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,0,1,0,1,1,1,1,1,0,1,1,0,0,0,0,0,0,1,0,1,1,1,1,0,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,1,0,0,0,1,0,0,1,1,1,1,1,1,1,0,0,1,0,1,1,1,1,1,0,1,1,1,0,0,1,1,0,1,0,1,0,0,0,0,0,0,0,1,1,1,1,0,0,1,0,0,0,0,1,1,1,0,0,0,0,1,0,0,0,0,0,1,0,1,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,1,1,0,1,0,1,1,1,0,1,1,1,0,1,0,1,0,0,1,0,1,0,0,0,0,0,1,0,0,1,1,1,1,1,1,1,1,1,0,1,1,1,0,1,0,1,1,0,0,1,0,0,0,1,1,0,1,1,0,0,1,0,0,1,0,0,0,0,0,0,1,0,1,1,1,0,1,0,0,1,1,1,0,0,1,0,1,1,0,0,1,1,0,1,1,0,1,0,0,1,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,0,1,1,1,0,1,0,0,1,1,0,0,0,1,0,1,1,1,0,1,0,1,1,1,0,1,0,0,0,0,1,0,0,0,1,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,1,1,0,1,1,1,1,1,1,1,1,1,1,0,1,1,0,0,0,1,1,1,0,1,0,1,0,1

    That assumes you want each binary record terminated with a newline. (Remove the 'l' from the shebang line if not.)

    For the second part, you'll have to describe the format of the second file in more detail.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Thank you BrowserUk! I am sorry if I was not clear earlier. My first file has 880 columns of 1/0 values and many rows, as many as 50 million. My second file has a single row of 880 columns. I want to compare every bit in file 2 to the corresponding bit in a row of file 1, calculate a value for that row based on the comparison, and repeat the process for all rows.

      I have both files in CSV format; they are huge, and I want them converted to binary such that every column in my file becomes a bit, not a byte, in the binary file. Won't pack convert each column into bytes? As each column holds only 1 or 0, I want it stored in one bit. I want the comparison made and the values calculated on the converted binary files.

      I hope this makes sense.

        Won't pack convert each column into bytes?

        No. When pack is used with the 'b' template it converts (packs!) each 0 or 1 in the input string to a single bit.

        So for each 880-field row of your CSV, the output is a 110-byte string (880 / 8 = 110).
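
        As a rough illustration of the size claim above, here is a minimal sketch that packs one made-up 880-column row and expands it again. The helper names pack_row and unpack_row and the sample row are assumptions for illustration only; they are not from the thread.

```perl
use strict;
use warnings;

# Pack one CSV row of 0/1 fields into a bit string (one bit per field).
# (Hypothetical helper, not part of the original reply.)
sub pack_row {
    my ($line) = @_;
    $line =~ tr/01//cd;               # delete everything except 0 and 1
    return pack 'b*', $line;
}

# Expand a packed record back into a CSV row of $ncols fields.
sub unpack_row {
    my ($packed, $ncols) = @_;
    my $bits = unpack "b$ncols", $packed;
    return join ',', split //, $bits;
}

# Made-up sample row: every third column set.
my $row    = join ',', map { $_ % 3 ? 0 : 1 } 1 .. 880;
my $packed = pack_row($row);

printf "CSV row: %d bytes, packed: %d bytes\n", length($row), length($packed);
print "round trip ok\n" if unpack_row($packed, 880) eq $row;
```

        For this sample the CSV row is 1759 bytes (880 digits plus 879 commas) and the packed record is 110 bytes.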

        Once you've converted both files to binary format, you can compare two records (count the number of bits set in both strings) using:

        my $bitsInCommon = unpack '%32b*', ( $record1 & $record2 );
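
        To show how that one-liner might be applied across a whole file, here is a hedged sketch. It assumes the records were written without the trailing newline, so each is exactly 110 bytes and can be read with read. The names common_counts and popcount are hypothetical, and an in-memory filehandle stands in for the real binary file.

```perl
use strict;
use warnings;

my $NCOLS  = 880;
my $RECLEN = $NCOLS / 8;    # 110 bytes per packed record

# Count the set bits in a packed string using unpack's checksum feature.
sub popcount { return unpack '%32b*', $_[0] }

# Read fixed-length records from $fh and return, for each record, the
# number of bits it has in common with $query.
sub common_counts {
    my ($fh, $query) = @_;
    my @counts;
    while ( my $got = read($fh, my $rec, $RECLEN) ) {
        die "short record\n" if $got != $RECLEN;
        push @counts, popcount($rec & $query);    # string-wise bitwise AND
    }
    return @counts;
}

# Made-up data: a query with every column set, and three records with
# 0, 5 and 880 columns set. A real file would be opened with '<:raw'.
my $query = pack 'b*', '1' x $NCOLS;
my $data  = join '', map { pack 'b*', ('1' x $_) . ('0' x ($NCOLS - $_)) } 0, 5, 880;

open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
print join(',', common_counts($fh, $query)), "\n";    # prints 0,5,880
```

        Fixed-length records avoid the ambiguity of newline terminators, since a packed record may itself contain a byte equal to "\n".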

Re: Bit handling in Perl
by Ea (Chaplain) on Oct 10, 2014 at 08:37 UTC
    Another option is to look at vec, which is a little more general and may give you access to a column represented in the binary file without unpacking. I have no deep knowledge of which is the better approach, and using pack as above is certainly a good option.

    Sometimes I can think of 6 impossible LDAP attributes before breakfast.

      If I understand the OP correctly, he wants to compare all of the bits in each record in file one with the corresponding bits in every record (50M) of file two.

      Whilst iterating over the 880 bits in each vector using a loop and vec is certainly possible, it is really quite slow.

      Indeed, it is much, much slower than performing such comparisons en masse using Perl's bitwise boolean string operators.
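
      To make the contrast concrete, here is a small sketch of both approaches computing the same count. The function names and the sample bit patterns are made up for illustration, and no timings are claimed here; the core Benchmark module could be used to measure the difference on your own data.

```perl
use strict;
use warnings;

# Bit-by-bit approach: test every position individually with vec.
sub common_bits_vec {
    my ($x, $y, $nbits) = @_;
    my $n = 0;
    for my $i (0 .. $nbits - 1) {
        $n++ if vec($x, $i, 1) && vec($y, $i, 1);
    }
    return $n;
}

# Bulk approach: AND the whole strings, then count the set bits
# with unpack's checksum feature.
sub common_bits_bulk {
    my ($x, $y) = @_;
    return unpack '%32b*', $x & $y;
}

# Two made-up 880-bit records.
my $x = pack 'b*', '1100' x 220;
my $y = pack 'b*', '1010' x 220;

printf "vec loop: %d, bulk: %d\n",
    common_bits_vec($x, $y, 880), common_bits_bulk($x, $y);
```

      Both functions report 220 for these records, since each group of four bits shares exactly one set bit.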



      Note that the "b" template for pack packs in ascending bit order within each byte, which matches vec, so it probably makes sense to create each record using pack and to access individual bits (columns) within a record using vec.
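
      A minimal sketch of that combination, assuming a made-up 10-column row; the helper name column is hypothetical:

```perl
use strict;
use warnings;

# Return column $col (0-based) of a record packed with the 'b' template.
sub column {
    my ($packed, $col) = @_;
    return vec($packed, $col, 1);
}

my $csv = '1,0,0,1,0,0,1,1,0,1';      # made-up 10-column row
(my $bits = $csv) =~ tr/,//d;         # "1001001101"
my $packed = pack 'b*', $bits;

# Reading the columns back one at a time reproduces the original row.
print join(',', map { column($packed, $_) } 0 .. length($bits) - 1), "\n";
```

      This relies on the lowercase 'b' template; the uppercase 'B' template packs in descending bit order within each byte, so its bit positions would not line up with vec offsets in the same way.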

      Cheers,

      JohnGG