snape has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, <\p>

I am having a very large tab delimited file and I am trying to parse the odd and even columns separately. I have stored the first column in the array and want to store the rest of the columns in two separate hash of arrays or some other data structure so that I can compare them in a quick time. Can you please help in this regard ? Also, it will be great if you can advise me some data structure which I can use that doesn't have any problems in storing as what arrays has.

My code:
while (<$outFile>) { chomp; $lno++; @line = split(/\t/, $_); ## $line[0] is for the first column push (@idmatrix, $line[0]) if (0 == $lno % 2){ ## Trying to store the even columns apart from the first c +olumn values } else if (1 == $lno % 2){ ## Trying to store the odd columns apart from the first co +lumn values } } } shift @idmatrix; ## To delete the first row (Is there any other way to ## delete the first row of the file)

Thanks. <\p>

  • Comment on Extract the odd and even columns seperately in the hash of arrays or some other data structure apart from arrays
  • Download Code

Replies are listed 'Best First'.
Re: Extract the odd and even columns seperately in the hash of arrays or some other data structure apart from arrays
by toolic (Bishop) on Jan 24, 2010 at 00:09 UTC
    You will get good advice if you provide more information, such as:
    • How many even columns do you have?
    • How many odd columns do you have?
    • Are you just comparing items on the same line? If so, there may be no need to read all lines of your large input file into memory at once.
    • Or, do you need to compare an item on one line to items on all other lines?
    • Are you comparing strings or numbers?
    • What are you using the 1st column for?
    • Show a small sample of your input (fewer than 10 lines, fewer than 10 columns). Creating a small sample for yourself will make it easier for you to debug your own code, so the extra effort should pay off.
    See also perldsc.

    Update: here is some example code to read the whole file into a data structure (HoHoA):

    use warnings; use strict; use Data::Dumper; my %data; while (<DATA>) { my ($id, @cols) = split; for my $i (0 .. $#cols) { my $type = ($i % 2) ? 'odd' : 'even'; push @{ $data{$id}{$type} } , $cols[$i]; } } print Dumper(\%data); __DATA__ a 1 2 3 4 5 6 b 9 8 7 6 5 4
    prints:

      Answers to ur questions are as follows:

      a. How many even columns do you have?

      b. How many odd columns do you have?

      c. Are you just comparing items on the same line? If so, there may be no need to read all lines of your large input file into memory at once.

      d. Or, do you need to compare an item on one line to items on all other lines?

      e. What are you using the 1st column for?

      Ans a-e: There are more than 1 million rows and columns. The first row is about the entity names and therefore, I want to remove that. Then, I want to compare the entire row with all the other rows and would like to know where I have same match. I am also not reading the entire file at once. I am using file handler and while loop for reading it one per line.

      f. Are you comparing strings or numbers? Ans f: I am comparing the number or taking the difference between the two.

      g.Show a small sample of your input (fewer than 10 lines, fewer than 10 columns). Creating a small sample for yourself will make it easier for you to debug your own code, so the extra effort should pay off.

      Iid A1 A2 A3 A4 A5 A6 A7 A8 12 1 2 1 2 1 1 1 1 12 2 1 2 2 1 1 1 1 15 2 1 2 2 1 1 1 1 15 2 1 2 1 1 2 1 1 16 2 1 2 1 1 2 1 1 16 2 1 2 2 1 1 1 1 19 2 1 2 1 1 2 1 1 19 1 2 1 2 1 1 1 1 116 1 2 2 2 1 1 1 1 116 2 1 2 1 1 2 1 1

      You will see that the 2 rows have same name, so I would like to take see the match between the rows at same column position i.e. first row of 12 with both the rows of 15, 16, 19 and 116. Similarly second column of 12 with both the columns of 15, 16, 19 and 116. Thanks a lot for sharing your views.

        Please give us an "expected output".
Re: Extract the odd and even columns seperately in the hash of arrays or some other data structure apart from arrays
by johngg (Canon) on Jan 24, 2010 at 01:16 UTC

    Something like this may do what you want.

    $ perl -E ' > @arr = ( 0 .. 9 ); > say qq{@arr}; > push @{ $_ % 2 ? \ @odd : \ @even }, $arr[ $_ ] > for 0 .. $#arr; > say qq{@even\n@odd};' 0 1 2 3 4 5 6 7 8 9 0 2 4 6 8 1 3 5 7 9 $

    I hope this is helpful.

    Cheers,

    JohnGG

Re: Extract the odd and even columns seperately in the hash of arrays or some other data structure apart from arrays
by BrowserUk (Patriarch) on Jan 24, 2010 at 19:58 UTC

    I have to say, that even after you've posted three different explanations of your requirements, they are still far from clear. This produces what I think you are after:

    #! perl -slw use strict; my %data; while( <DATA> ) { chomp; my @cols = split ' '; push @{ $data{ $cols[ 0 ] } }, pack 'C*', @cols[ 1 .. $#cols ]; } my @keys = sort{ $a <=> $b } keys %data; for my $i ( 0 ..$#keys ) { my $key1 = $keys[ $i ]; for my $keyset1 ( @{ $data{ $key1 } } ) { for my $key2 ( @keys[ $i+1 .. $#keys ] ) { for my $keyset2 ( @{ $data{ $key2 } } ) { my $mask = $keyset1 ^ $keyset2; next unless 1+ index $mask, chr( 0 ); printf "%3d : %3d : ", $key1, $key2; print join ', ', map{ substr( $mask, $_, 1 ) eq chr(0) ? $_+1 : () } 0 .. length( $mask )-1; } } } } __DATA__ 12 1 2 1 2 1 1 1 1 12 2 1 2 2 1 1 1 1 15 2 1 2 2 1 1 1 1 15 2 1 2 1 1 2 1 1 16 2 1 2 1 1 2 1 1 16 2 1 2 2 1 1 1 1 19 2 1 2 1 1 2 1 1 19 1 2 1 2 1 1 1 1 116 1 2 2 2 1 1 1 1 116 2 1 2 1 1 2 1 1

    Gives:

    c:\test>819256 12 : 15 : 4, 5, 6, 7, 8 12 : 15 : 5, 7, 8 12 : 16 : 5, 7, 8 12 : 16 : 4, 5, 6, 7, 8 12 : 19 : 5, 7, 8 12 : 19 : 1, 2, 3, 4, 5, 6, 7, 8 12 : 116 : 1, 2, 4, 5, 6, 7, 8 12 : 116 : 5, 7, 8 12 : 15 : 1, 2, 3, 4, 5, 6, 7, 8 12 : 15 : 1, 2, 3, 5, 7, 8 12 : 16 : 1, 2, 3, 5, 7, 8 12 : 16 : 1, 2, 3, 4, 5, 6, 7, 8 12 : 19 : 1, 2, 3, 5, 7, 8 12 : 19 : 4, 5, 6, 7, 8 12 : 116 : 3, 4, 5, 6, 7, 8 12 : 116 : 1, 2, 3, 5, 7, 8 15 : 16 : 1, 2, 3, 5, 7, 8 15 : 16 : 1, 2, 3, 4, 5, 6, 7, 8 15 : 19 : 1, 2, 3, 5, 7, 8 15 : 19 : 4, 5, 6, 7, 8 15 : 116 : 3, 4, 5, 6, 7, 8 15 : 116 : 1, 2, 3, 5, 7, 8 15 : 16 : 1, 2, 3, 4, 5, 6, 7, 8 15 : 16 : 1, 2, 3, 5, 7, 8 15 : 19 : 1, 2, 3, 4, 5, 6, 7, 8 15 : 19 : 5, 7, 8 15 : 116 : 3, 5, 7, 8 15 : 116 : 1, 2, 3, 4, 5, 6, 7, 8 16 : 19 : 1, 2, 3, 4, 5, 6, 7, 8 16 : 19 : 5, 7, 8 16 : 116 : 3, 5, 7, 8 16 : 116 : 1, 2, 3, 4, 5, 6, 7, 8 16 : 19 : 1, 2, 3, 5, 7, 8 16 : 19 : 4, 5, 6, 7, 8 16 : 116 : 3, 4, 5, 6, 7, 8 16 : 116 : 1, 2, 3, 5, 7, 8 19 : 116 : 3, 5, 7, 8 19 : 116 : 1, 2, 3, 4, 5, 6, 7, 8 19 : 116 : 1, 2, 4, 5, 6, 7, 8 19 : 116 : 5, 7, 8

    Note: Somewhere you say "There are more than 1 million rows and columns.". If by that you mean (say) 100,000 rows of 10 columns or 10,000 rows x 100 columns; then the above code may work if you have 2 or 3 GB of RAM.

    If you actually mean 1,000,000 rows X 1,000,000 columns, then you've got a real problem on your hands.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks a lot. I was actually looking for a way to use hash of arrays and u helped me in that. I appreciate it. I can actually build it using it and yeah I have 1,000,000 rows X 1,000,000 columns and sorry for complex write up.

      Hi,

      I didn't understand following parts of your program:

      1.

      push @{ $data{ $cols[ 0 ] } }, pack 'C*', @cols[ 1 .. $#cols ];

      2.

      my $mask = $keyset1 ^ $keyset2; next unless 1+ index $mask, chr( 0 ); printf "%3d : %3d : ", $key1, $key2; print join ', ', map{ substr( $mask, $_, 1 ) eq chr(0) ? $_+1 : () } 0 .. length( $mask )-1;

      I didn't understand the concept of map and $mask. When I trying to print the hash of arrays it is not giving me any numbers. Can you please help with it. Also, how can I use String::LCSS_XS on the hash of arrays (%data) to get the longest common substring between the various of rows apart from the ones having same id number.

      Thanks.
        1. push @{ $data{ $cols[ 0 ] } }, pack 'C*', @cols[ 1 .. $#cols ];

          All your data values appear to be small positive integers. so rather than store them as an array, I chose to save some space (and given the size of the dataset you mentioned), and gain some considrable performance, by storing them as an array of chars. Ie. As strings.

          This will only work if all your values are in the range 0 .. 255. However, you could also use pack 'S' for values 0 .. 65535 to gain similar benefits. If all your values are very small, then you might pack them even more tightly, though the benefits are less easily achieved.

          The benefits of the compacted representation include:

          • Space:

            An ordinary array of 1000 integers requires just under 32k on my 64-bit system. A thousand byte string requires just 1056 bytes.

            That's less than 4% of the space requirement.

          • Speed:

            I can detect if any two corresponding bytes in two 1000-bytes string are equal in 0.000003 seconds.

            Doing the same thing using two 1000-element arrays takes 0.00025 seconds.

            That's just 1.2% of the time. Or over 80 times faster.

        2. my $mask = $keyset1 ^ $keyset2;

          This exclusive-ORs the two arrays of ints being compared, together (as strings). This effectively compares each corresponding pair of integers in both arrays (strings) in a single, very fast operation. If the two integers are equal, then the corresponding byte in $mask will be set to chr(0). If the are different, the corresponding byte will be non-zero.

          This allows a a very fast check for there being any correspondance between the arrays...

        3. next unless 1+ index $mask, chr( 0 );

          This uses index to search $mask for any zero byte. If none is found, then it skips to the next pair of arrays.

          If one (one more) zero bytes are found...

        4. printf "%3d : %3d : ", $key1, $key2;

          Print the ids of the two datasets being compared and ...

        5. print join ', ', map{ substr( $mask, $_, 1 ) eq chr(0)  ? $_+1 : () } 0 .. length( $mask )-1;

          Print a list of all the column (array index+1) positions where the corresponding integers are equal.

          Ie. For each character position in the mask, if the byte equals chr(0), then the corresponding integers in the datasets are a match. As the mask indexes are zero-based, but the column indexes are 1-based, add 1 to each matching index.

          join them with commas and print them out.

        I hope that will clarify both the logic, and the reasoning behind the code.

        Also, how can I use String::LCSS_XS on the hash of arrays (%data) to get the longest common substring between the various of rows apart from the ones having same id number.

        If all your integers are in the range 0 .. 255, and can therefore be packed into strings, then it is very fortuitous, as String::LCSS_XS only operates upon strings :)

        It also makes modifying my posted code to use it very simple:

        #! perl -slw use strict; use String::LCSS_XS qw[ lcss ]; my %data; while( <DATA> ) { chomp; my @cols = split ' '; push @{ $data{ $cols[ 0 ] } }, pack 'C*', @cols[ 1 .. $#cols ]; } my @keys = sort{ $a <=> $b } keys %data; for my $i ( 0 ..$#keys ) { my $key1 = $keys[ $i ]; for my $keyset1 ( @{ $data{ $key1 } } ) { for my $key2 ( @keys[ $i+1 .. $#keys ] ) { for my $keyset2 ( @{ $data{ $key2 } } ) { my( $s, $i1, $i2 ) = lcss( $keyset1, $keyset2 ); printf "%4d(%4d) - %4d(%4d) : %s\n", $key1, $i1, $key2, $i2, join ', ', unpack 'C*', $s; } } } } __DATA__ 12 1 2 1 2 1 1 1 1 12 2 1 2 2 1 1 1 1 15 2 1 2 2 1 1 1 1 15 2 1 2 1 1 2 1 1 16 2 1 2 1 1 2 1 1 16 2 1 2 2 1 1 1 1 19 2 1 2 1 1 2 1 1 19 1 2 1 2 1 1 1 1 116 1 2 2 2 1 1 1 1 116 2 1 2 1 1 2 1 1

        Which produces this from your sample input:

        c:\test>819256 ## Note that the offsets are zero-based excluding the ID field #id1 offset1 id2 offset2 values 12( 3) - 15( 3) : 2, 1, 1, 1, 1 12( 1) - 15( 0) : 2, 1, 2, 1, 1 12( 1) - 16( 0) : 2, 1, 2, 1, 1 12( 3) - 16( 3) : 2, 1, 1, 1, 1 12( 1) - 19( 0) : 2, 1, 2, 1, 1 12( 0) - 19( 0) : 1, 2, 1, 2, 1, 1, 1, 1 12( 3) - 116( 3) : 2, 1, 1, 1, 1 12( 1) - 116( 0) : 2, 1, 2, 1, 1 12( 0) - 15( 0) : 2, 1, 2, 2, 1, 1, 1, 1 12( 0) - 15( 0) : 2, 1, 2 12( 0) - 16( 0) : 2, 1, 2 12( 0) - 16( 0) : 2, 1, 2, 2, 1, 1, 1, 1 12( 0) - 19( 0) : 2, 1, 2 12( 3) - 19( 3) : 2, 1, 1, 1, 1 12( 2) - 116( 2) : 2, 2, 1, 1, 1, 1 12( 0) - 116( 0) : 2, 1, 2 15( 0) - 16( 0) : 2, 1, 2 15( 0) - 16( 0) : 2, 1, 2, 2, 1, 1, 1, 1 15( 0) - 19( 0) : 2, 1, 2 15( 3) - 19( 3) : 2, 1, 1, 1, 1 15( 2) - 116( 2) : 2, 2, 1, 1, 1, 1 15( 0) - 116( 0) : 2, 1, 2 15( 0) - 16( 0) : 2, 1, 2, 1, 1, 2, 1, 1 15( 0) - 16( 0) : 2, 1, 2 15( 0) - 19( 0) : 2, 1, 2, 1, 1, 2, 1, 1 15( 0) - 19( 1) : 2, 1, 2, 1, 1 15( 2) - 116( 3) : 2, 1, 1 15( 0) - 116( 0) : 2, 1, 2, 1, 1, 2, 1, 1 16( 0) - 19( 0) : 2, 1, 2, 1, 1, 2, 1, 1 16( 0) - 19( 1) : 2, 1, 2, 1, 1 16( 2) - 116( 3) : 2, 1, 1 16( 0) - 116( 0) : 2, 1, 2, 1, 1, 2, 1, 1 16( 0) - 19( 0) : 2, 1, 2 16( 3) - 19( 3) : 2, 1, 1, 1, 1 16( 2) - 116( 2) : 2, 2, 1, 1, 1, 1 16( 0) - 116( 0) : 2, 1, 2 19( 2) - 116( 3) : 2, 1, 1 19( 0) - 116( 0) : 2, 1, 2, 1, 1, 2, 1, 1 19( 3) - 116( 3) : 2, 1, 1, 1, 1 19( 1) - 116( 0) : 2, 1, 2, 1, 1
        HTH.

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.