yungGH has asked for the wisdom of the Perl Monks concerning the following question:

My only experience with Perl is reading the first 11 chapters of Learning Perl 3rd edition last weekend. I'm running ActiveState's win32 5.005_03 version of Perl. I've worked out many exercises from Learning Perl, but I'm having trouble figuring out how to start (and finish) the following:

Take a tab separated text file with approximately 100,000 lines. Each line has 5 fields: sampleID, subID, testtype, result, resultval. For example:

12345 543D3 17 1 12.3
12345 543D3 17 2 9
12345 543D3 18 1 17.2
45678 543D3 17 1 12.3
45678 543D3 17 2 9
67890 775G2 17 1 12.3
67890 775G2 17 2 9
67890 775G2 18 1 17.2

I would like to transform the file to the following:
12345 543D3 17 1 12.3 17 2 9 18 1 17.2
45678 543D3 17 1 12.3 17 2 9
67890 775G2 17 1 12.3 17 2 9 18 1 17.2
Then I would like to perform pairwise line comparisons in the transformed file to determine if all of the test results of two lines are the same. All three lines in the transformed file match at all test results determined in the example above. Of course there would be many lines in the transformed file that don't match the test results of lines 1 - 3.

Example report:
sampleID test results match: 12345, 45678, 67890
sampleID test results match: xxxxx, yyyyy
subID test results match: 543D3
subID test results match: xxxAx

Where to start: I've learned enough to open and read each line in a file (to print it, add it to an array or hash, etc). I've learned how to use simple regular expressions to write out a new file with all lines that match a specific string. The leap I need to make is how to take several lines from a file and write a new single line (the lines would each have the same value in the sampleID field), and how to perform pairwise comparisons of one line in a file against all other lines in the file (and then take the second line and compare it against all other lines, etc).

BTW this is not a homework problem. Where should I start? A particular tutorial or manpage? Any code example would be truly appreciated.

Respectfully,
yungGH

Edited by mirod, 2003-02-13: changed the title

Replies are listed 'Best First'.
Re: First Post
by BrowserUk (Patriarch) on Feb 13, 2003 at 04:07 UTC

    As the data you wish to compare is the accumulated (testtype, result & resultval) and the information you wish to extract is the sampleIDs and subIDs, possibly the best method would be to build a hash of arrays of arrays (HoAoA) as you read your data in. The key would be the accumulated test data, and the array of arrays would contain one array of sampleIDs and one of subIDs. Doing it this way, there is no searching to be done once the file has been read in and the structure built. For your sample data this might look like this:

    %data = (
        '17 1 12.3 17 2 9 18 1 17.2' => [ [ 12345, 67890 ], [ '543D3', '775G2' ] ],
        '17 1 12.3 17 2 9'           => [ [ 45678 ], [ '543D3' ] ],
    );

    The values for your report can then be read out of the nested arrays directly with no further searching, sorting or matching.
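
As a rough illustration of reading the report straight out of such a structure, here is a minimal sketch. The helper name report_lines is hypothetical, the IDs are quoted as strings, and the output wording is simply copied from the example report in the question.

```perl
use strict;

# The structure from the reply above, with every ID quoted as a string.
my %data = (
    '17 1 12.3 17 2 9 18 1 17.2' => [ [ '12345', '67890' ], [ '543D3', '775G2' ] ],
    '17 1 12.3 17 2 9'           => [ [ '45678' ], [ '543D3' ] ],
);

# Hypothetical helper: for every results key, emit one line listing the
# sampleIDs and one line listing the subIDs stored under that key.
sub report_lines {
    my ($data) = @_;
    my @out;
    foreach my $key (sort keys %$data) {
        my ($sample_ids, $sub_ids) = @{ $data->{$key} };
        push @out, "sampleID test results match: " . join(', ', @$sample_ids);
        push @out, "subID test results match: "    . join(', ', @$sub_ids);
    }
    return @out;
}

foreach my $line (report_lines(\%data)) {
    print "$line\n";
}
```

A group holding a single sampleID is not really a "match"; skipping keys whose first inner array has only one element would be one way to leave those out.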

    Assuming that your input file is as you have shown it, sorted by sampleID with each set of values following in a consistent order, building this data structure is a simple one-pass linear affair.

    1. open file
    2. Initialise $prevID, $results, @sampleIDs, @subIDs, %data;
    3. Read a line from the file <FILE>, while lines to read
    4. split into $sampleID, $subID, @rest
    5. If the $sampleID matches the previous one, join the latest set of results to the previous results and push the two IDs into their respective arrays. Go to the next line;
    6. else add an entry to the hash using the accumulated $results string as the key and pushing an anonymous array containing

      [ [@sampleIDs], [@subIDs] ].

      Set $prevID = $sampleID; @sampleIDs = $sampleID; @subIDs = $subID; $results = join ' ', @rest;

    7. loop;
    8. close file;
    9. for each of the values in %data;
    10. @{$_->[0]} is the array of sampleIDs that had this set of matching results
    11. @{$_->[1]} is the array of subIDs that had these results.
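
The steps above could be sketched roughly as follows. This is one possible reading of the pseudocode, not a definitive implementation: it records one sampleID and subID per finished group, flushes a group whenever the sampleID changes and once more at the end, and takes its lines from a list instead of a file so the example stays self-contained. The names build_index and $flush are hypothetical.

```perl
use strict;

# Hypothetical helper: takes tab-separated lines sorted by sampleID and
# returns a hash ref of: results string => [ [sampleIDs], [subIDs] ].
sub build_index {
    my @lines = @_;
    my %data;
    my ($prev_id, $prev_sub, @results);

    # Store the finished group under its accumulated results string.
    my $flush = sub {
        return unless defined $prev_id;
        my $key = join ' ', @results;
        push @{ $data{$key}[0] }, $prev_id;
        push @{ $data{$key}[1] }, $prev_sub;
    };

    foreach my $line (@lines) {
        chomp $line;
        my ($sample_id, $sub_id, @rest) = split /\t/, $line;
        if (defined $prev_id and $sample_id ne $prev_id) {
            $flush->();       # sampleID changed: file the previous group
            @results = ();
        }
        ($prev_id, $prev_sub) = ($sample_id, $sub_id);
        push @results, @rest;
    }
    $flush->();               # the last group has no following line to trigger it
    return \%data;
}

# The sample data from the question, joined with tabs.
my @sample = map { join "\t", @$_ } (
    [qw(12345 543D3 17 1 12.3)], [qw(12345 543D3 17 2 9)], [qw(12345 543D3 18 1 17.2)],
    [qw(45678 543D3 17 1 12.3)], [qw(45678 543D3 17 2 9)],
    [qw(67890 775G2 17 1 12.3)], [qw(67890 775G2 17 2 9)], [qw(67890 775G2 18 1 17.2)],
);

my $index = build_index(@sample);
foreach my $key (sort keys %$index) {
    print "sampleID test results match: ", join(', ', @{ $index->{$key}[0] }), "\n";
}
```

Note that a subID can appear more than once in its list if two samples share both the subID and the results; removing duplicates is left out of this sketch.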

    Hopefully that pseudo code and the other answers will give you enough clues to get you going. If you get stuck, come back with what you have and someone will nudge you along:)


    Examine what is said, not who speaks.

    The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Re: First Post
by belg4mit (Prior) on Feb 13, 2003 at 02:20 UTC
    I think perhaps perllol might be helpful. It seems the first two columns are your key, and you wish to concatenate lines with the same key. So, iterating over all records, push the data columns of each line onto a list stored in a hash record whose key is the first two columns.
    push @{ $hash{"$line[0]_$line[1]"} }, [ @line[2..4] ];
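
To show how that push might fit into a complete loop, here is a small sketch under a few assumptions: the input lines are tab-separated, neither ID contains an underscore, and the helper names group_by_key and transformed_lines are made up for illustration.

```perl
use strict;

# Hypothetical helper: group the three data columns of every line under
# the key "sampleID_subID", remembering the order in which keys first appear.
sub group_by_key {
    my @lines = @_;
    my (%hash, @order);
    foreach my $text (@lines) {
        chomp $text;
        my @line = split /\t/, $text;
        my $key  = "$line[0]_$line[1]";
        push @order, $key unless exists $hash{$key};
        push @{ $hash{$key} }, [ @line[2..4] ];
    }
    return (\%hash, \@order);
}

# Hypothetical helper: flatten each key's list of lists into one output line.
sub transformed_lines {
    my ($hash, $order) = @_;
    return map {
        my $key = $_;
        my ($sample_id, $sub_id) = split /_/, $key, 2;
        join "\t", $sample_id, $sub_id, map { @{$_} } @{ $hash->{$key} };
    } @$order;
}

# A few lines of the sample data from the question, joined with tabs.
my @sample = map { join "\t", @$_ } (
    [qw(12345 543D3 17 1 12.3)], [qw(12345 543D3 17 2 9)], [qw(12345 543D3 18 1 17.2)],
    [qw(45678 543D3 17 1 12.3)], [qw(45678 543D3 17 2 9)],
);

my ($hash, $order) = group_by_key(@sample);
foreach my $out (transformed_lines($hash, $order)) {
    print "$out\n";
}
```

Keeping the separate @order list is a design choice for this sketch: hash keys come back in no particular order, so the list preserves the order of the input file.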

    --
    I'm not belgian but I play one on TV.

Re: First Post
by BUU (Prior) on Feb 13, 2003 at 02:13 UTC
    If you can separate the values you're comparing into two arrays (one for each file) then you could do something like this:
    grep { my $x = $_; grep { $_ eq $x } @array1 } @array2;
    Which will return a list of all the elements in @array2 that match an element in @array1.
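
A minimal runnable sketch of the nested grep, with made-up array contents. The hash-based variant after it is an alternative that is not part of the reply; it is included only because the nested form compares every pair of elements, which may be slow for large arrays.

```perl
use strict;

# Made-up contents: each element is one "testtype result resultval" triple.
my @array1 = ('17 1 12.3', '17 2 9', '18 1 17.2');
my @array2 = ('17 1 12.3', '17 2 9');

# Nested grep: keep each element of @array2 that equals some element of
# @array1. The inner grep is evaluated in scalar context, so it yields a
# count of matches, which the outer grep treats as true or false.
my @common = grep { my $x = $_; grep { $_ eq $x } @array1 } @array2;

# Alternative (an assumption, not from the reply): build a lookup hash
# once, then test each element of @array2 against it.
my %seen = map { $_ => 1 } @array1;
my @common_fast = grep { $seen{$_} } @array2;

print join(' | ', @common), "\n";
```

If @common has as many elements as @array2, every result in @array2 also appears in @array1.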