in reply to Find common substrings

Super quick rant: I oppose computer science professors who use Perl to teach fundamental programming skills. Students get confused among data structures, algorithms, and decomposing operations into functions. Then they receive an assignment like this one, which could be solved in minutes, and labor over it for hours. The result? They learn to hate Perl when they should love it. Let them hate assembly code. If you are a professor and are reading this, I recommend providing partial solutions instead.

If this is not homework, I apologize, but, hey, it looks like homework. Anyway, a previous poster said the following:

ALGORITHM: find_common
INPUT:     list of files (file_list)
DECLARE:   hash of IDs (id_hash)
CALL:      get_data WITH references to file_list and id_hash
CALL:      get_common WITH reference to id_hash
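For what it is worth, that decomposition might be sketched in Perl along these lines. The subroutine bodies and file names below are stubs of my own invention, not the previous poster's code; they only show the shape of the proposed design.

use strict;
use warnings;

# Stub: read each file and store its records in a hash keyed by ID.
sub get_data {
    my ($file_list, $id_hash) = @_;    # array ref and hash ref, per the pseudocode
    for my $file (@$file_list) {
        # open $file, parse each "IDn:..." line, push onto $id_hash->{$id}
    }
    return;
}

# Stub: walk the hash and compare whatever get_data stored for each ID.
sub get_common {
    my ($id_hash) = @_;
    for my $id (sort keys %$id_hash) {
        # compare the records collected for $id across the files
    }
    return;
}

my @file_list = ('file1.txt', 'file2.txt');    # made-up names
my %id_hash;
get_data(\@file_list, \%id_hash);
get_common(\%id_hash);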

My reading of the problem: something takes in two files (exactly two, not a list of arbitrary length), compares the records in the second file against the matching records in the first, and reports the values from the first that are not identical in the second, in reverse order of appearance (notice in the output how 6t6v_A appears first when in fact it comes second reading left to right in the source file).

Without a succinct and testable understanding of the problem you cannot, and should not, try to build an algorithm to solve it. I am not convinced I need a hash of IDs at all. Why not use two arrays, one for the source data and another for the test data?

#!/usr/bin/perl -w
use strict;

my $first = 0;
my ($keynum, @stans, @procs, @resultBucket) = (undef, (), (), ());
my $usefulRegex = qr~ID([^\:]+):(.*)$~;

while (<DATA>) {
    ++$first && next if (m/^FILE/ || (m/^\s*$/));
    if ($first >= 2) {
        # processing the data
        if (m/$usefulRegex/) {
            $procs[$1] = $2;
            if ($stans[$1] ne $procs[$1]) {
                print qq~$1 lines are different\n~;
            }
        }
    }
    else {
        # storing the core for comparison
        if (m/$usefulRegex/) {
            $stans[$1] = $2;
        }
    }
    print;
}
print qq~first: $first\n~;

__DATA__
FILE1
ID1:6qq5_A|14~~6qq5_B|14~~6qq6_A|14~~6qq6_B|14~~6t6v_A|14
ID2:7d5p_A|14~~7d5q_A|14
FILE2
ID1:6qq5_A|15~~6qq5_B|15~~6qq6_A|14~~6qq6_B|15~~6t6v_A|14
ID2:7d5p_A|14~~7d5q_A|12

Note that the above code does not solve your problem. It solves an interesting related problem, namely: between FILE1 and FILE2, are the individual records the same or different? Well, before any parsing they are not records at all; they are just strings, and a simple string comparison gives the answer.

The 'secret sauce' of computer science is solving problems by solving sub-problems first. If two corresponding lines are exactly the same, there is no additional work to do: I only want the differences. With so few elements the time saved may not matter, but with lots of data you might think twice before jumping straight into splitting strings into records when all you want to isolate are the differences.
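To make that concrete, here is a minimal sketch of the "compare the whole lines first" idea. It assumes the two files have already been read into a pair of parallel arrays; the names @src and @test and the sample data are mine, not part of the code above.

use strict;
use warnings;

# Sample lines standing in for the contents of the two files.
my @src  = ('ID1:6qq5_A|14~~6t6v_A|14', 'ID2:7d5p_A|14~~7d5q_A|14');
my @test = ('ID1:6qq5_A|15~~6t6v_A|14', 'ID2:7d5p_A|14~~7d5q_A|14');

for my $i (0 .. $#src) {
    next if $src[$i] eq $test[$i];    # identical strings: no further work
    # Only a line that differs is worth splitting into records later.
    print qq~line $i differs and is worth a closer look\n~;
}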

In my code you might immediately notice that the two branches process the records identically, so there is no need for two separate matches that split a line into its ID and record data. You could do the match once, after eliminating the skipped lines, and assign to the source array only when the flag ($first) says to, or you could use this as an opportunity to implement functions that take alternate inputs.
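One way to act on that, sketched against the same variables as the code above ($first, @stans, @procs, $usefulRegex), is to do the match once and let the flag pick where the data goes. The sample line and the pre-seeded @stans entry here are mine, just to make the fragment runnable on its own.

use strict;
use warnings;

my $first = 2;                         # pretend we are already in FILE2
my (@stans, @procs);
$stans[1] = '6qq5_A|14~~6t6v_A|14';    # stored earlier from FILE1
my $usefulRegex = qr~ID([^\:]+):(.*)$~;

for ('ID1:6qq5_A|15~~6t6v_A|14') {     # one FILE2 line as a stand-in
    if (m/$usefulRegex/) {
        my ($id, $rec) = ($1, $2);     # match once, then decide
        if ($first >= 2) {
            $procs[$id] = $rec;
            print qq~$id lines are different\n~ if $stans[$id] ne $procs[$id];
        }
        else {
            $stans[$id] = $rec;
        }
    }
}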

A further optimization: you do not really need the regex match at all (a dead giveaway is the .* at the end of the pattern):

perl -e '$aline = qq~foo1: fie fo fum~; my ($aid,$rec) = split /:/,$aline; $aid=~s/^[a-z]+//; print qq~$aid...$rec\n~;'
1... fie fo fum

Wrap Up

The fun part of Perl is solving problems in different ways that seem elegant or 'right' to you but still solve the problem. The latter is the important part. It is perfectly acceptable to produce a piece of crud in your first pass so long as you can prove the result is correct. Then you can concentrate on writing far more elegant, succinct, and 'module quality' Perl code.

Celebrate Intellectual Diversity