Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks!
Perl newbie here, and I am faced with the following problem:
I have 2 files with lines like the following:
FILE1
ID1:6qq5_A|14~~6qq5_B|14~~6qq6_A|14~~6qq6_B|14~~6t6v_A|14
ID2:7d5p_A|14~~7d5q_A|14

FILE2
ID1:6qq5_A|15~~6qq5_B|15~~6qq6_A|14~~6qq6_B|15~~6t6v_A|14
ID2:7d5p_A|14~~7d5q_A|12

Basically, there are common IDs, like ID1 and ID2, and then, next to each of them, a series of strings like 6t6v_A|14 separated by ~~. What I need to do is check, for each of the IDs (in this example ID1 and ID2), which of these smaller strings are the same in both files. In my example, the desired output would be:
ID1:6t6v_A|14
ID1:6qq6_A|14
ID2:7d5p_A|14

because these are the only common strings for each of the two IDs. What I am planning to do is create a structure like an AoA and compare each of the elements, sequentially, to see which ones are the same. Is there maybe a faster way to do this task? Any pointers/help would be greatly appreciated!

Replies are listed 'Best First'.
Re: Find common substrings
by kcott (Archbishop) on Apr 29, 2023 at 05:23 UTC
    "Perl newbie here, ..."

    Start by reading "perlintro: Perl introduction for beginners".

    "... and I am faced with the following problem: ..."

    As already pointed out, PerlMonks is not a code writing service. See also: "How (Not) To Ask A Question: Do Your Own Work".

    This was an interesting little problem, so I coded a solution that produced your "desired output". Here's the pseudocode for my solution:

    ALGORITHM: find_common
        INPUT: list of files (file_list)
        DECLARE: hash of IDs (id_hash)
        CALL: get_data WITH references to file_list and id_hash
        CALL: get_common WITH reference to id_hash

    FUNCTION: get_data
        INPUT: file_list and id_hash
        LOOP: file_list EACH file
            OPEN: file
            LOOP: read line EACH line
                remove line terminator
                split line INTO id and string
                LOOP: split string EACH substring
                    increment id_hash->id->substring value
            CLOSE: file
        RETURN: void

    FUNCTION: get_common
        INPUT: id_hash
        LOOP: id_hash EACH id
            LOOP: id_hash->id EACH substring
                SKIP: id_hash->id->substring value < 2
                PRINT: id : id_hash->id->substring key
        RETURN: void

    If you make an attempt at implementing that in Perl, I'll gladly review your effort and show you mine. There would be many ways you could code this; but do note that there is nothing complicated in the logic — if you read perlintro, you should have all of the tools necessary for this task.
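
    For reference, a minimal Perl rendering of that pseudocode might look something like the sketch below. It is shown for illustration only and is not necessarily the implementation referred to above; the file names come from the original post and the subroutine names mirror the pseudocode.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch only: assumes each data line looks like "ID:value~~value~~..."
    my @file_list = ('FILE1', 'FILE2');
    my %id_hash;

    get_data(\@file_list, \%id_hash);
    get_common(\%id_hash);

    sub get_data {
        my ($files, $ids) = @_;
        for my $file (@$files) {
            open my $fh, '<', $file or die "Can't open $file: $!";
            while (my $line = <$fh>) {
                chomp $line;
                my ($id, $string) = split /:/, $line, 2;
                # count how many files each value was seen in for this ID
                $ids->{$id}{$_}++ for split /~~/, $string;
            }
            close $fh;
        }
        return;
    }

    sub get_common {
        my ($ids) = @_;
        for my $id (sort keys %$ids) {
            for my $substring (sort keys %{ $ids->{$id} }) {
                next if $ids->{$id}{$substring} < 2;   # skip values seen in only one file
                print "$id:$substring\n";
            }
        }
        return;
    }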

    — Ken

Re: Find common substrings
by Fletch (Bishop) on Apr 29, 2023 at 00:44 UTC

    The String::LCSS_XS module may be a good starting place. Using “super search” for “longest common substring” should also turn up possibly useful prior threads.
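
    For what it's worth, the module's longest-common-substring interface might be used something like the sketch below. The lcss import is based on the module's documented interface, so double-check against its POD; note also that the OP's values are whole ~~-separated fields, so an exact-match approach may fit better than a longest-common-substring one.

    use strict;
    use warnings;
    use String::LCSS_XS qw(lcss);   # assumed export; see the module's documentation

    my $s1 = '6qq5_A|14~~6qq5_B|14~~6qq6_A|14';
    my $s2 = '6qq5_A|15~~6qq5_B|15~~6qq6_A|14';

    # In scalar context lcss() should return the longest common substring
    # of the two arguments (or undef if the strings share nothing).
    my $longest = lcss($s1, $s2);
    print "longest common substring: $longest\n" if defined $longest;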

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Re: Find common substrings
by hv (Prior) on Apr 29, 2023 at 01:42 UTC

    If I correctly understand your problem, you want to find cases where two complete values are the same (the things separated by "~~"), not just any old substring.

    To find "a value I've seen before", the quickest way to do it is usually to have that value set as the key of a hash; in this case you additionally want it keyed on the ID, so you probably want to build a hash-of-hashes structure something like:

    my %values_by_id;

    # for each line from the first file:
    my ($id1, $values_string1) = split /:/, $line1;
    for my $value1 (split /~~/, $values_string1) {
        # set each unique value to be true in the hash
        $values_by_id{$id1}{$value1} = 1;
    }

    # and then for a line from the second file:
    my ($id2, $values_string2) = split /:/, $line2;
    my @shared_values;
    for my $value2 (split /~~/, $values_string2) {
        push @shared_values, $value2 if $values_by_id{$id2}{$value2};
    }

    Hope this helps.

Re: Find common substrings
by eyepopslikeamosquito (Archbishop) on Apr 29, 2023 at 01:06 UTC

    Fletch has given you a nice hint to get you started.

    Please understand that Perl Monks is not a code writing service. Quoting from Gramps' reply to one of many similar recent questions:

    I know you are hoping that we might just write the whole thing for you, but you should at least show some effort. Meet us part way by showing a first attempt at solving the problem with comments indicating where you can't figure stuff out.

    See also: How do I post a question effectively?

Re: Find common substrings
by hippo (Archbishop) on Apr 29, 2023 at 11:27 UTC
    Is there maybe a faster way to do this task?

    Probably, but that depends on precisely the problem at hand. E.g. every field in the example given is the same length: is that a coincidence, or is it really a feature of the dataset? Such details can be very important.

    Another question is: does it matter if there's a faster way? Everybody likes their code to run faster, but if a solution you come up with completes in a short enough time, is there any reason to look for something faster, other than purely for the knowledge? It's an admirable trait, but you can fall into the trap of over-analysing things when almost any approach might be fast enough.

    As you are new to Perl, I would advise you to try the approach which you have suggested (which sounds fine, by the way), see if you can code it up and then see how fast it runs. If you have 2 files with maybe 100k records in each, it should run plenty fast enough (a few seconds at the very most on modern hardware); if it takes longer than that, then by all means come back, show your code and ask for more specific help.

    Good luck with your task.


    🦛

Re: Find common substrings
by tybalt89 (Monsignor) on Apr 29, 2023 at 15:16 UTC

    Some elements from the following program might be useful for solving your problem.

    #!/usr/bin/perl
    use strict; # https://perlmonks.org/?node_id=11151901
    use warnings;

    use Path::Tiny;

    # read both files into one string and scan it line by line
    local $_ = path('FILE1')->slurp . path('FILE2')->slurp;

    my %seen;
    # $1 is the "ID1:" prefix, $2 the ~~-separated values;
    # a value is printed the second time it is seen for a given ID
    while ( /^(.*:)(.*)/gm ) {
        $seen{$1}{$_}++ and print "$1$_\n" for split /~~/, $2;
    }

Re: Find common substrings
by InfiniteSilence (Curate) on Apr 30, 2023 at 15:34 UTC

    Super quick rant: I oppose computer science professors who try to use Perl to teach fundamental programming skills. Students get confused between data structures, algorithms, and decomposing operations into functions. Then they receive an assignment like this that could be solved in minutes, only to labor over it for hours. The result? They learn to hate Perl when they should actually love it. Let them hate assembly code. If you are a professor and are reading this, I recommend providing partial solutions instead.

    If this is not homework I apologize but, hey, it looks like homework. Anyway, a previous poster said the following,

    ALGORITHM: find_common
        INPUT: list of files (file_list)
        DECLARE: hash of IDs (id_hash)
        CALL: get_data WITH references to file_list and id_hash
        CALL: get_common WITH reference to id_hash

    Something takes in two files (exactly two, not a list of arbitrary length), compares the records in the second file with the matching records in the first file, and reports the values from the first that are identical in the second, in reverse order of appearance (notice in the output how 6t6v_A appears first when in fact it is second when reading from left to right in the source file).

    Without a succinct and testable understanding of the problem you cannot, and should not, endeavor to build an algorithm to solve it. I do not know that I require a hash of IDs... why not use two arrays, one for the source data and another for the test data?

    #!/usr/bin/perl -w
    use strict;

    my $first = 0;
    my ($keynum, @stans, @procs, @resultBucket) = (undef, (), (), ());
    my $usefulRegex = qr~ID([^\:]+):(.*)$~;

    while (<DATA>) {
        ++$first && next if (m/^FILE/ || (m/^\s*$/));
        if ($first >= 2) {   # processing the data
            if (m/$usefulRegex/) {
                $procs[$1] = $2;
                if ($stans[$1] ne $procs[$1]) {
                    print qq~$1 lines are different\n~;
                }
            }
        } else {             # storing the core for comparison
            if (m/$usefulRegex/) {
                $stans[$1] = $2;
            }
        }
        print;
    }
    print qq~first: $first\n~;

    __DATA__
    FILE1
    ID1:6qq5_A|14~~6qq5_B|14~~6qq6_A|14~~6qq6_B|14~~6t6v_A|14
    ID2:7d5p_A|14~~7d5q_A|14
    FILE2
    ID1:6qq5_A|15~~6qq5_B|15~~6qq6_A|14~~6qq6_B|15~~6t6v_A|14
    ID2:7d5p_A|14~~7d5q_A|12

    Note that the above code does not solve your problem. It solves an interesting related problem which is this: Between file1 and file2 are the individual records the same or different? Well, originally they are not records at all -- they are strings and a simple string comparison gives the answer.

    The 'secret sauce' of computer science is to solve problems by solving sub-problems first. If the two lines are exactly the same, there is no additional work to do: I only want differences. For so few elements the time difference might not be worth it, but with lots of data you might consider not jumping right into splitting strings into records when all you want to isolate are differences.

    In my code you might immediately notice that we are processing the records identically, so there is no need to have two separate places that split the line into id and record data. You could do it once, after eliminating the skipped lines, and assign the source data when the flag ($first) says to do so, or you could use this as an opportunity to implement functions that take alternate inputs.
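
    To make both points concrete, here is a hypothetical sketch: the parse_record helper and the positional value-by-value diff are illustrative additions, and, like the code above, it solves the "report the differences" variant rather than the original question. The line is parsed in exactly one place, and values are only split apart once the whole-line comparison says the two records actually differ.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $usefulRegex = qr~ID([^:]+):(.*)$~;

    # hypothetical helper: the one and only place a data line is parsed
    sub parse_record {
        my ($line) = @_;
        return $line =~ $usefulRegex ? ($1, $2) : ();
    }

    my $first = 0;
    my (@stans, @procs);
    while (<DATA>) {
        ++$first && next if /^FILE/ || /^\s*$/;
        my ($id, $rest) = parse_record($_) or next;
        ($first >= 2 ? $procs[$id] : $stans[$id]) = $rest;   # the flag picks the target array
    }

    for my $id (1 .. $#stans) {
        next if $stans[$id] eq $procs[$id];   # whole-line short-circuit: identical records need no work
        my @a = split /~~/, $stans[$id];
        my @b = split /~~/, $procs[$id];
        for my $i (0 .. $#a) {                # assumes values line up positionally, as in the sample data
            print "ID$id: $a[$i] -> $b[$i]\n" if $a[$i] ne $b[$i];
        }
    }
    __DATA__
    FILE1
    ID1:6qq5_A|14~~6qq5_B|14~~6qq6_A|14~~6qq6_B|14~~6t6v_A|14
    ID2:7d5p_A|14~~7d5q_A|14
    FILE2
    ID1:6qq5_A|15~~6qq5_B|15~~6qq6_A|14~~6qq6_B|15~~6t6v_A|14
    ID2:7d5p_A|14~~7d5q_A|12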

    A further optimization is that you really don't need the regex match at all (a dead giveaway is the .* in the pattern):

    perl -e '$aline = qq~foo1: fie fo fum~; my ($aid,$rec) = split /:/,$aline; $aid=~s/^[a-z]+//; print qq~$aid...$rec\n~;'
    1... fie fo fum

    Wrap Up

    The fun part of Perl is solving problems in different ways that seem elegant or 'right' to you but still solve the problem. The latter is the important part. It is perfectly acceptable to produce a piece of crud in your first pass so long as you can prove the result is correct. Then you can concentrate on writing far more elegant, succinct, and 'module quality' Perl code.

    Celebrate Intellectual Diversity