CPAN Module to determing overlap of 2 lists?

wazat has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: CPAN Module to determing overlap of 2 lists? by choroba (Cardinal) on Aug 10, 2020 at 22:20 UTC
When multiple solutions are possible, which one do you prefer? E.g. `@list1 = qw( a b c d c ); @list2 = qw( c d c x ); # or qw( c d c x );` [download] `map{substr$_->[0],$_->[1]\|\|0,1}[\\|\|{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^ARGV,3]`	[reply] [d/l] [select]
Re^2: CPAN Module to determing overlap of 2 lists? by wazat (Monk) on Aug 10, 2020 at 22:42 UTC
Good point, my problem statement is loose. I want the maximum. `@list1 = qw( a b c d c ); @list2 = qw( c d c x ); # 3` [download]	[reply] [d/l]
Re^3: CPAN Module to determing overlap of 2 lists? by AnomalousMonk (Archbishop) on Aug 10, 2020 at 23:58 UTC
Perhaps update your OP (see How do I change/delete my post?) to reflect the refined requirement? Give a man a fish: `<%-{-{-{-<`	[reply] [d/l]
Re: CPAN Module to determing overlap of 2 lists? by tybalt89 (Monsignor) on Aug 11, 2020 at 01:30 UTC
`#!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11120564 use warnings; my $file1 = <<END; one two three four five six END my $file2 = <<END; two three four five six seven END my $marker = '*MARKER'; # something not in either string my $combine = "$file1$marker$file2" =~ s/(.)\K\Q$marker\E\1//sr; print $combine;` [download] Outputs: `one two three four five six seven` [download]	[reply] [d/l] [select]
Re^2: CPAN Module to determing overlap of 2 lists? by wazat (Monk) on Aug 11, 2020 at 19:03 UTC
I hadn't thought much about a regex solution. To ensure the match start is a complete line, requires a small tweak. `my $combine = "$file1$marker$file2" =~ s/(?:\A\|\n)(.*)\K\Q$marker\E\1/ +/sr;` [download]	[reply] [d/l]
Re^3: CPAN Module to determing overlap of 2 lists? by LanX (Saint) on Aug 12, 2020 at 18:27 UTC
3 suggestions you don't need complete lines to make it work, but anchoring to line start might prove to be faster I'd include characters below ASCII 8 to the "marker" to play safe, see also discussion surrounding the similar `$;` you might be interested to check with `re` "debug" , how the backtracking of the `.*` submatch performs. I'd guess you prefer it to grow from right to left instead of shrinking from left to right. I know the regex engine can do this depending on the anchors. I haven't checked the last point since performance might not be your biggest issue. Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery}	[reply] [d/l]
Re^4: CPAN Module to determing overlap of 2 lists? (updated) by LanX (Saint) on Aug 12, 2020 at 21:08 UTC
Re^4: CPAN Module to determing overlap of 2 lists? by wazat (Monk) on Aug 13, 2020 at 00:50 UTC
Re^5: CPAN Module to determing overlap of 2 lists? by LanX (Saint) on Aug 13, 2020 at 14:15 UTC
Re: CPAN Module to determing overlap of 2 lists? by kcott (Archbishop) on Aug 11, 2020 at 07:58 UTC
G'day wazat, Here's an implementation of `&list_overlap`. You could add it to a module if that would be useful to you. #!/usr/bin/env perl use strict; use warnings; use constant { REF => 0, DATA => 1, EXP => 2, NAME => 3, JOIN => $;, }; use Test::More; my @ref = qw{a b c d c}; my @same = qw{a b c d c}; my @long = qw{c d c x y z}; my @short = qw{c x y z}; my @none = qw{x y z}; my @tests = ( [ [@ref], [@same], 5, 'Same' ], [ [@ref], [@long], 3, 'Long' ], [ [@ref], [@short], 1, 'Short' ], [ [@ref], [@none], 0, 'None' ], ); plan tests => 0+@tests; for my $test (@tests) { my $got = list_overlap($test->[REF], $test->[DATA]); is($got, $test->[EXP], $test->[NAME]); } sub list_overlap { my ($ref, $data) = @_; my $got = 0; my ($ref_len, $data_len) = (0+@$ref, 0+@$data); my $start = $ref_len > $data_len ? $ref_len - $data_len : 0; for my $i ($start .. $ref_len - 1) { if (join(JOIN, @{$ref}[$i .. $#$ref]) eq join(JOIN, @{$data}[0 .. $#$ref - $i]) ) { $got = 1 + $#$ref - $i; last; } } return $got; } [download] I find $; useful for this type of `join` because it's rarely used elsewhere. Pick something else if that's not appropriate. Output: `1..4 ok 1 - Same ok 2 - Long ok 3 - Short ok 4 - None` [download] — Ken	[reply] [d/l] [select]
Re^2: CPAN Module to determing overlap of 2 lists? by wazat (Monk) on Aug 11, 2020 at 19:27 UTC
That works. It feels a bit resource intensive given the repeated temporary strings created in the joins.	[reply]
Re^3: CPAN Module to determing overlap of 2 lists? by kcott (Archbishop) on Aug 12, 2020 at 08:09 UTC
I've generally found that Perl's functions that operate on strings (`join`, `index`, etc.) are typically very fast. You can compare with other solutions using Benchmark; you can profile to find hotspots (e.g. Devel::NYTProf). I can see a number of micro-optimisations you could make: replace `$ref_len - 1` with `$#$ref` and replace `1 + $#$ref` with `$ref_len`. That may gain you absolutely nothing but, I suppose, the code would be a little neater. You could do a single `join`. Then repeatedly chop the front off (`index` to locate `JOIN` then `substr`); again, that may gain you nothing but perhaps worth a try. I, and no doubt others, would be interested in the results if you do decide to benchmark this against your current "brute force solution", a regex solution, and anything else you may have come up with. — Ken	[reply] [d/l] [select]
Re^4: CPAN Module to determing overlap of 2 lists? by wazat (Monk) on Aug 13, 2020 at 01:03 UTC
Re: CPAN Module to determing overlap of 2 lists? by LanX (Saint) on Aug 10, 2020 at 23:23 UTC
Hint: you could just slurp the two files in a variable each and let `index` do the heavy lifting... :) Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery}	[reply]
Re^2: CPAN Module to determing overlap of 2 lists? by wazat (Monk) on Aug 11, 2020 at 00:22 UTC
Yes, index could be more efficient. I'd need some logic to adjust the size of the window I'm trying to match, but it could work. I might also look at rindex	[reply]
Re: CPAN Module to determing overlap of 2 lists? by Cristoforo (Curate) on Aug 10, 2020 at 23:35 UTC
Set::Scalar is what you want I think.	[reply]
Re^2: CPAN Module to determing overlap of 2 lists? by wazat (Monk) on Aug 11, 2020 at 00:06 UTC
I'm not sure. Looking at Set::Scalar, I assume the set elements are unique (i.e. (a,b) insert a is still (a,b)), and that the elements are unordered (i.e. (a,b) = (b,a)). I my case the elements are ordered and duplicates are allowed. Have I misunderstood Set::Scalar ?	[reply]
Re: CPAN Module to determing overlap of 2 lists? by perlfan (Parson) on Aug 11, 2020 at 02:51 UTC
I don't know if this problem has a name. It's a constrained variation of the longest common subsequence problem. Anything you do can be easily defeated unless you set some parameters since it's not merely as simple as the trivial example you describe above. For example, what if on two consecutive days you issue the same exact commands. You'd end up with 100% overlap. What is it you're trying to do because your problem description doesn't seem to match up realistically with your contrived overlap example. And if the files are differentiated by date already, what's the point of even trying to match them up like this? Update: Using your contrived example where there is true overlap and no repeated elements out of order, then you could iterate over both files in order, each line becomes a key in a Tie::Hash::Indexed tie'd hash (`++`'d to maintain count). Then all the keys with values of `2` can be taken as the area of overlap. But as I said above, this will not work if there are any repeated lines above or below. So it won't work for you. I do think your problem is ill defined and ambigous. Even your second example has more than one solution based on what you want.	[reply] [d/l] [select]
Re^2: CPAN Module to determing overlap of 2 lists? by wazat (Monk) on Aug 11, 2020 at 19:54 UTC
Hmmm, seems more like a constrained variation of Longest common substring problem. Confusion probably due to looseness of my description. I agree that the combined files aren't guaranteed to represent the true sequence of commands. The example you provided is one case of aliasing that demonstrates this. Even more problematic is that multiple interactive shells will write to the same history file on exit. The dates of all the commands in the history file are not known, only the upper bound. The file is a rolling window of the entered commands. If fewer commands are entered than the hist file size, the oldest commands are are eliminated from the file.	[reply]
Re: CPAN Module to determing overlap of 2 lists? by Anonymous Monk on Aug 11, 2020 at 10:14 UTC
Yes, if I set HISTFILESIZE to -1, then bash would maintain an unlimited history file I wish I'd known that, I just keep making the number bigger! The following doesn't help with the current problem, but: this will be a trivial problem in the future if you add `HISTTIMEFORMAT='%Y%m%d-%H%M%S '` to ~/.bashrc, or to /etc/bash.bashrc. Then start a new terminal session and issue a few commands, then issue `history \| tail` and you'll see what I mean. You'll have to do something like `history \| sort -k 2 \| grep 'what you want to see'`, as somehow things are still out of sequence. I think that happens to me because I generally have 3 or more terminal windows open, doing things in each, and the history file gets written (I think) when the bash session ends,	[reply] [d/l] [select]
Re^2: CPAN Module to determing overlap of 2 lists? by wazat (Monk) on Aug 11, 2020 at 20:15 UTC
I never looked at HISTTIMEFORMAT. If I set HISTTIMEFORMAT, bash captures the time in the history file as what appears to be epoch seconds. This would definitely speed things up. `#1597175957 echo $$ #1597175963 echo $HOME #1597175967 ls -la #1597176017 ls #1597176017 ls #1597176018 ls` [download] As you mention, the history file gets written when the shell exits. I too usually have multiple shells at the same time. I guess if we need to separate different sessions we could set HISTFILE to a distinct file name in .bashrc	[reply] [d/l]