Re: CPAN Module to determing overlap of 2 lists?
by choroba (Cardinal) on Aug 10, 2020 at 22:20 UTC
|
When multiple solutions are possible, which one do you prefer?
E.g.
@list1 = qw( a b c d c );
@list2 = qw( c d c x );
# or qw( c d c x );
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
| [reply] [d/l] [select] |
|
|
@list1 = qw( a b c d c );
@list2 = qw( c d c x );
# 3
| [reply] [d/l] |
|
|
| [reply] [d/l] |
Re: CPAN Module to determing overlap of 2 lists?
by tybalt89 (Monsignor) on Aug 11, 2020 at 01:30 UTC
|
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11120564
use warnings;
my $file1 = <<END;
one
two
three
four
five
six
END
my $file2 = <<END;
two
three
four
five
six
seven
END
my $marker = '***MARKER***'; # something not in either string
my $combine = "$file1$marker$file2" =~ s/(.*)\K\Q$marker\E\1//sr;
print $combine;
Outputs:
one
two
three
four
five
six
seven
| [reply] [d/l] [select] |
|
|
I hadn't thought much about a regex solution.
To ensure the match start is a complete line, requires a small tweak.
my $combine = "$file1$marker$file2" =~ s/(?:\A|\n)(.*)\K\Q$marker\E\1/
+/sr;
| [reply] [d/l] |
|
|
3 suggestions
- you don't need complete lines to make it work, but anchoring to line start might prove to be faster
- I'd include characters below ASCII 8 to the "marker" to play safe, see also discussion surrounding the similar $;
- you might be interested to check with re "debug" , how the backtracking of the .* submatch performs. I'd guess you prefer it to grow from right to left instead of shrinking from left to right. I know the regex engine can do this depending on the anchors.
I haven't checked the last point since performance might not be your biggest issue.
| [reply] [d/l] |
|
|
|
|
|
|
Re: CPAN Module to determing overlap of 2 lists?
by kcott (Archbishop) on Aug 11, 2020 at 07:58 UTC
|
G'day wazat,
Here's an implementation of &list_overlap.
You could add it to a module if that would be useful to you.
#!/usr/bin/env perl
use strict;
use warnings;
use constant {
REF => 0,
DATA => 1,
EXP => 2,
NAME => 3,
JOIN => $;,
};
use Test::More;
my @ref = qw{a b c d c};
my @same = qw{a b c d c};
my @long = qw{c d c x y z};
my @short = qw{c x y z};
my @none = qw{x y z};
my @tests = (
[ [@ref], [@same], 5, 'Same' ],
[ [@ref], [@long], 3, 'Long' ],
[ [@ref], [@short], 1, 'Short' ],
[ [@ref], [@none], 0, 'None' ],
);
plan tests => 0+@tests;
for my $test (@tests) {
my $got = list_overlap($test->[REF], $test->[DATA]);
is($got, $test->[EXP], $test->[NAME]);
}
sub list_overlap {
my ($ref, $data) = @_;
my $got = 0;
my ($ref_len, $data_len) = (0+@$ref, 0+@$data);
my $start = $ref_len > $data_len ? $ref_len - $data_len : 0;
for my $i ($start .. $ref_len - 1) {
if (join(JOIN, @{$ref}[$i .. $#$ref])
eq join(JOIN, @{$data}[0 .. $#$ref - $i])
) {
$got = 1 + $#$ref - $i;
last;
}
}
return $got;
}
I find $; useful for this type of join
because it's rarely used elsewhere.
Pick something else if that's not appropriate.
Output:
1..4
ok 1 - Same
ok 2 - Long
ok 3 - Short
ok 4 - None
| [reply] [d/l] [select] |
|
|
| [reply] |
|
|
I've generally found that Perl's functions that operate on strings (join, index, etc.) are typically very fast.
You can compare with other solutions using Benchmark;
you can profile to find hotspots (e.g. Devel::NYTProf).
I can see a number of micro-optimisations you could make: replace $ref_len - 1 with $#$ref
and replace 1 + $#$ref with $ref_len.
That may gain you absolutely nothing but, I suppose, the code would be a little neater.
You could do a single join.
Then repeatedly chop the front off (index to locate JOIN then substr);
again, that may gain you nothing but perhaps worth a try.
I, and no doubt others, would be interested in the results
if you do decide to benchmark this against your current "brute force solution", a regex solution,
and anything else you may have come up with.
| [reply] [d/l] [select] |
|
|
Re: CPAN Module to determing overlap of 2 lists?
by LanX (Saint) on Aug 10, 2020 at 23:23 UTC
|
Hint: you could just slurp the two files in a variable each and let index do the heavy lifting... :)
| [reply] |
|
|
Yes, index could be more efficient.
I'd need some logic to adjust the size of the window I'm trying to match, but it could work.
I might also look at rindex
| [reply] |
Re: CPAN Module to determing overlap of 2 lists?
by Cristoforo (Curate) on Aug 10, 2020 at 23:35 UTC
|
| [reply] |
|
|
I'm not sure.
Looking at Set::Scalar, I assume the set elements are unique (i.e. (a,b) insert a is still (a,b)), and that the elements are unordered (i.e. (a,b) = (b,a)).
I my case the elements are ordered and duplicates are allowed.
Have I misunderstood Set::Scalar ?
| [reply] |
Re: CPAN Module to determing overlap of 2 lists?
by perlfan (Parson) on Aug 11, 2020 at 02:51 UTC
|
I don't know if this problem has a name.
It's a constrained variation of the longest common subsequence problem.
Anything you do can be easily defeated unless you set some parameters since it's not merely as simple as the trivial example you describe above. For example, what if on two consecutive days you issue the same exact commands. You'd end up with 100% overlap. What is it you're trying to do because your problem description doesn't seem to match up realistically with your contrived overlap example.
And if the files are differentiated by date already, what's the point of even trying to match them up like this?
Update: Using your contrived example where there is true overlap and no repeated elements out of order, then you could iterate over both files in order, each line becomes a key in a Tie::Hash::Indexed tie'd hash (++'d to maintain count). Then all the keys with values of 2 can be taken as the area of overlap. But as I said above, this will not work if there are any repeated lines above or below. So it won't work for you. I do think your problem is ill defined and ambigous. Even your second example has more than one solution based on what you want. | [reply] [d/l] [select] |
|
|
Hmmm, seems more like a constrained variation of
Longest common substring problem. Confusion probably due to looseness of my description.
I agree that the combined files aren't guaranteed to represent the true sequence of commands. The example you provided is one case of aliasing that demonstrates this. Even more problematic is that multiple interactive shells will write to the same history file on exit.
The dates of all the commands in the history file are not known, only the upper bound. The file is a rolling window of the entered commands. If fewer commands are entered than the hist file size, the oldest commands are are eliminated from the file.
| [reply] |
Re: CPAN Module to determing overlap of 2 lists?
by Anonymous Monk on Aug 11, 2020 at 10:14 UTC
|
Yes, if I set HISTFILESIZE to -1, then bash would maintain an unlimited history file
I wish I'd known that, I just keep making the number bigger!
The following doesn't help with the current problem, but: this will be a trivial problem in the future if you add HISTTIMEFORMAT='%Y%m%d-%H%M%S ' to ~/.bashrc, or to /etc/bash.bashrc. Then start a new terminal session and issue a few commands, then issue history | tail and you'll see what I mean. You'll have to do something like history | sort -k 2 | grep 'what you want to see', as somehow things are still out of sequence. I think that happens to me because I generally have 3 or more terminal windows open, doing things in each, and the history file gets written (I think) when the bash session ends,
| [reply] [d/l] [select] |
|
|
I never looked at HISTTIMEFORMAT.
If I set HISTTIMEFORMAT, bash captures the time in the history file as what appears to be epoch seconds. This would definitely speed things up.
#1597175957
echo $$
#1597175963
echo $HOME
#1597175967
ls -la
#1597176017
ls
#1597176017
ls
#1597176018
ls
As you mention, the history file gets written when the shell exits. I too usually have multiple shells at the same time. I guess if we need to separate different sessions we could set HISTFILE to a distinct file name in .bashrc
| [reply] [d/l] |