Re: searching for strings
by BrowserUk (Patriarch) on Aug 06, 2007 at 11:01 UTC
|
Definitely one of those things that appears on the surface to be trivial, but turns out to be much harder:
Here's one way
#! perl -slw
use strict;
use Data::Dump qw[ pp ];
my %hash;
while( <DATA> ) {
chomp;
my( $pre, $temp ) = m[(^.+?)(\d+|\D)$];
$temp = $pre . ++$temp;
if( exists $hash{ $_ } ) {
push @{ $hash{ $_ } }, $_;
}
elsif( exists $hash{ $temp } ) {
push @{ $hash { $temp } }, $_;
}
else {
$hash{ $temp } = [ $_ ];
$hash{ $_ } = [ $_ ];
}
}
print join ';', @{ $_ }
for grep{ @{ $_ } == 2 } values %hash;
__DATA__
AAA30
BBC5
SHT12H
DAL33B
BBC49
AAA31
DAL33A
BBC6
SHT12G
BBC50
Output: C:\test>junk
DAL33B;DAL33A
AAA30;AAA31
BBC49;BBC50
BBC5;BBC6
SHT12H;SHT12G
If you need the doubled-up output, just print them twice with the elements reversed.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
|
|
BrowserUk,
I was trying to think of more edge cases and I think I have found a few steph_bow needs to weigh in on.
For instance, what if a string is ABC0? Should ABC1 and ABC-1 be checked?
Also, what if the string is XYZ? Should XYY and XYAA be checked?
Is the alphabet circular - IOW, should AAA look at AAZ and AAB?
Cheers - L~R
Corrected XYZ -> XYAA rather than XYZA
| [reply] |
|
|
I have found a few steph_bow needs to weigh in on.
Agreed. That said, the test set provided is amazingly complete given how concise it is.
There is nothing accidental about either the choice of sample, or its ordering. A lot of thought went into its provision.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
|
|
Re: searching for strings
by shmem (Chancellor) on Aug 06, 2007 at 10:45 UTC
|
What are the tools to use ?
substr, ord, chr (or pack and unpack), hashes (see perldata), pre/post in/de-crement (for strings, only increment works), integer/string comparison (== and eq) (see perlop).
--shmem
_($_=" "x(1<<5)."?\n".q·/)Oo. G°\ /
/\_¯/(q /
---------------------------- \__(m.====·.(_("always off the crowd"))."·
");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
| [reply] |
Re: searching for strings
by moritz (Cardinal) on Aug 06, 2007 at 10:12 UTC
|
If the prefix (ie everything except the last character) is uniq, you could strip off the last character, and use the remaning string as a hash key.
In the hash you store a list of matching strings.
When you have created the hash, you should iterate through it and check if the +1/-1 condition holds.
| [reply] |
Re: searching for strings
by Anno (Deacon) on Aug 06, 2007 at 12:51 UTC
|
my (%num, %alpha);
while ( <DATA> ) {
chomp;
my ( $head, $tail) = /^(.+?)(\d+|[[:alpha:]])$/;
if ( $tail =~ /^\d+$/ ) {
$num{ $head}->{ $tail} = $_;
} else {
$alpha{ $head}->{ ord $tail} = $_;
}
}
my @pairs = (
map( extract_pairs( $_) => values %num),
map( extract_pairs( $_) => values %alpha),
);
print "$_\n" for @pairs;
exit;
sub extract_pairs {
my $h = shift;
my @pairs;
for ( keys %$h ) {
for my $partner ( $_ - 1, $_ + 1 ) {
if ( exists $h->{ $partner} ) {
push @pairs, "$h->{ $_};$h->{ $partner}";
delete @$h{ $_, $partner};
}
}
}
@pairs;
}
__DATA__
AAA30
BBC5
SHT12H
DAL33B
BBC49
AAA31
DAL33A
BBC6
SHT12G
BBC50
This code will ignore singletons for which a pairing partner cannot be found. If there are more than two consecutive pairs for a single head, say if you had
BBC7 besides BBC5 and BBC6 the bevavior is undefined. It will pick some pair(s) and may ignore others.
Anno | [reply] [d/l] [select] |
Re: searching for strings
by GrandFather (Saint) on Aug 06, 2007 at 19:01 UTC
|
use strict;
use warnings;
my %data;
my @reverse;
while (<DATA>) {
chomp;
my ($prefix, $suffix) = /^([A-Z0-9]+?)(\d+|.)$/;
if (exists $data{$prefix}) {
my $mapFunc = $suffix =~ /[A-Z]/ ? sub {ord $_[0]} : sub {$_[0
+]};
for (@{$data{$prefix}}) {
next if abs ($mapFunc->($suffix) - $mapFunc->($_)) != 1;
print "$prefix$_;$prefix$suffix\n";
push @reverse, "$prefix$suffix;$prefix$_\n";
}
}
push @{$data{$prefix}}, $suffix;
}
print @reverse;
__DATA__
AAA30
BBC5
SHT12H
DAL33B
BBC49
AAA31
DAL33A
BBC6
SHT12G
BBC50
Prints:
AAA30;AAA31
DAL33B;DAL33A
BBC5;BBC6
SHT12H;SHT12G
BBC49;BBC50
AAA31;AAA30
DAL33A;DAL33B
BBC6;BBC5
SHT12G;SHT12H
BBC50;BBC49
DWIM is Perl's answer to Gödel
| [reply] [d/l] [select] |
Re: searching for strings
by oha (Friar) on Aug 06, 2007 at 11:30 UTC
|
you can use the magic of ++:
my @d = qw(AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC
+50);
my $last = "";
foreach my $v ( sort {$a cmp $b} @d)
{
my $l1 = $last;
my $l2 = chop $l1;
print "$last,$v\n" if $v eq $l1.++$l2;
$last = $v;
}
| [reply] [d/l] |
|
|
C:\test>perl
my @d = qw(AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC
+50);
my $last = "";
foreach my $v ( sort {$a cmp $b} @d)
{
my $l1 = $last;
my $l2 = chop $l1;
print "$last,$v\n" if $v eq $l1.++$l2;
$last = $v;
}
^Z
AAA30,AAA31
DAL33A,DAL33B
SHT12G,SHT12H
BBC5;BBC6 and BBC49;BBC50 are missing. Working out why is interesting.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
|
|
right sorry! cauz i can't expect the sorting of BBC5 and BBC6 without BBC50 beetween.
the following works:
my @d = qw(AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC
+50);
my %seen;
map { print "$seen{$_},$_\n" if $seen{$_}; }
map { my $v = $_; my $l = chop $v; $seen{$v.++$l}=$_; } @d;
and nicer, imho :)
Oha | [reply] [d/l] |
|
|
Re: searching for strings
by CountZero (Bishop) on Aug 06, 2007 at 15:45 UTC
|
As a partial answer to your problem, here are two subroutines which give you the previous and next values:
use strict;
my $plus;
my $minus;
while (<DATA>) {
chomp;
my $next = next_seq($_);
my $previous = previous_seq($_);
print "$_, $next, $previous\n";
}
sub next_seq {
my $pattern = shift;
chomp $pattern;
if ($pattern =~ m/(.*?)(\d+)$/) {
return $1 . ($2 + 1);
}
else {
$pattern =~ m/(.*)(.)/;
return $1 . chr(ord($2) + 1);
}
}
sub previous_seq {
my $pattern = shift;
chomp $pattern;
if ($pattern =~ m/(.*?)(\d+)$/) {
return $1 . ($2 - 1);
}
else {
$pattern =~ m/(.*)(.)/;
return $1 . chr(ord($2) - 1);
}
}
__DATA__
AAA30
BBC5
SHT12H
DAL33B
BBC49
AAA31
DAL33A
BBC6
SHT12G
BBC50
ABCZ
Warning: I'm not sure it will work with unicode-strings and anyhow it will only take into account the last character (unless the pattern ends in a one or more digits). So @ precedes A and [ follows Z.Now all you have to do is go through the list and search for items which match the next or previous values.
CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
| [reply] [d/l] [select] |
Re: searching for strings
by shoness (Friar) on Aug 06, 2007 at 12:06 UTC
|
What do you do when the pattern you search for is at a boundary, taking either increment or decrement out of range? In your example above, you match near 'AAA30'. Do you only look for 'AAA30' and 'AAA31'? Do you also look for 'AAA29'?
If so, the hash-key idea won't work, because you're not only changing the last character.
You could do something like this for each of pattern:
my ($pat, $char) = $pattern =~ m/(.*)(.)$/;
my $char_up = $char;
my $char_down = $char;
++$char_up unless ($char =~ m/[9Z]/i);
--$char_down unless ($char =~ m/[0A]/i);
my $search = qr/$pat$char|$pat$char_up|$pat$char_down/;
print $search;
How many patterns do you have anyway? How many lines do you search? | [reply] [d/l] |
|
|
the -- decrement operator is not magical for non-digits.
CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
| [reply] [d/l] |
|
|
Sorry! How strange of Perl to be asymmetric! Why is "--" not as magical as "++"?
This will work instead...
$char_up =~ tr/A-Z0-9/B-ZZ1-99/;
$char_down =~ tr/A-Z0-9/AA-Y00-8/;
| [reply] [d/l] |
|
|
Re: searching for strings
by Zielony (Acolyte) on Aug 06, 2007 at 13:41 UTC
|
#!/usr/bin/perl
use strict;
sub simstr {
$_ = shift;
if (/([a-z])$/i) {
my $ch = $1;
$ch++;
$ch =~ s/.*(.)$/$1/;
substr ($_, -1) = $ch;
}
elsif (/\d+$/) {
$_++;
}
$_;
}
for (<DATA>) {
chomp;
print "$_;" . simstr ($_) . "\n";
}
__DATA__
AAA30
BBC5
SHT12H
DAL33B
BBC49
AAA31
DAL33A
BBC6
SHT12G
BBC50
| [reply] [d/l] |
|
|
Consider these two data elements: CBG99 and CBG100. When they are placed into your code, the match is CBG99;CBH00
Very fun problem here! :-)
-P
| [reply] [d/l] [select] |
|
|
#!/usr/bin/perl
use strict;
sub simstr {
$_ = shift;
/([a-z]|\d+)$/i;
my $ch = $1;
$ch++;
$ch =~ s/.*(.)$/$1/ if ($ch =~ /[a-z]/i);
substr ($_, -(length $1)) = $ch;
$_;
}
for (<DATA>) {
chomp;
print "$_;" . simstr ($_) . "\n";
}
__DATA__
AAA30
BBC5
SHT12H
DAL33B
BBC49
AAA31
DAL33A
BBC6
SHT12G
BBC50
CBG99
WXYZ
Now it works and isn't even a bit funny. ;-) | [reply] [d/l] |
Re: searching for strings
by ww (Archbishop) on Aug 06, 2007 at 15:34 UTC
|
The code below is probably inelegant and perhaps even wrongheaded
but I'm posting, in part, TIMTOWTDI; in part, to illustrate some of the well-taken issues with the likes of AAA29 vs AAA30 (above), and in part, because even though this is NOT a solution, $work calls:
#!usr/bin/perl
use strict;
use warnings;
use vars qw (@fields $field $seen);
while (<DATA>) {
chomp $_;
push @fields, $_;
}
for $field(@fields) {
$seen = $field;
our $base_seen = substr($seen, 0, -1); # all but last char of $se
+en
our $root = chop($seen); # get last char of $seen
+
test1($root, $base_seen);
}
#####subs
sub test1 { # Excluding the last char, test whether the f
+ields match
for $field(@fields) {
our $base_field = substr($field, 0, -1); # all but last char of
+$field
if ($main::base_seen eq $base_field) {
test2($main::root, $main::base_seen, $field);
}
}
}
sub test2 { # bases matched, now test last char for a matc
+h +/- 1
my $root = shift(@_);
my $t1 = (shift(@_) . $root);
my $test1 = shift(@_);
if ( $t1 eq $test1 ) {
# Returning, t1 is identical to test1 (haven't figured out how t
+o skip this comparison entirely)
return;
}
else {
# print "\nIn else of test2, ready to cmp: $t1, $test1\n";
my $rev_last_t1 = reverse($t1); # reverse, for e
+ase of m//
my $rev_last_test1 = reverse($test1);
#test "last" char which is now the initial char in $rev_last_t1
+via reverse
my $last_rlt1 = ord($rev_last_t1 =~ /(.)/); # numify alphas
+ for +/- below
my $last_test1 = ord($rev_last_test1 =~ /(.)/);
if ( $last_rlt1 == ( ($last_test1++) || ($last_test1--) ) ) {
print "same base and adjacent terminal chars: $t1;$test1\n";
}
}
}
# end
# one value added to OP's data:
__DATA__
AAA30
BBC5
SHT12H
DAL33B
BBC49
AAA31
DAL33A
BBC6
SHT12G
BBC50
AAA37
AAA29
OUTPUT:
same base and adjacent terminal chars: AAA30;AAA31
same base and adjacent terminal chars: AAA30;AAA37 # False positive
same base and adjacent terminal chars: BBC5;BBC6
same base and adjacent terminal chars: SHT12H;SHT12G
same base and adjacent terminal chars: DAL33B;DAL33A
same base and adjacent terminal chars: AAA31;AAA30
same base and adjacent terminal chars: AAA31;AAA37 # False positive
same base and adjacent terminal chars: DAL33A;DAL33B
same base and adjacent terminal chars: BBC6;BBC5
same base and adjacent terminal chars: SHT12G;SHT12H
same base and adjacent terminal chars: AAA37;AAA30 # False positive
same base and adjacent terminal chars: AAA37;AAA31 # False positive
NOTE1: Fails on AAA29 vs AAA30 and on BBC49 vs BBC50 for (the intended?) reading of OP's definition; if OP had specified "digit" instead of "number" this would not be a failure, but, as is, makes the problem challenging
NOTE2: Fails on AAA37 vs (AAA30 or AAA31) by any reading; this is a glitch in the code /me offers above.
NOTE3: Use any one of several mechanisms, including hashes, to eliminate dupes such as "AAA30;AAA31" and "AAA31;AAA30")
| [reply] [d/l] [select] |
Re: searching for strings
by wind (Priest) on Aug 06, 2007 at 20:56 UTC
|
There are two things that make this much easier. One, is to iterate over the data twice, the first time to set up your relationships, the second time to report their existance. And two, rely solely on incrementing +1, as it is the easier of the relationships to calculate because of the special magic of ++ on strings.
You still must define the characteristics of your boundary conditions though. Ideally, we want the increment and decrement relationships to be one to one. What is the increment of "Bar999"? If it is "Bar1000", then your relationship is no longer 1 to 1 as there is also "Bar0999".
Anyway, here is an implementation without the boundary conditions fully defined. I've added two data entries for lines that do not match:
use Fcntl qw(SEEK_SET);
use strict;
# Calculate Relationships.
# - Rely on increment, as it's the easier of the two to calculate.
my %decrement;
my %increment; # Synonymous with existance.
my $start_of_DATA = tell DATA;
while (<DATA>) {
chomp;
my $item = $_;
# Note concerning parsing
# - This regex requires that a prefix exist.
if ($item =~ m{ (\w+) ( (?<!\d)\d+ | (?<![A-Z])[A-Z]+ ) \z}x) {
my ($prefix, $suffix) = ($1, $2);
# Note: this is the primary spot where there might be changes
+in rules.
# - What happens when the character ends in 'Z'? Currently
# That would translate to 'AA'.
# - What happens when number is 999? Currently that would tra
+nslate
# to '1000'.
# Fix the rules here, and everything else will translate.
(my $suffix_next = $suffix)++;
my $item_next = $prefix . $suffix_next;
$increment{$item} = $item_next;
$decrement{$item_next} = $item;
} else {
die "Invalid data: $_";
}
}
# Reparse DATA
seek(DATA, $start_of_DATA, SEEK_SET);
while (<DATA>) {
chomp;
print;
if ($increment{ $decrement{ $_ } }) {
print ";$decrement{$_}";
} elsif ($increment{ $increment{ $_ } }) {
print ";$increment{$_}";
}
print "\n";
}
__DATA__
AAA30
BBC5
SHT12H
DAL33B
BBC49
AAA31
BBC8
BBC3
DAL33A
BBC6
SHT12G
BBC50
And the output is:
>perl scratch.pl
AAA30;AAA31
BBC5;BBC6
SHT12H;SHT12G
DAL33B;DAL33A
BBC49;BBC50
AAA31;AAA30
BBC8
BBC3
DAL33A;DAL33B
BBC6;BBC5
SHT12G;SHT12H
BBC50;BBC49
- Miller | [reply] [d/l] [select] |
|
|
steph_bow asked "could you tell me what is the signification of (?<!\d) in your code ?"
The purpose of the negative look behind assertions in my regular expression was to ensure that there was a key included in the data value. This was not strictly part of your stated requirement, but it seemed implied and therefore should be enforced.
If this didn't matter, than we could simply rely on non-greedy matching in order to separate the key from the suffix.
while (<DATA>) {
chomp;
if (m{\A (.*?) (\d+ | [A-Z]+) \z}x) {
printf "Key = %-6s - Suffix = %s\n", $1, $2;
} else {
die "Invalid data: $_";
}
}
__DATA__
AAA30
BBC5
SHT12H
DAL33B
BBC49
AAA31
BBC8
BBC3
DAL33A
BBC6
SHT12G
BBC50
However, what about the case of a value of '123456' or 'ABCDEF'? Currently those two values would validate to a key of the empty string "", and a suffix of the entire string. To avoid this, we start by changing the key matching from * (0 or more) to + (1 or more). However, this will just eat up 1 character of our suffix. We therefore add negative look behind assertions to the suffix matching in order to ensure that the suffix is matched on a boundary.
This might have been a little obsessive, but it's always a good idea to validate your data so that you are absolutely sure that the rules of your logic are being followed.
- Miller | [reply] [d/l] |
Re: searching for strings
by RMGir (Prior) on Aug 06, 2007 at 11:02 UTC
|
Your problem _sounds_ complicated, but it's pretty simple, really.
First off, the 2 matching strings have to be the same length, so you can use the length function to check that.
Next, if you use substr to look at the last character of 2 strings, the first n-1 characters have to be equal and the last characters have to differ by 1 when compared using ord, or the strings aren't a match.
Finally, you can use split to split your strings using a sequence of digits as the delimiter, like
my @fields=split /(\d+)/,$str;
Using the /(\d+)/, the delimiter will get captured too, and now you can make sure that the string fields are the same, and the number field differs only by one.
Once you've got all that bundled into a comparison function, all you have to do is scan the list for matches for each string, and pair them up...
I think you can get the rest of the way to your answer from here...
| [reply] |
|
|
Next, if you use substr to look at the last character of 2 strings, the first n-1 characters have to be equal and the last characters have to differ by 1 when compared using ord, or the strings aren't a match.
That doesn't work for 'BBC49' and 'BBC50'.
Also, the process you are describing is O(N2). Feasible, but impractical for large lists.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
|
|
| [reply] |
Re: searching for strings
by Anonymous Monk on Aug 06, 2007 at 11:10 UTC
|
You can sort the input and then compare each string with its neighbors instead of using a hash. | [reply] |
|
|
That would not work as ABC9 and ABC10 would not necessarily be next to each other.
CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
| [reply] [d/l] [select] |
Re: searching for strings
by injunjoel (Priest) on Aug 07, 2007 at 19:35 UTC
|
Late as usual I see...
Here is my suggestion
#!/usr/bin/perl
use strict;
{
local %_;
while(<DATA>){
chomp;
if(/(\d+|[A-Z])$/){
$_{substr($_,0,(length($_)-length($1)))}->{$1} = $_;
}
}
for(sort keys %_){
if(scalar(keys %{$_{$_}}) % 2 == 0){
local @_ = sort{$a <=> $b || ord($a) <=> ord($b)}keys %{$_
+{$_}};
for(my $i = 0; $i < scalar(@_); $i+=2){
if( ($_[$i] =~ /\d$/ && abs($_[$i] - $_[$i+1]) == 1) |
+| (abs(ord($_[$i]) - ord($_[$i+1])) == 1) ){
print $_{$_}->{$_[$i]}.";".$_{$_}->{$_[$i+1]}."\n"
+ ;
print $_{$_}->{$_[$i+1]}.";".$_{$_}->{$_[$i]}."\n"
+ ;
}
}
}
}
}
__DATA__
AAA30
BBC5
SHT12H
DAL33B
BBC49
AAA31
BBC8
BBC3
DAL33A
BBC6
SHT12G
BBC50
Output
AAA30;AAA31
AAA31;AAA30
BBC49;BBC50
BBC50;BBC49
DAL33A;DAL33B
DAL33B;DAL33A
SHT12G;SHT12H
SHT12H;SHT12G
Not exactly the OP's output but close...
Just a thought :)
-InjunJoel
"I do not feel obliged to believe that the same God who endowed us with sense, reason and intellect has intended us to forego their use." -Galileo
| [reply] [d/l] [select] |