searching for strings

steph_bow has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: searching for strings by BrowserUk (Patriarch) on Aug 06, 2007 at 11:01 UTC
Definitely one of those things that appears on the surface to be trivial, but turns out to be much harder: Here's one way `#! perl -slw use strict; use Data::Dump qw[ pp ]; my %hash; while( <DATA> ) { chomp; my( $pre, $temp ) = m[(^.+?)(\d+\|\D)$]; $temp = $pre . ++$temp; if( exists $hash{ $_ } ) { push @{ $hash{ $_ } }, $_; } elsif( exists $hash{ $temp } ) { push @{ $hash { $temp } }, $_; } else { $hash{ $temp } = [ $_ ]; $hash{ $_ } = [ $_ ]; } } print join ';', @{ $_ } for grep{ @{ $_ } == 2 } values %hash; __DATA__ AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC50` [download] Output: `C:\test>junk DAL33B;DAL33A AAA30;AAA31 BBC49;BBC50 BBC5;BBC6 SHT12H;SHT12G` [download] If you need the doubled-up output, just print them twice with the elements reversed. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]
Re^2: searching for strings by Limbic~Region (Chancellor) on Aug 06, 2007 at 12:48 UTC
BrowserUk, I was trying to think of more edge cases and I think I have found a few steph_bow needs to weigh in on. For instance, what if a string is ABC0? Should ABC1 and ABC-1 be checked? Also, what if the string is XYZ? Should XYY and XYAA be checked? Is the alphabet circular - IOW, should AAA look at AAZ and AAB? Cheers - L~R Corrected XYZ -> XYAA rather than XYZA	[reply]
Re^3: searching for strings by BrowserUk (Patriarch) on Aug 06, 2007 at 17:51 UTC
I have found a few steph_bow needs to weigh in on. Agreed. That said, the test set provided is amazingly complete given how concise it is. There is nothing accidental about either the choice of sample, or its ordering. A lot of thought went into its provision. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]
Re^4: searching for strings by shmem (Chancellor) on Aug 06, 2007 at 18:08 UTC
Re: searching for strings by shmem (Chancellor) on Aug 06, 2007 at 10:45 UTC
What are the tools to use ? substr, ord, chr (or pack and unpack), hashes (see perldata), pre/post in/de-crement (for strings, only increment works), integer/string comparison (== and eq) (see perlop). --shmem _($_=" "x(1<<5)."?\n".q·/)Oo. G°\ / /\_¯/(q / ---------------------------- \__(m.====·.(_("always off the crowd"))."· ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}	[reply]
Re: searching for strings by moritz (Cardinal) on Aug 06, 2007 at 10:12 UTC
If the prefix (ie everything except the last character) is uniq, you could strip off the last character, and use the remaning string as a hash key. In the hash you store a list of matching strings. When you have created the hash, you should iterate through it and check if the +1/-1 condition holds. Perl 6 in German	[reply]
Re: searching for strings by Anno (Deacon) on Aug 06, 2007 at 12:51 UTC
This is a rather complex task. What have you tried so far? The basic tools would be a pattern that splits a line into a head (`AAA BBC DAL33`) and a tail (the final number or single character used for pairing). This could be used: `my( $head, $tail) = /^(.+?)(\d+\|[[:alpha:]])$/;` though your description allows for alternatives. One or more hashes to organize the data in a way that the pairings can easily be extracted. Here is one way to do it: my (%num, %alpha); while ( <DATA> ) { chomp; my ( $head, $tail) = /^(.+?)(\d+\|[[:alpha:]])$/; if ( $tail =~ /^\d+$/ ) { $num{ $head}->{ $tail} = $_; } else { $alpha{ $head}->{ ord $tail} = $_; } } my @pairs = ( map( extract_pairs( $_) => values %num), map( extract_pairs( $_) => values %alpha), ); print "$_\n" for @pairs; exit; sub extract_pairs { my $h = shift; my @pairs; for ( keys %$h ) { for my $partner ( $_ - 1, $_ + 1 ) { if ( exists $h->{ $partner} ) { push @pairs, "$h->{ $_};$h->{ $partner}"; delete @$h{ $_, $partner}; } } } @pairs; } __DATA__ AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC50 [download] This code will ignore singletons for which a pairing partner cannot be found. If there are more than two consecutive pairs for a single head, say if you had `BBC7` besides `BBC5` and `BBC6` the bevavior is undefined. It will pick some pair(s) and may ignore others. Anno	[reply] [d/l] [select]
Re: searching for strings by GrandFather (Saint) on Aug 06, 2007 at 19:01 UTC
Yet another way to do it: use strict; use warnings; my %data; my @reverse; while (<DATA>) { chomp; my ($prefix, $suffix) = /^([A-Z0-9]+?)(\d+\|.)$/; if (exists $data{$prefix}) { my $mapFunc = $suffix =~ /[A-Z]/ ? sub {ord $_[0]} : sub {$_[0 +]}; for (@{$data{$prefix}}) { next if abs ($mapFunc->($suffix) - $mapFunc->($_)) != 1; print "$prefix$_;$prefix$suffix\n"; push @reverse, "$prefix$suffix;$prefix$_\n"; } } push @{$data{$prefix}}, $suffix; } print @reverse; __DATA__ AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC50 [download] Prints: `AAA30;AAA31 DAL33B;DAL33A BBC5;BBC6 SHT12H;SHT12G BBC49;BBC50 AAA31;AAA30 DAL33A;DAL33B BBC6;BBC5 SHT12G;SHT12H BBC50;BBC49` [download] DWIM is Perl's answer to Gödel	[reply] [d/l] [select]
Re: searching for strings by oha (Friar) on Aug 06, 2007 at 11:30 UTC
you can use the magic of ++: `my @d = qw(AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC +50); my $last = ""; foreach my $v ( sort {$a cmp $b} @d) { my $l1 = $last; my $l2 = chop $l1; print "$last,$v\n" if $v eq $l1.++$l2; $last = $v; }` [download]	[reply] [d/l]
Re^2: searching for strings by BrowserUk (Patriarch) on Aug 06, 2007 at 11:40 UTC
Tried that one also. It doesn't work either. `C:\test>perl my @d = qw(AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC +50); my $last = ""; foreach my $v ( sort {$a cmp $b} @d) { my $l1 = $last; my $l2 = chop $l1; print "$last,$v\n" if $v eq $l1.++$l2; $last = $v; } ^Z AAA30,AAA31 DAL33A,DAL33B SHT12G,SHT12H` [download] `BBC5;BBC6` and `BBC49;BBC50` are missing. Working out why is interesting. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]
Re^3: searching for strings by oha (Friar) on Aug 06, 2007 at 13:13 UTC
right sorry! cauz i can't expect the sorting of BBC5 and BBC6 without BBC50 beetween. the following works: `my @d = qw(AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC +50); my %seen; map { print "$seen{$_},$_\n" if $seen{$_}; } map { my $v = $_; my $l = chop $v; $seen{$v.++$l}=$_; } @d;` [download] and nicer, imho :) Oha	[reply] [d/l]
Re^4: searching for strings by FunkyMonk (Bishop) on Aug 06, 2007 at 14:16 UTC
Re: searching for strings by CountZero (Bishop) on Aug 06, 2007 at 15:45 UTC
As a partial answer to your problem, here are two subroutines which give you the previous and next values: use strict; my $plus; my $minus; while (<DATA>) { chomp; my $next = next_seq($_); my $previous = previous_seq($_); print "$_, $next, $previous\n"; } sub next_seq { my $pattern = shift; chomp $pattern; if ($pattern =~ m/(.?)(\d+)$/) { return $1 . ($2 + 1); } else { $pattern =~ m/(.)(.)/; return $1 . chr(ord($2) + 1); } } sub previous_seq { my $pattern = shift; chomp $pattern; if ($pattern =~ m/(.?)(\d+)$/) { return $1 . ($2 - 1); } else { $pattern =~ m/(.)(.)/; return $1 . chr(ord($2) - 1); } } __DATA__ AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC50 ABCZ [download] Warning: I'm not sure it will work with unicode-strings and anyhow it will only take into account the last character (unless the pattern ends in a one or more digits). So `@` precedes `A` and `[` follows `Z`. Now all you have to do is go through the list and search for items which match the next or previous values. CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James	[reply] [d/l] [select]
Re: searching for strings by shoness (Friar) on Aug 06, 2007 at 12:06 UTC
What do you do when the pattern you search for is at a boundary, taking either increment or decrement out of range? In your example above, you match near 'AAA30'. Do you only look for 'AAA30' and 'AAA31'? Do you also look for 'AAA29'? If so, the hash-key idea won't work, because you're not only changing the last character. You could do something like this for each of pattern: `my ($pat, $char) = $pattern =~ m/(.*)(.)$/; my $char_up = $char; my $char_down = $char; ++$char_up unless ($char =~ m/[9Z]/i); --$char_down unless ($char =~ m/[0A]/i); my $search = qr/$pat$char\|$pat$char_up\|$pat$char_down/; print $search;` [download] How many patterns do you have anyway? How many lines do you search?	[reply] [d/l]
Re^2: searching for strings by CountZero (Bishop) on Aug 06, 2007 at 15:21 UTC
the `--` decrement operator is not magical for non-digits. CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James	[reply] [d/l]
Re^3: searching for strings by shoness (Friar) on Aug 07, 2007 at 09:06 UTC
Sorry! How strange of Perl to be asymmetric! Why is "--" not as magical as "++"? This will work instead... `$char_up =~ tr/A-Z0-9/B-ZZ1-99/; $char_down =~ tr/A-Z0-9/AA-Y00-8/;` [download]	[reply] [d/l]
Re^4: searching for strings by oha (Friar) on Aug 07, 2007 at 11:10 UTC
Re: searching for strings by Zielony (Acolyte) on Aug 06, 2007 at 13:41 UTC
`#!/usr/bin/perl use strict; sub simstr { $_ = shift; if (/([a-z])$/i) { my $ch = $1; $ch++; $ch =~ s/.*(.)$/$1/; substr ($_, -1) = $ch; } elsif (/\d+$/) { $_++; } $_; } for (<DATA>) { chomp; print "$_;" . simstr ($_) . "\n"; } __DATA__ AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC50` [download]	[reply] [d/l]
Re^2: searching for strings by Philem (Acolyte) on Aug 06, 2007 at 14:43 UTC
Consider these two data elements: `CBG99 and CBG100`. When they are placed into your code, the match is `CBG99;CBH00` Very fun problem here! :-) -P	[reply] [d/l] [select]
Re^3: searching for strings by Zielony (Acolyte) on Aug 06, 2007 at 16:00 UTC
`#!/usr/bin/perl use strict; sub simstr { $_ = shift; /([a-z]\|\d+)$/i; my $ch = $1; $ch++; $ch =~ s/.*(.)$/$1/ if ($ch =~ /[a-z]/i); substr ($_, -(length $1)) = $ch; $_; } for (<DATA>) { chomp; print "$_;" . simstr ($_) . "\n"; } __DATA__ AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC50 CBG99 WXYZ` [download] Now it works and isn't even a bit funny. ;-)	[reply] [d/l]
Re: searching for strings by ww (Archbishop) on Aug 06, 2007 at 15:34 UTC
The code below is probably inelegant and perhaps even wrongheaded but I'm posting, in part, TIMTOWTDI; in part, to illustrate some of the well-taken issues with the likes of AAA29 vs AAA30 (above), and in part, because even though this is NOT a solution, $work calls: #!usr/bin/perl use strict; use warnings; use vars qw (@fields $field $seen); while (<DATA>) { chomp $_; push @fields, $_; } for $field(@fields) { $seen = $field; our $base_seen = substr($seen, 0, -1); # all but last char of $se +en our $root = chop($seen); # get last char of $seen + test1($root, $base_seen); } #####subs sub test1 { # Excluding the last char, test whether the f +ields match for $field(@fields) { our $base_field = substr($field, 0, -1); # all but last char of +$field if ($main::base_seen eq $base_field) { test2($main::root, $main::base_seen, $field); } } } sub test2 { # bases matched, now test last char for a matc +h +/- 1 my $root = shift(@_); my $t1 = (shift(@_) . $root); my $test1 = shift(@_); if ( $t1 eq $test1 ) { # Returning, t1 is identical to test1 (haven't figured out how t +o skip this comparison entirely) return; } else { # print "\nIn else of test2, ready to cmp: $t1, $test1\n"; my $rev_last_t1 = reverse($t1); # reverse, for e +ase of m// my $rev_last_test1 = reverse($test1); #test "last" char which is now the initial char in $rev_last_t1 +via reverse my $last_rlt1 = ord($rev_last_t1 =~ /(.)/); # numify alphas + for +/- below my $last_test1 = ord($rev_last_test1 =~ /(.)/); if ( $last_rlt1 == ( ($last_test1++) \|\| ($last_test1--) ) ) { print "same base and adjacent terminal chars: $t1;$test1\n"; } } } # end # one value added to OP's data: __DATA__ AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC50 AAA37 AAA29 [download] OUTPUT: same base and adjacent terminal chars: AAA30;AAA31 same base and adjacent terminal chars: AAA30;AAA37 # False positive same base and adjacent terminal chars: BBC5;BBC6 same base and adjacent terminal chars: SHT12H;SHT12G same base and adjacent terminal chars: DAL33B;DAL33A same base and adjacent terminal chars: AAA31;AAA30 same base and adjacent terminal chars: AAA31;AAA37 # False positive same base and adjacent terminal chars: DAL33A;DAL33B same base and adjacent terminal chars: BBC6;BBC5 same base and adjacent terminal chars: SHT12G;SHT12H same base and adjacent terminal chars: AAA37;AAA30 # False positive same base and adjacent terminal chars: AAA37;AAA31 # False positive [download] NOTE1: Fails on AAA29 vs AAA30 and on BBC49 vs BBC50 for (the intended?) reading of OP's definition; if OP had specified "digit" instead of "number" this would not be a failure, but, as is, makes the problem challenging NOTE2: Fails on AAA37 vs (AAA30 or AAA31) by any reading; this is a glitch in the code /me offers above. NOTE3: Use any one of several mechanisms, including hashes, to eliminate dupes such as "AAA30;AAA31" and "AAA31;AAA30")	[reply] [d/l] [select]
Re: searching for strings by wind (Priest) on Aug 06, 2007 at 20:56 UTC
There are two things that make this much easier. One, is to iterate over the data twice, the first time to set up your relationships, the second time to report their existance. And two, rely solely on incrementing +1, as it is the easier of the relationships to calculate because of the special magic of ++ on strings. You still must define the characteristics of your boundary conditions though. Ideally, we want the increment and decrement relationships to be one to one. What is the increment of "Bar999"? If it is "Bar1000", then your relationship is no longer 1 to 1 as there is also "Bar0999". Anyway, here is an implementation without the boundary conditions fully defined. I've added two data entries for lines that do not match: use Fcntl qw(SEEK_SET); use strict; # Calculate Relationships. # - Rely on increment, as it's the easier of the two to calculate. my %decrement; my %increment; # Synonymous with existance. my $start_of_DATA = tell DATA; while (<DATA>) { chomp; my $item = $_; # Note concerning parsing # - This regex requires that a prefix exist. if ($item =~ m{ (\w+) ( (?<!\d)\d+ \| (?<![A-Z])[A-Z]+ ) \z}x) { my ($prefix, $suffix) = ($1, $2); # Note: this is the primary spot where there might be changes +in rules. # - What happens when the character ends in 'Z'? Currently # That would translate to 'AA'. # - What happens when number is 999? Currently that would tra +nslate # to '1000'. # Fix the rules here, and everything else will translate. (my $suffix_next = $suffix)++; my $item_next = $prefix . $suffix_next; $increment{$item} = $item_next; $decrement{$item_next} = $item; } else { die "Invalid data: $_"; } } # Reparse DATA seek(DATA, $start_of_DATA, SEEK_SET); while (<DATA>) { chomp; print; if ($increment{ $decrement{ $_ } }) { print ";$decrement{$_}"; } elsif ($increment{ $increment{ $_ } }) { print ";$increment{$_}"; } print "\n"; } __DATA__ AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 BBC8 BBC3 DAL33A BBC6 SHT12G BBC50 [download] And the output is: `>perl scratch.pl AAA30;AAA31 BBC5;BBC6 SHT12H;SHT12G DAL33B;DAL33A BBC49;BBC50 AAA31;AAA30 BBC8 BBC3 DAL33A;DAL33B BBC6;BBC5 SHT12G;SHT12H BBC50;BBC49` [download] - Miller	[reply] [d/l] [select]
Re^2: searching for strings by wind (Priest) on Aug 10, 2007 at 08:32 UTC
steph_bow asked "could you tell me what is the signification of (?<!\d) in your code ?" The purpose of the negative look behind assertions in my regular expression was to ensure that there was a key included in the data value. This was not strictly part of your stated requirement, but it seemed implied and therefore should be enforced. If this didn't matter, than we could simply rely on non-greedy matching in order to separate the key from the suffix. `while (<DATA>) { chomp; if (m{\A (.?) (\d+ \| [A-Z]+) \z}x) { printf "Key = %-6s - Suffix = %s\n", $1, $2; } else { die "Invalid data: $_"; } } __DATA__ AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 BBC8 BBC3 DAL33A BBC6 SHT12G BBC50` [download] However, what about the case of a value of '123456' or 'ABCDEF'? Currently those two values would validate to a key of the empty string "", and a suffix of the entire string. To avoid this, we start by changing the key matching from (0 or more) to + (1 or more). However, this will just eat up 1 character of our suffix. We therefore add negative look behind assertions to the suffix matching in order to ensure that the suffix is matched on a boundary. This might have been a little obsessive, but it's always a good idea to validate your data so that you are absolutely sure that the rules of your logic are being followed. - Miller	[reply] [d/l]
Re: searching for strings by RMGir (Prior) on Aug 06, 2007 at 11:02 UTC
Your problem _sounds_ complicated, but it's pretty simple, really. First off, the 2 matching strings have to be the same length, so you can use the length function to check that. Next, if you use substr to look at the last character of 2 strings, the first n-1 characters have to be equal and the last characters have to differ by 1 when compared using ord, or the strings aren't a match. Finally, you can use split to split your strings using a sequence of digits as the delimiter, like my @fields=split /(\d+)/,$str; Using the /(\d+)/, the delimiter will get captured too, and now you can make sure that the string fields are the same, and the number field differs only by one. Once you've got all that bundled into a comparison function, all you have to do is scan the list for matches for each string, and pair them up... I think you can get the rest of the way to your answer from here... Mike	[reply]
Re^2: searching for strings by BrowserUk (Patriarch) on Aug 06, 2007 at 11:12 UTC
Next, if you use substr to look at the last character of 2 strings, the first n-1 characters have to be equal and the last characters have to differ by 1 when compared using ord, or the strings aren't a match. That doesn't work for 'BBC49' and 'BBC50'. Also, the process you are describing is O(N²). Feasible, but impractical for large lists. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]
Re^2: searching for strings by Limbic~Region (Chancellor) on Aug 06, 2007 at 13:23 UTC
RMGir, First off, the 2 matching strings have to be the same length, so you can use the length function to check that. I think the problem is more complicated than you think. As BrowserUk points out - the OP has indicated 'BBC49' and 'BBC50' is a match. Even if you corrected for that, I think your first discriminating factor would be wrong. Consider ABC99 and ABC100. Cheers - L~R	[reply]
Re: searching for strings by Anonymous Monk on Aug 06, 2007 at 11:10 UTC
You can sort the input and then compare each string with its neighbors instead of using a hash.	[reply]
Re^2: searching for strings by CountZero (Bishop) on Aug 06, 2007 at 15:20 UTC
That would not work as `ABC9` and `ABC10` would not necessarily be next to each other. CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James	[reply] [d/l] [select]
Re: searching for strings by injunjoel (Priest) on Aug 07, 2007 at 19:35 UTC
Late as usual I see... Here is my suggestion #!/usr/bin/perl use strict; { local %_; while(<DATA>){ chomp; if(/(\d+\|[A-Z])$/){ $_{substr($_,0,(length($_)-length($1)))}->{$1} = $_; } } for(sort keys %_){ if(scalar(keys %{$_{$_}}) % 2 == 0){ local @_ = sort{$a <=> $b \|\| ord($a) <=> ord($b)}keys %{$_ +{$_}}; for(my $i = 0; $i < scalar(@_); $i+=2){ if( ($_[$i] =~ /\d$/ && abs($_[$i] - $_[$i+1]) == 1) \| +\| (abs(ord($_[$i]) - ord($_[$i+1])) == 1) ){ print $_{$_}->{$_[$i]}.";".$_{$_}->{$_[$i+1]}."\n" + ; print $_{$_}->{$_[$i+1]}.";".$_{$_}->{$_[$i]}."\n" + ; } } } } } __DATA__ AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 BBC8 BBC3 DAL33A BBC6 SHT12G BBC50 [download] Output `AAA30;AAA31 AAA31;AAA30 BBC49;BBC50 BBC50;BBC49 DAL33A;DAL33B DAL33B;DAL33A SHT12G;SHT12H SHT12H;SHT12G` [download] Not exactly the OP's output but close... Just a thought :) -InjunJoel "I do not feel obliged to believe that the same God who endowed us with sense, reason and intellect has intended us to forego their use." -Galileo	[reply] [d/l] [select]