Regexp string concat

Murcia has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Regexp string concat by delirium (Chaplain) on Jun 14, 2004 at 21:32 UTC
I think we can solve this without doing any fancy sorting or iterating. How about a simple regular expression approach using upper and lower case characters as our markers? `#!/usr/bin/perl use strict; use warnings; $\ = $/; my $string = 'eichenbaumschule'; my @query = qw(baum ums eic chu le); $string =~ s/$_/uc $_/ieg for @query; @query = $string =~ /([A-Z]+)/g; print for @query; __OUTPUT__ EIC BAUMSCHULE` [download]	[reply] [d/l]
Re^2: Regexp string concat by BrowserUk (Patriarch) on Jun 14, 2004 at 23:12 UTC
This is really neat++. As the target is genome work which usually involves large volumes of big strings, using `split '[a-z]+', string;` to avoid the capture brackets will save a little time. `sub delirium{ my( $string, $qRef ) = @_; $string =~ tr[A-Z][a-z]; $string =~ s/$_/uc $_/ieg for @$qRef; return reduce{ length $a > length $b ? $a : $b } split '[a-z]+',; }` [download] Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail	[reply] [d/l] [select]
Re^3: Regexp string concat by Jasper (Chaplain) on Jun 21, 2004 at 09:38 UTC
An optimisation I've thought of for this, even though the thread is long dead, is to lc the original string, and have all the strings to match as upper case, then all you need do is: `$string =~ s/$_/$_/i for @$qRef;` I'm sure the regex is considerably faster for having taken the executable bit out of it.	[reply] [d/l]
Re^2: Regexp string concat by runrig (Abbot) on Jun 14, 2004 at 22:33 UTC
For a slight speed boost, use compiled regular expressions: `my $string = 'eichenbaumschule'; my @query = qw(baum ums eic chu le); @query = map { qr/($_)/i } @query; $string =~ s/$_/\U$1/g for @query;` [download] (I get ~6000 iterations/sec this way vs. ~3000 your way). Update: Just realized, this is only a factor if you do the same substitutions multiple times in the same program...if you're only doing this once per run, then you may as well do it delirium's way :-)	[reply] [d/l]
Re: Regexp string concat by CombatSquirrel (Hermit) on Jun 14, 2004 at 17:37 UTC
I hacked this piece of code together. It may work for you or it may not. I think it is pretty straightforward, except for the two-level Schwartzian transform-like anonymous array nesting (Data::Dumper should help, though)... #!perl use strict; use warnings; my $string = "EICHENBAUMSCHULE"; my @query = qw/EIC BAUM UMS CHU LE/; @query = sort { $a->[1] <=> $b->[1] \|\| $a->[2] <=> $b->[2] } grep { $_->[1] > -1 } map { [$_, index($string, $_), index($string, $_) + length($ +_) - 1] } @query; my @matches; @query or die "No matches"; my $i = -1; my $p = -2; for (@query) { if ($p + 1 < $_->[1]) { ++$i; } push @{$matches[$i]}, $_; $p = $_->[2]; } @matches = sort { $b->[0] <=> $a->[0] } map { [$_->[-1]->[2] - $_->[0]->[1] + 1, $_] } @matches; print "Longest continuous match has length $matches[0]->[0]\n\n"; print "$string\n"; for (@{$matches[0]->[1]}) { print ' ' x $_->[1] . $_->[0] . "\n"; } [download] Hope this helped. CombatSquirrel. Entropy is the tendency of everything going to hell.	[reply] [d/l]
Re: Regexp string concat by dragonchild (Archbishop) on Jun 14, 2004 at 17:16 UTC
First - doesn't BioPerl have a few functions that will do this? As for code ... you still don't have all the requirements laid out yet. Specifically: Do the matches have to be in the order specified? For example, what should @new_query contain if $string = 'EICHENLEBAUMSCHU'? What happens if something in @query isn't there? For example, $string = 'XXX'. What happens if there is more than one match? For example, $string = 'EICBAUMEIC'. What happens if something isn't there and there's two matches on something else? What happens if one element in @query fits into another element? For example, @query = qw( BAUM AU ). What happens if there are two possible solutions? For example, $string = 'BAABAB' and @query = qw( AA BA ). I'm sure there are other possibilities that I didn't think of in the first five minutes of looking at the problem. Remember - if you can't explain the problem to a teddy bear, you can't explain the problem to a computer. ------ We are the carpenters and bricklayers of the Information Age. Then there are Damian modules.... sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon.* - flyingmoose I shouldn't have to say this, but any code, unless otherwise stated, is untested	[reply]
Re^2: Regexp string concat by Murcia (Monk) on Jun 14, 2004 at 17:28 UTC
First - doesn't BioPerl have a few functions that will do this? I don't think so. As for code ... you still don't have all the requirements laid out yet. Specifically: * Do the matches have to be in the order specified? For example, what should @new_query contain if $string = 'EICHENLEBAUMSCHU'? @new_query = qw( EIC, LEBAUMSCHU); * What happens if something in @query isn't there? For example, $string = 'XXX'. @new_query = @query * What happens if there is more than one match? For example, $string = 'EICBAUMEIC'. will not be the case * What happens if something isn't there and there's two matches on something else? remove double * What happens if one element in @query fits into another element? For example, @query = qw( BAUM AU ). @new_query = qw(BAUM) * What happens if there are two possible solutions? For example, $string = 'BAABAB' and @query = qw( AA BA ). will not be the case I'm sure there are other possibilities that I didn't think of in the first five minutes of looking at the problem. Remember - if you can't explain the problem to a teddy bear, you can't explain the problem to a computer. >thanks for good comments Murcia	[reply]
Re^3: Regexp string concat by dragonchild (Archbishop) on Jun 14, 2004 at 17:41 UTC
Heh. Your requirements seemingly contract each other. Specifically, you state that @new_query = @query if a value in @query isn't matched. However, you say that if something isn't matched and something else is matched twice, remove the double. But, you say that there can never be a double match. In other words, it doesn't sound like you can explain the requirements to a teddybear. I would spend some time with pencil and paper before going much further. Then, when you have pencil and paper down, convert those into test cases before writing any code. ------ We are the carpenters and bricklayers of the Information Age. Then there are Damian modules.... sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon.* - flyingmoose I shouldn't have to say this, but any code, unless otherwise stated, is untested	[reply]
Re: Regexp string concat by BrowserUk (Patriarch) on Jun 14, 2004 at 18:13 UTC
Sorry for the false start, hopefully this does the job. `#! perl -slw use strict; use List::Util qw[ reduce ]; sub longestComposite{ my( $string, $qRef ) = @_; my $copy = ' ' x length $string; $string =~ s[($_)]{ my $len = length $1; substr( $copy, pos( $string ), $len ) = $1; $1; }ge for @$qRef; return reduce{ length $a > length $b ? $a : $b } split ' ', $copy; } my $string ="EICHENBAUMSCHULE"; my @query = qw( EIC BAUM UMS CHU LE ); print longestComposite( $string, \@query ); __END__ P:\test>366569 BAUMSCHULE` [download] No promises about efficiency if you're doing this on large volumes of data. Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail	[reply] [d/l]
Re: Regexp string concat by BrowserUk (Patriarch) on Jun 14, 2004 at 17:00 UTC
This doesn't handle any error conditions, like if the query contains non-matches, but it may give you a starting point. `#! perl -slw use strict; sub display{ my $string = shift; return $string . $/, map{ ' ' x index( $string, $_ ) . "$_\n" } @_; } my $string ="EICHENBAUMSCHULE"; my @query = qw( EIC BAUM UMS CHU LE ); print display( $string, @query ); __END__ P:\test>366569 EICHENBAUMSCHULE EIC BAUM UMS CHU LE` [download] Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail	[reply] [d/l]
Re^2: Regexp string concat by Murcia (Monk) on Jun 14, 2004 at 17:13 UTC
but this does not give me the longest concatenated string "BAUMSCHULE"	[reply]
Re: Regexp string concat by davidj (Priest) on Jun 14, 2004 at 16:53 UTC
questions regarding the resulting query_string: do the overlapping and/or touching matches have to be in the same order in which they are indicated in the @query array. That is, given: `my $string ="EICHENBAUMSCHULE"; my @query = qw(EIC BAUM UMS CHU LE);` [download] would the following also match `<BR> BAUMSLECHU # the CHU and LE are not in the same order as indicated in + @query` [download] davidj	[reply] [d/l] [select]
Re: Regexp string concat by Happy-the-monk (Canon) on Jun 14, 2004 at 15:15 UTC
(please fix that missing closing `</pre>`-tag in your node. Thank you.) Sounds like homework to me. I don't think having your homework done by the community is the right way to get through school - or life for that matter. Tell us your thoughts on that matter and tell us your difficulties or problems, how you tried to solve them, where your insecurities are: we will help you along then. Cheers, Sören	[reply]
Re^2: Regexp string concat by Murcia (Monk) on Jun 14, 2004 at 15:25 UTC
S�ren, thanks about your comment ... I found a more or less working solution, but I do not like the code I wrote ... and do not want to annoy with it so I start just giving the task The real task is coming from a Genome project ... Displaying peptides in a protein sequence ...	[reply]
Re: Regexp string concat by Jasper (Chaplain) on Jun 15, 2004 at 14:41 UTC
Bit-wise oring is your friend. Although the answer comes out lower-cased :( - unless I change those pesky spaces to '@'s - which I just did. `my $string ="EICHENBAUMSCHULE"; my @query = qw(EIC BAUM UMS CHU LE); my $matchline = '@' x length $string; for $match(@query) { my $spacer = index $string, $match; $matchline \|= (('@' x $spacer) . $match) if $spacer > -1; } print for sort { length($b) <=> length($a) } $matchline =~ /[^@]+/g;` [download]	[reply] [d/l]