nikos has asked for the wisdom of the Perl Monks concerning the following question:

Hello!

I have a string. It looks like 'just a test'. Saw a solution posted before.
$text=~s/([^ ]{3})(?=[^ ])/$1-/g; # split words longer than 3 characters
I get 'jus-t a sam-ple' after I run this code.
That's totally what I'm looking for.

But the problem is I might have special tags in this string like 'ju[color=1]s[/color]t a tes[color=2]t[/color]'. These tags are converted to HTML tags later. I need to skip [...] when I count number of characters in a word. The output should be "ju[color=1]s-[/color]t a tes-[color=2]t[/color]" or "ju[color=1]s[/color]-t a tes-[color=2]t[/color]". Both variants are ok.

It should have a pretty easy solution using regex. But I don't know how to do it.

Thank you!

Replies are listed 'Best First'.
Re: Problem: how to split long words
by ccn (Vicar) on Aug 24, 2004 at 14:56 UTC

    $_ = 'ju[color=1]s[/color]t a sam[color=2]p[/color]le'; my $brackets = qr(\[[^\]]*\]); # text enclosed in brackets my $char = qr([^\[\]\s]); # not spaces or brackets s{ ( # group to $1 (?: $char # a char (?:$brackets*) # followed by any number of brackets ) {3} # 3 times ) (?:$brackets*) # followed by any number of brackets (?= # looking forward to ensure that $char # we have at least one char [^\[\]]* # and we are not inside of a brackets (?: \[ | \z ) ) } {$1-}gx; print;

    Update: fixed "last '-'" issue pointed by ikegami
    P.S. Concerning '[2345678901234567890' I suggest that is not valid input.

      Two small issues: 'ju[color=1]s[/color]t a tesss[color=2]t[/color]' will return a string ending with a '-', and '[2345678901234567890' will result in a 20 letter word.

        I noticed the end '-' thing too. I did add a section to ccn's code that handles that. Its tagged w/ the "# ADDED" comment.

        ++ btw ccn, took me awhile to wrap my head around this one.

        my $brackets = qr(\[[^\]]*\]); # text enclosed in brackets my $char = qr([^\[\]\s]); # not spaces or brackets s{ ( # group to $1 (?: $char # a char (?:$brackets*) # followed by any number of brackets ) {3} # 3 times ) (?!(?:$brackets*)(?:\s|\Z)) # ADDED: eliminates '-' on the end of wo +rds w/ a multiple of 3 chars (?= # looking forward to ensure that [^\[\]]* # we are not inside of a brackets (?: \[ | \z ) ) } {$1-}gx; print;


        -jbWare

        Here is non-regexp solution. It validates input in addition to the main task.

        sub splitlong { my $text = shift; my @result; my ($cnt, $inside) = (0, 0); for (split //, $text) { ($inside = 0), next if $inside and $_ eq ']'; $inside = 1 if !$inside and $_ eq '['; next if $inside; $cnt = /\s/s ? 0 : $cnt + 1; ($cnt = 0), push @result, '-' if $cnt > 3; } continue { push @result, $_; } warn "invalid input: $text" if $inside; return join '', @result; } my $text = 'ju[color=1]s[/color]t a sam[color=2]p[/color]le'; print splitlong($text);
      Thank your for you code.
      It works fine with a valid input i.e [tag].

      If we do not check for correct [] pairs and fix it, it doesn't work. I understand it was not the task but still... We cannot just count the number of brackets in the string to check. The check has to be more complicated checking actually tags. To open a tag use '[', to close the tag use ']' not ']]' or not '[['.

      I think a little check might help. Split the word followed by ']' if the last bracket is not '['.
      $_ = 'sss[[]sssss]sssssss'; print $_."\n"; my $brackets = qr(\[[^\]]*\]); # text enclosed in brackets my $char = qr([^\[\]\s]); # not spaces or brackets s{ ( # group to $1 (?: $char # a char (?:$brackets*) # followed by any number of brackets ) {3} # 3 times ) (?!(?:$brackets*)(?:\s|\Z)) # ADDED: eliminates '-' on the end of wo +rds w/ a multiple of 3 chars (?= # looking forward to ensure that [^\[\]]* # we are not inside of a brackets (?: \[ | \z ) ) } {$1-}gx; print;
      Thank you.
Re: Problem: how to split long words
by ikegami (Patriarch) on Aug 24, 2004 at 14:48 UTC

    So far, I have:

    $text = 'ju[color=1]s[/color]t a tes[color=2]t[/color]'; $text =~ s/ ( (?: (?: \[ [^]]* \] )* [^ [] ){3} (?: \[ [^]]* \] )* ) (?=[^ ]) /$1-/gx; print($text, "\n"); # ju[color=1]s[/color]-t a tes[color=2]-t[/co-lor-]

    It's not quite there yet, but I have to go.

Re: Problem: how to split long words
by nikos (Scribe) on Sep 01, 2004 at 14:13 UTC
    I think this problem and it's solutions should be copied to Q&A when we have everything said and done. It will be a 'must-have' solution for everyone who is doing some kind of web-chats or discussion boards.
Re: Problem: how to split long words
by nikos (Scribe) on Sep 07, 2004 at 13:47 UTC
    The following fix for the code provided by ccn.

    Should be
    ($cnt = 1), push @result, '-' if $cnt > 3; not
    ($cnt = 0), push @result, '-' if $cnt > 3;
    If $cnt = 0 then the 2nd, 3rd... group of characters in each word will be (N+1) long. In the following example the first group will be 3 characters long and all the rest groups will be 4 characters long
    sub splitlong { my $text = shift; my @result; my ($cnt, $inside) = (0, 0); for (split //, $text) { ($inside = 0), next if $inside and $_ eq ']'; $inside = 1 if !$inside and $_ eq '['; next if $inside; $cnt = /\s/s ? 0 : $cnt + 1; ($cnt = 1), push @result, '-' if $cnt > 3; } continue { push @result, $_; } warn "invalid input: $text" if $inside; return join '', @result; } my $text = 'ju[color=1]s[/color]t a sam[color=2]p[/color]le'; print splitlong($text);