ultranerds has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

Ok, let me see if I can explain myself :)

Basically, if we have a string like:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

...we then split this up into sections, like:

Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since
the 1500s, when an unknown printer took a galley of type and scrambled it to make a type
specimen book. It has survived not only five centuries, but also the leap into electronic
typesetting, remaining essentially unchanged. It was popularised in the 1960s with the
release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop
publishing software like Aldus PageMaker including versions of Lorem Ipsum.


Now, for some of those phrases, we wanna remove anything up to and including any of these characters (plus the whitespace that follows):

! , . : and )

For example, this:

typesetting, remaining essentially unchanged. It was popularised in the 1960s with the

..would become:

remaining essentially unchanged. It was popularised in the 1960s with the

Now, I've got a function that does this:

sub cleanup_start_words {

    my $text     = $_[0];
    my @split    = split //, $text;
    my @keywords = split //, $_[1];

    return $text;

    # see if we have any characters we wanna skip in the first 10 characters
    my $remove_at;
    my $do_remove = 0;
    for (my $x = 0; $x < 40; $x++) {
        if ($split[$x] =~ /[\.\!\?,\)\:\:]/) {
            $do_remove = 1;
            $remove_at = $x;
        }
    }

    if ($do_remove) {

        my $i = 0;
        foreach (@split) {
            $i++;
            if ($i > $remove_at) { last; }
            if (m/[\.\!\?,]\)\:/) {
                # print "skipping [last] $_ \n";
                last;
            } else {
                # print "skipping $_ \n";
            }
        }

        # didn't seem to work right when doing it in the foreach above, so get rid of the
        # characters we don't want here
        for (my $ii = 0; $ii < $i; $ii++) {
            shift @split;
        }

        my $tmp = join("",@split);
        $tmp =~ s/^[\.\!\?,\)\:]//;
        $tmp =~ s/^\s+//;

        return $tmp;

    } else {
        return $text;
    }
}


..but I'm wondering if there is maybe a regex we could use, which would be more efficient?

I'm trying to knock off crucial milliseconds from this, cos the new feature has added about 0.1 of a second to each request (it doesn't only consist of the above code - there is a lot else going on =))

Anyone got any suggestions? Please note, this is a non-English site, so it will need to work with accented characters etc.

TIA!

Andy

Replies are listed 'Best First'.
Re: Way to "trim" part of a phrase?
by BioLion (Curate) on Aug 14, 2009 at 08:25 UTC

    Once you have divided up the block into phrases, would not a substitution be the most efficient way of stripping the unwanted bits?

    s/[^\.!\,\:\)]++[\.!\,\:\)]//

    I haven't tested this, it is just a thought - i.e. greedy match non-target characters followed by a single target character, and replace it with nothing?
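
    For reference, a minimal sketch of that substitution wrapped in a subroutine. The name trim_leading_phrase is made up for illustration, and the trailing \s* is an assumption based on the example in the question, where the space after the comma also disappears. Inside a bracketed class none of the five target characters needs a backslash.

```perl
use strict;
use warnings;

# Hypothetical helper (name not from the thread): remove everything up to and
# including the first of ! , . : ) plus any whitespace directly after it.
# A phrase with no target character is returned unchanged.
sub trim_leading_phrase {
    my ($text) = @_;
    $text =~ s/^[^!,.:)]*[!,.:)]\s*//;
    return $text;
}

print trim_leading_phrase(
    'typesetting, remaining essentially unchanged. It was popularised in the 1960s with the'
), "\n";
```

    Anchoring with ^ means the regex engine only ever tries the match at the start of the string.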

    And if you haven't already thought of it, Benchmark is good for comparing any alternative ways you can think of doing this. And if speed is really an issue, Devel::NYTProf is pretty rock and roll for profiling and optimising code! HTH!
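
    A rough sketch of how such a comparison might look with the core Benchmark module. The two candidate subs (by_regex and by_substr) and the sample string are assumptions chosen for illustration, not code from the site in question; the timings depend entirely on the machine and the real data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $sample = 'typesetting, remaining essentially unchanged. It was popularised in the 1960s with the';

# Candidate 1: a single anchored substitution.
sub by_regex {
    my ($text) = @_;
    $text =~ s/^[^!,.:)]*[!,.:)]//;
    return $text;
}

# Candidate 2: find the first target character, then cut at the end of the match.
# $+[0] holds the offset just past the last successful match.
sub by_substr {
    my ($text) = @_;
    return $text =~ /[!,.:)]/ ? substr( $text, $+[0] ) : $text;
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese( -1, {
    regex  => sub { by_regex($sample) },
    substr => sub { by_substr($sample) },
} );
```

    It is worth benchmarking with phrases taken from real traffic, since a short sample like this one may not reflect typical input.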

    Just a something something...
      Hi,

      Thanks for the reply. However, it doesn't seem to do anything :(

              $text =~ s/[^\.!\,\:\)]+[\.!\,\:\)]//;

      pour 2 personnes, achetez deux motos tout le monde je prépare un un projet de voyage assez similaire

      ..still comes out like that, instead of how it should be, with:

      achetez deux motos tout le monde je prépare un un projet de voyage assez similaire

      Any suggestions?

      Re benchmarking - we are already using Benchmark to keep track of speed (as it's a large site, even a small increase in CPU usage can be a major headache for us)

      TIA!

      Andy
        Never mind - there was a typo in your regex (you had ^ inside the [ bit =)) This works:
        my $string = q|pour 2 personnes, achetez deux motos tout le monde je prépare un un projet de voyage assez similaire|;
        print qq|OLD STRING: $string \n|;
        $string =~ s/[^\.!\,\:\)]+[\.!\,\:\)]//;
        print qq|NEW STRING: $string|;
        Thanks again. Andy
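
        A small sketch of the same substitution on the accented example, for anyone following along. In Perl a ^ as the first character inside [ ] negates the class, so [^...] means "any character that is not one of the targets". The `use utf8` and `binmode` lines are assumptions about how the script and its output are encoded; the added ^ anchor and \s* are optional extras, not part of the regex above. Because the five target characters are all plain ASCII, the match itself behaves the same on accented text.

```perl
use strict;
use warnings;
use utf8;                             # this source file contains a literal é
binmode STDOUT, ':encoding(UTF-8)';   # assumed output encoding

my $string = 'pour 2 personnes, achetez deux motos tout le monde je prépare un un projet de voyage assez similaire';

# One or more non-target characters, then one target character,
# then any whitespace that follows it.
( my $trimmed = $string ) =~ s/^[^!,.:)]+[!,.:)]\s*//;

print "OLD STRING: $string\n";
print "NEW STRING: $trimmed\n";
```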
Re: Way to "trim" part of a phrase?
by graff (Chancellor) on Aug 15, 2009 at 02:05 UTC
    I'm not able to understand what you are trying to accomplish with the OP code. I'm especially puzzled by the first "return" statement, which is the fourth expression in the subroutine:
    sub cleanup_start_words {
        my $text = $_[0];
        my @split = split //, $text;
        my @keywords = split //, $_[1];
        return $text;
        # nothing from here down ever gets executed.
    Apart from that, if any of the subsequent lines were to be reached and executed, it seems like they're doing a lot of unnecessary work.

    Maybe I'm misunderstanding what you're really trying to do, but if the goal is to eliminate the initial portion of a string up to and including the first occurrence of any of these five characters: [!,.:)] -- then a quick/easy way would be:

    sub delete_initial_phrase {
        my ( $phrase ) = @_;
        if ( $phrase =~ /([!,.:)])/ ) {
            my $punc = $1;
            return substr( $phrase, 1 + index( $phrase, $punc ));
        }
        else {
            return $phrase;
        }
    }
    That will return the original phrase if it contains none of the targeted characters. When any of those characters is present, it will return the string that starts with the very next character (usually a space).

    On thinking a bit more about the process, you probably want something slightly different, in case you run into examples like this:

    blah blah!) Foo bar, and so on...
    Do you want the return value to be ") Foo bar, and so on...", or would you rather remove the initial paren along with the exclamation? If the latter, just change the initial regex:
    if ( $phrase =~ /([!,.:)]+)/ ) {
    The rest stays mostly the same, but you also have to change the substr call so that it skips the whole run of punctuation:
    return substr( $phrase, length($punc) + index( $phrase, $punc ));

    (updated to fix a couple typos -- and to add the point about modifying the substr call)
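
    Putting the two changes together, a self-contained sketch of the revised subroutine with a couple of sample calls. The test phrases are only illustrations; note that the returned string still starts with whatever follows the punctuation, usually a space.

```perl
use strict;
use warnings;

# Revised version: drop everything up to and including the first run of
# target characters, so that "!)" is removed as a unit.
sub delete_initial_phrase {
    my ( $phrase ) = @_;
    if ( $phrase =~ /([!,.:)]+)/ ) {
        my $punc = $1;
        return substr( $phrase, length($punc) + index( $phrase, $punc ) );
    }
    return $phrase;    # no target character found
}

print '[', delete_initial_phrase('blah blah!) Foo bar, and so on...'), "]\n";
print '[', delete_initial_phrase('no punctuation here'), "]\n";
```

    If the leading space is unwanted, one more s/^\s+// on the result would remove it.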