Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
to get something out of HTML tables and lists I do:

    sub splt {
        my $s = shift;
        $s =~ s/\x0A|\x0D//gs;                # make one line
        return (split /<td|<tr|<li/gsi, $s);  # return array
    }

But this takes a long time. I assume this is because of many copies of the slowly growing array.
Is there an easy way to speed this up?
Thanks in advance,
Carl

2005-03-22 Janitored by Arunbear - replaced pre tags with code tags, to allow code extraction

Replies are listed 'Best First'.
Re: speed up split?
by Roy Johnson (Monsignor) on Mar 22, 2005 at 16:42 UTC
    I'm pretty sure you don't want the /g option to your regex in split, and the /s option is pointless with the pattern you're using. Also, your split will be throwing away those opening tag fragments. Is that what you want it to do?
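
To illustrate the point about discarded tag fragments, here is a minimal sketch. The sample string is made up for illustration, and the zero-width lookahead is just one possible way to keep the tag starts, not necessarily what the poster needs:

```perl
use strict;
use warnings;

# Hypothetical sample input, not taken from the original post.
my $html = '<tr><td>one</td><td>two</td></tr><li>three</li>';

# Plain split: the matched separators ("<td", "<tr", "<li") are discarded.
my @plain = split /<td|<tr|<li/i, $html;

# Zero-width lookahead: split *before* each tag start, so nothing is consumed.
my @kept = split /(?=<td|<tr|<li)/i, $html;

print "plain: [$_]\n" for @plain;
print "kept:  [$_]\n" for @kept;
```

With the lookahead version, joining the pieces back together gives the original string, so each piece still shows which tag it started with.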

    Your s/// would be better written with tr///.
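
If you want to check whether this matters for your data, the core Benchmark module can compare the two. This is only a sketch: the test data and the sub names strip_with_s and strip_with_tr are invented here, and the timings will depend on your Perl and your input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: many short lines with CRLF line endings.
my $text = join '', map { "<tr><td>row $_</td></tr>\r\n" } 1 .. 10_000;

# Both subs work on a copy, so $text itself is never modified.
sub strip_with_s  { my $s = shift; $s =~ s/\x0A|\x0D//g; return $s }
sub strip_with_tr { my $s = shift; $s =~ tr/\x0A\x0D//d; return $s }

# Sanity check: both routines should produce the same string.
die "results differ" unless strip_with_s($text) eq strip_with_tr($text);

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    's///'  => sub { strip_with_s($text) },
    'tr///' => sub { strip_with_tr($text) },
});
```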

    The HTML parsing portion of your program might be better replaced with HTML::TokeParser::Simple.


    Caution: Contents may have been coded under pressure.
      well,
      in my tables and lists the meaning of an item is determined by the items before it. It is not just a matter of searching for some strings. And programming such a parser seems to me as complicated as doing it myself line by line.

      I would like to try tr///, though I feel a little uncertain. To get rid of all \n and \r, is this correct?

         $s =~ tr/\x0D\x0A//d;
      
      Or is there a better solution?
      Thanks in advance,
      Carl
        Your use of tr is correct. It's a little easier to read if you just say
        tr/\r\n/ /s;
        I am replacing runs of \n and/or \r with single spaces.
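
The difference between the two tr forms can be seen side by side in a small sketch. The sample string is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample with mixed line endings.
my $text = "first\r\nsecond\n\nthird";

# /d deletes every listed character, so adjacent words run together.
(my $deleted = $text) =~ tr/\r\n//d;

# Both \r and \n map to a space; /s squeezes each run into a single space.
(my $squeezed = $text) =~ tr/\r\n/ /s;

print "$deleted\n";    # firstsecondthird
print "$squeezed\n";   # first second third
```

The (my $copy = $text) =~ tr/.../ idiom modifies a copy and leaves $text alone.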

Re: speed up split?
by brian_d_foy (Abbot) on Mar 22, 2005 at 16:52 UTC

    What are you trying to do?

    What you aren't trying to do is make whitespace disappear, so you really don't need to get rid of newlines, although if you do, you should replace them with a space. Words tend to get smashed together otherwise.

    I'm not sure what you want to do in the split(), but you probably don't want the /g modifier, and the /s modifier only matters if you use a . in the pattern (and the target string has a newline, but you just got rid of all of those).
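
A quick sketch of the /s point, using a made-up input string: the modifier changes nothing for a pattern without a dot, and only decides whether a dot may match a newline.

```perl
use strict;
use warnings;

my $text = "<td>a\nb";   # hypothetical input containing a newline

# No "." in the pattern: /s makes no difference to the split.
my @without = split /<td|<tr|<li/i,  $text;
my @with    = split /<td|<tr|<li/is, $text;
print "same result\n" if "@without" eq "@with";

# With "." in the pattern, /s decides whether "." may match "\n".
print "no /s: ", ($text =~ /a.b/  ? "match" : "no match"), "\n";
print "/s:    ", ($text =~ /a.b/s ? "match" : "no match"), "\n";
```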

    You might do better with HTML::Parser, but I'm not sure what you are doing.

    --
    brian d foy <bdfoy@cpan.org>