rhymejerky has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a string that looks like the one below (please disregard that it is not a correct tag), and I am trying to figure out a way to tokenize it.

    <a style='postion: top; font:roman' href=hi.html href='bold' src=" image" />

The tokens for this example would be:

    token 1 = a
    token 2 = style='postion: top; font:roman'
    token 3 = href=hi.html
    token 4 = href='bold'
    token 5 = src=" image"
    token 6 = /

It is basically whitespace-delimited, except inside quotes. I thought about using HTML::TokeParser to get the attributes out, but I am trying to avoid installing the module on several hundred workstations. Is there a quick and dirty way to get this string tokenized? Thanks

---------------------
After looking through all the comments, I decided to go with installing the module. It makes more sense to pay for the package installation than to deal with maintaining the regexp. Thanks for the replies.

Replies are listed 'Best First'.
Re: tokenize a string
by GrandFather (Saint) on Oct 08, 2008 at 07:21 UTC

    Yup, the quick and dirty way is to use CPAN (see "Yes, even you can use CPAN"). If you spend a little time once (possibly less time than you've spent trying to solve this one parsing problem) putting together a distribution system for your users, then you can include whatever modules (CPAN and your own) your script requires with no impact on the users' install experience. That's a one-time cost with a possible many-time payback, plus the additional payback of reduced development time and probably lower maintenance costs.


    Perl reduces RSI - it saves typing
Re: tokenize a string
by ikegami (Patriarch) on Oct 08, 2008 at 05:50 UTC

    XML::TokeParser will return tokenized XML. You could write a wrapper around XML::TokeParser if you need the format you listed exactly.

    Update: Oops, Sorry, I missed your last paragraph.

Re: tokenize a string
by Anonymous Monk on Oct 08, 2008 at 06:49 UTC
    "I thought about using HTML::TokeParser to get the attributes out, but I am trying to avoid installing the pm on several hundred ws."
    You might as well prepare for the eventuality, especially since apt-get perl-HTML-Parser is available.

    "Is there a quick and dirty way to get this string tokenized? Thanks"
    One example is strip HTML tags; another is YAPE::HTML.

Re: tokenize a string
by cdarke (Prior) on Oct 08, 2008 at 08:05 UTC
    While agreeing with everything everyone has said above, I thought I might as well:

        use warnings;
        use strict;

        my $string = q(<a style='postion: top; font:roman' href=hi.html href='bold' src=" image" />);

        my @tokens = $string =~ /<?\s*(\w+=([\'\"]).*?\2|[^\s>]+)/g;

        # Get every other token
        my $i = 0;
        @tokens = grep {++$i % 2} @tokens;

        local $" = "\n";
        print "@tokens\n";

    This gives:

        a
        style='postion: top; font:roman'
        href=hi.html
        href='bold'
        src=" image"
        /

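    This will only work with this data format, i.e. with the token=quoted-data layout. The RE uses a back-reference (\2) to match the closing quote, so that a value delimited by one kind of quote may contain the other kind. The extra capture group this requires puts a second element (the quote character, or undef) into the result list for every match, hence the grep.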
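
As a variation on the code above (a sketch, not part of the original reply): matching with /g in scalar context inside a while loop lets you collect $1 directly, so the quote captured by the second group never reaches the result list and the grep is not needed. The regex is the same one; only the way the matches are collected differs.

```perl
use strict;
use warnings;

# Same sample string as in the reply above.
my $string = q(<a style='postion: top; font:roman' href=hi.html href='bold' src=" image" />);

my @tokens;
# In scalar context, m//g finds one match per iteration, so $1 can be
# pushed directly; the quote captured in $2 is simply ignored.
while ($string =~ /<?\s*(\w+=(['"]).*?\2|[^\s>]+)/g) {
    push @tokens, $1;
}

print "$_\n" for @tokens;
```

The same limitation applies: this only handles the name=quoted-value layout of the sample.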

    You really are better off using a CPAN module for the general case.
Re: tokenize a string
by rovf (Priest) on Oct 08, 2008 at 09:05 UTC

    If the rule is "no additional modules, please!", maybe the simplest (but, admittedly, very dirty) way is to loop over the individual characters of the string, a kind of "good old C style" solution. That way it is easy to keep track of whether or not you are inside a quote. Note that you can easily get an array of the individual characters with split(//, $string).
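
A minimal sketch of that character loop, under a few assumptions of mine: the helper name tokenize_chars is hypothetical, the angle brackets are treated as delimiters rather than token content, and a quote is closed only by the same quote character that opened it.

```perl
use strict;
use warnings;

# Hypothetical helper: split on whitespace, except inside '...' or "...".
sub tokenize_chars {
    my ($string) = @_;
    my @tokens;
    my $current = '';
    my $quote   = '';    # the open quote character, or '' when outside quotes

    for my $ch (split //, $string) {
        if ($quote ne '') {
            # Inside quotes: keep everything; the matching quote closes it.
            $current .= $ch;
            $quote = '' if $ch eq $quote;
        }
        elsif ($ch eq q{'} or $ch eq q{"}) {
            # Opening quote: remember which character must close it.
            $quote = $ch;
            $current .= $ch;
        }
        elsif ($ch =~ /[\s<>]/) {
            # Assumption: unquoted whitespace and angle brackets end a token.
            push @tokens, $current if length $current;
            $current = '';
        }
        else {
            $current .= $ch;
        }
    }
    push @tokens, $current if length $current;
    return @tokens;
}

my @tokens = tokenize_chars(
    q(<a style='postion: top; font:roman' href=hi.html href='bold' src=" image" />)
);
print "$_\n" for @tokens;
```

If a quote is never closed, everything up to the end of the string ends up in the last token; whether that is acceptable depends on how malformed the input can be.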

    A second possibility I could think of would be using split(' ', $string); that is, you first pretend that you are not interested in handling the quotes correctly. This gives you an array of fields, some of which may contain only one quote. Note that with the "magic" ' ' first argument, split skips leading whitespace and treats each run of whitespace as a single separator, so there are no empty fields; if you need to preserve the exact spacing inside quotes, split on / / instead, and the extra spaces will show up as empty fields. Now you loop through this list. Whenever an element contains only one quote, you have to join it with the following elements, up to and including the next element that has a quote (and if the syntax of your string is correct, that element will also have exactly one quote). If there are empty elements in between, you have to treat them as spaces.
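
The split-and-rejoin idea could be sketched roughly as follows. The helper name tokenize_split is hypothetical, and the sketch assumes that the surrounding angle brackets should be dropped and that a field is "open" while it contains an odd number of its first quote character.

```perl
use strict;
use warnings;

# Hypothetical helper: split on whitespace first, then glue back together
# the pieces that belong to one quoted value.
sub tokenize_split {
    my ($string) = @_;
    $string =~ s/^\s*<//;    # assumption: drop the surrounding angle brackets
    $string =~ s/>\s*$//;

    my @tokens;
    my $pending;             # a token whose quote is still open, if any
    for my $field (split ' ', $string) {
        $pending = defined $pending ? "$pending $field" : $field;

        # The first quote character seen decides which quote must balance.
        my $open = 0;
        if ($pending =~ /(['"])/) {
            my $q = $1;
            my $count = () = $pending =~ /\Q$q\E/g;
            $open = $count % 2;
        }
        next if $open;       # keep accumulating until the quote closes

        push @tokens, $pending;
        undef $pending;
    }
    push @tokens, $pending if defined $pending;    # unterminated quote
    return @tokens;
}

my @tokens = tokenize_split(
    q(<a style='postion: top; font:roman' href=hi.html href='bold' src=" image" />)
);
print "$_\n" for @tokens;
```

Because the pieces are rejoined with a single space, any run of several whitespace characters inside a quoted value is collapsed to one space. For the sample string this makes no difference.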

    Having said this, I personally would prefer the former solution....

    -- 
    Ronald Fischer <ynnor@mm.st>