tokenize a string

rhymejerky has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a string that looks like the one below (please disregard that it is not a correct tag). I am trying to figure out a way to tokenize it. The string is followed by the tokens I would expect for this example.

<a style='postion: top;   font:roman' href=hi.html href='bold' src=" image" />

token 1 = a
token 2 = style='postion: top; font:roman'
token 3 = href=hi.html
token 4 = href='bold'
token 5 = src=" image"
token 6 = /

It is basically white-space delimited, except where a value is quoted. I thought about using HTML::TokeParser to get the attributes out, but I am trying to avoid installing the pm on several hundred ws. Is there a quick and dirty way to get this string tokenized? Thanks

---------------------

After looking through all the comments, I decided to go with installing the pm. It makes more sense to pay for the package installation than to deal with maintaining the regexp. Thanks for the replies.


Replies are listed 'Best First'.
Re: tokenize a string by GrandFather (Saint) on Oct 08, 2008 at 07:21 UTC
Yup, the quick and dirty way is to use CPAN (see Yes, even you can use CPAN). If you spend a little time once (possibly less time than you've spent trying to solve this one parsing problem) putting together a distribution system for your users, then you can include whatever modules (CPAN and your own) your script requires with no impact on the users' install experience. That's a one-time cost with a potentially many-time payback, plus the additional payback of reduced development time and probably lower maintenance costs.

Perl reduces RSI - it saves typing
Re: tokenize a string by ikegami (Patriarch) on Oct 08, 2008 at 05:50 UTC
XML::TokeParser will return tokenized XML. You could write a wrapper around XML::TokeParser if you need exactly the format you listed.

Update: Oops, sorry, I missed your last paragraph.
Re: tokenize a string by Anonymous Monk on Oct 08, 2008 at 06:49 UTC
"I thought about using HTML::TokeParser to get the attributes out, but I am trying to avoid installing the pm on several hundred ws."

You might as well prepare for the eventuality, especially since `apt-get perl-HTML-Parser` is available.

"Is there a quick and dirty way to get this string tokenized? Thanks"

One example is strip HTML tags; another is YAPE::HTML.
Re: tokenize a string by cdarke (Prior) on Oct 08, 2008 at 08:05 UTC
While agreeing with everything everyone has said above, I thought I might as well try a regex:

    use warnings;
    use strict;

    my $string = q(<a style='postion: top; font:roman' href=hi.html href='bold' src=" image" />);

    # $1 is the whole token, $2 is the opening quote (undef for unquoted tokens)
    my @tokens = $string =~ /<?\s*(\w+=([\'\"]).*?\2|[^\s>]+)/g;

    # Get every other token, i.e. drop the $2 entries
    my $i = 0;
    @tokens = grep {++$i % 2} @tokens;

    # Print one token per line
    local $" = "\n";
    print "@tokens\n";

This gives:

    a
    style='postion: top; font:roman'
    href=hi.html
    href='bold'
    src=" image"
    /

This will only work with this data format, i.e. with the token=quoted data layout. The RE uses a back-reference for the closing quote because you might have quotes-in-quotes. The extra capture group that this needs puts the quote character (or undef) into the result array after every token, hence the grep. You really are better off using a CPAN module for the general case.
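As a variation on the regex above, and not something cdarke proposed, the two quote styles can be written out as separate alternatives inside a non-capturing group. The pattern then has a single capture group, so the match returns one element per token and the every-other-element `grep` is not needed. This is a minimal sketch under the same assumption that the input follows the name=value layout; the sub name `tokenize_tag` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the tokens of one
# tag-like string. The pattern has only one capture group, so a global
# match in list context yields exactly one element per token.
sub tokenize_tag {
    my ($string) = @_;
    my @tokens = $string =~ m{
        (                   # the only capture group: one token, either
            \w+=            #   a name and an equals sign followed by
            (?: '[^']*'     #     a single-quoted value, or
              | "[^"]*"     #     a double-quoted value
            )
          |                 # or
            [^\s<>]+        #   a run of non-blank, non-bracket characters
        )
    }gx;
    return @tokens;
}

my $string = q(<a style='postion: top; font:roman' href=hi.html href='bold' src=" image" />);
print "$_\n" for tokenize_tag($string);
```

A quoted value may still contain the other kind of quote, which is the case the back-reference was guarding against. Anything outside this layout is still better left to a CPAN module.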
Re: tokenize a string by rovf (Priest) on Oct 08, 2008 at 09:05 UTC
If the rule is "no additional modules, please!", maybe the simplest (but, admittedly, very dirty) way is to loop over the individual characters of the string, a kind of "good old C style" solution. That way it is easy to keep track of whether or not you are inside a quote. Note that you can easily get an array of the individual characters with `split(//,$string)`.

A second possibility I could think of would be using `split(/ /,$string)`; that is, you first pretend that you are not interested in handling the quotes correctly. This gives you an array of fields (some of which may contain only one quote), plus some empty fields for runs of white space (a single-space regex, unlike the "magic" `' '` first argument to `split`, does not collapse them). Now you loop through this list. Whenever an element contains only one quote, you have to join it to the next element that has a quote (and if the syntax of your string is correct, that element will also have exactly one quote). If there are empty elements in between, you have to treat them as spaces.

Having said this, I personally would prefer the former solution....

-- Ronald Fischer <ynnor@mm.st>
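The character loop described first might look roughly like the sketch below. It is only an illustration of the idea, not rovf's code: the sub name `tokenize_chars` is hypothetical, and treating `<` and `>` as delimiters outside quotes is an assumption made so that the result matches the tokens listed in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): walks the string one character
# at a time and remembers which quote character, if any, is currently open.
sub tokenize_chars {
    my ($string) = @_;
    my @tokens;
    my $current = '';   # token being built
    my $quote   = '';   # the open quote character, or '' outside quotes

    for my $char (split //, $string) {
        if ($quote ne '') {
            # Inside quotes: keep everything, including white space.
            $current .= $char;
            $quote = '' if $char eq $quote;
        }
        elsif ($char eq q(') or $char eq q(")) {
            # Opening quote: remember it so we know when it closes.
            $quote = $char;
            $current .= $char;
        }
        elsif ($char =~ /\s/ or $char eq '<' or $char eq '>') {
            # Delimiter outside quotes (assumption: angle brackets count too).
            push @tokens, $current if $current ne '';
            $current = '';
        }
        else {
            $current .= $char;
        }
    }
    push @tokens, $current if $current ne '';   # flush the last token
    return @tokens;
}

my $string = q(<a style='postion: top;   font:roman' href=hi.html href='bold' src=" image" />);
print "$_\n" for tokenize_chars($string);
```

Because quoted text is copied character by character, white space inside the quotes is kept exactly as written. An unterminated quote simply swallows the rest of the string into the last token.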