Yup, the quick and dirty way is to use CPAN (see Yes, even you can use CPAN). If you spend a little time once (possibly less time than you've spent trying to solve this one parsing problem) putting together a distribution system for your users, then you can include whatever modules (CPAN and your own) your script requires with no impact on the users' install experience. That's a one-time cost that can pay back many times over, with the added benefits of reduced development time and probably lower maintenance costs.
Perl reduces RSI - it saves typing
| [reply] |
XML::TokeParser will return tokenized XML. You could write a wrapper around XML::TokeParser if you need the format you listed exactly.
Update: Oops, Sorry, I missed your last paragraph.
| [reply] |
I thought about using HTML::TokeParser to get the attributes out, but I am trying to avoid installing the module on several hundred workstations.
You might as well prepare for the eventuality, especially since apt-get perl-HTML-Parser is available.
Is there a quick and dirty way to get this string tokenized? Thanks
One example is strip HTML tags, another is YAPE::HTML
| [reply] [d/l] |
While agreeing with everything everyone has said above, I thought I might as well post a regex attempt:
use warnings;
use strict;
my $string = q(<a style='postion: top; font:roman' href=hi.html href='bold' src=" image" />);
my @tokens = $string =~ /<?\s*(\w+=([\'\"]).*?\2|[^\s>]+)/g;
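# Each match returns two captures (the token, then the quote character or undef).
# Keep every other element, i.e. only the tokens.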
my $i = 0;
@tokens = grep {++$i % 2} @tokens;
local $" = "\n";
print "@tokens\n";
This gives:
a
style='postion: top; font:roman'
href=hi.html
href='bold'
src=" image"
/
This will only work with this data format, i.e. with the token=quoted data layout. The RE uses a back-reference for the second quote because you might have quotes-in-quotes. This complicates the result array, hence the grep.
You really are better off using a CPAN module for the general case.
| [reply] [d/l] [select] |
If the rule is "no additional modules, please!", maybe the simplest (but, admittedly, very dirty) way is to loop over the individual characters of the string, a kind of "good old C style" solution. That way it is easy to keep track of whether or not you are inside a quote. Note that you can easily get an array of the individual characters with split(//,$string).
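A minimal sketch of that character loop, using only core Perl. The helper name tokenize_chars is made up for illustration, and the sketch assumes a single tag per string in which whitespace and angle brackets outside quotes separate tokens:

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): walk the string one
# character at a time and remember whether we are inside a quote.
sub tokenize_chars {
    my ($string) = @_;
    my @tokens;
    my $current = '';
    my $quote   = '';    # the quote character we are inside, or '' if none

    for my $char (split //, $string) {
        if ($quote) {
            # Inside a quote: keep every character, including spaces,
            # and leave quote mode when the matching quote shows up.
            $current .= $char;
            $quote = '' if $char eq $quote;
        }
        elsif ($char eq q{'} or $char eq q{"}) {
            # Opening quote: remember which kind it is.
            $quote = $char;
            $current .= $char;
        }
        elsif ($char =~ /\s/ or $char eq '<' or $char eq '>') {
            # Assumption: outside a quote, whitespace and angle brackets
            # end the current token.
            push @tokens, $current if length $current;
            $current = '';
        }
        else {
            $current .= $char;
        }
    }
    push @tokens, $current if length $current;
    return @tokens;
}

my $string = q(<a style='postion: top; font:roman' href=hi.html href='bold' src=" image" />);
print "$_\n" for tokenize_chars($string);
```

Because the loop remembers which quote character opened the value, a single quote inside a double-quoted value (or the reverse) does not end the token early.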
A second possibility I could think of would be to use split(' ',$string); that is, you first pretend that you are not interested in handling the quotes correctly. This gives you an array of fields, some of which may contain only one quote. (Because of the "magic" first argument to split, runs of white space are treated as a single separator and produce no empty fields, so the original spacing inside a quoted value is not preserved.) Now you loop through this list. Whenever an element contains only one quote, you have to join it, with a space, to the following elements up to and including the next one that has a quote (and if the syntax of your string is correct, that element will also have exactly one quote). If you need the exact white space preserved, split on /(\s+)/ instead, which keeps the separators as elements of the list.
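The split-and-rejoin idea could look roughly like the sketch below. The helper name tokenize_split is hypothetical, and the sketch assumes one tag per string. It counts both kinds of quote character together, so a value such as title="it's" would confuse it, and whitespace inside a quoted value comes back as a single space:

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): split on whitespace first,
# then glue the pieces of a quoted value back together.
sub tokenize_split {
    my ($string) = @_;

    # Assumption: the string holds exactly one tag, so strip its brackets.
    (my $body = $string) =~ s/^\s*<|>\s*$//g;

    my @tokens;
    my $pending;    # defined while a quoted value is still open
    for my $field (split ' ', $body) {
        # Count the quote characters in this field.
        my $quotes = () = $field =~ /['"]/g;

        if (defined $pending) {
            # We are inside a quoted value: rejoin with a single space.
            $pending .= " $field";
            if ($quotes % 2) {    # odd count closes the value
                push @tokens, $pending;
                undef $pending;
            }
        }
        elsif ($quotes % 2) {
            # Odd count outside a value opens one.
            $pending = $field;
        }
        else {
            push @tokens, $field;
        }
    }
    # Unbalanced quote: keep what was collected rather than lose it.
    push @tokens, $pending if defined $pending;
    return @tokens;
}

my $string = q(<a style='postion: top; font:roman' href=hi.html href='bold' src=" image" />);
print "$_\n" for tokenize_split($string);
```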
Having said this, I personally would prefer the former solution....
--
Ronald Fischer <ynnor@mm.st>
| [reply] [d/l] [select] |