parsing question

Washie101 has asked for the wisdom of the Perl Monks concerning the following question:

Re: parsing question by kilinrax (Deacon) on May 28, 2003 at 08:50 UTC
Unfortunately your question isn't terribly clear, so I'm not entirely sure what you're looking for. However, one thing I would suggest: if you want to match after the last occurrence of something, it may be easier to apply a regex to a reversed string, e.g.: `my $reverse = reverse $line; $reverse =~ s| \w* ; \s* > |>|x; $line = reverse $reverse;`
Re: Re: parsing question by Chady (Priest) on May 28, 2003 at 09:30 UTC
or maybe a greedy regex? `$line =~ s/^(.*>) ;.*/\1;/;`
He who asks will be a fool for five minutes, but he who doesn't ask will remain a fool for life. Chady | http://chady.net/
Re: Re: Re: parsing question by kilinrax (Deacon) on May 28, 2003 at 11:06 UTC
In a word, no. Reversing the regex is much faster. Have a look at these benchmarks:

    #!/usr/bin/perl -w
    use strict;
    use Benchmark;

    my $string = "<<HTML>;nbsp dont_strip_me</HTML>> <xyzfdgfghgf> ;strip_me";

    sub reversed {
        my $reverse = reverse(shift);
        $reverse =~ s| \w* ; \s* > |>|x;
        return scalar reverse $reverse;
    }

    sub greedy {
        my $line = shift;
        $line =~ s|^ (.*>) \s* ; \w* |$1|x;
        return $line;
    }

    print "Reversed: ", reversed($string), "\n";
    print "Greedy: ", greedy($string), "\n";

    timethese( -10, {
        reversed => sub { reversed( $string ) },
        greedy   => sub { greedy( $string ) },
    } );

Output:

    Reversed: <<HTML>;nbsp dont_strip_me</HTML>> <xyzfdgfghgf>
    Greedy: <<HTML>;nbsp dont_strip_me</HTML>> <xyzfdgfghgf>
    Benchmark: running greedy, reversed, each for at least 10 CPU seconds...
        greedy: 10 wallclock secs ( 9.98 usr + 0.02 sys = 10.00 CPU) @ 78480.80/s (n=784808)
      reversed: 11 wallclock secs (10.46 usr + 0.00 sys = 10.46 CPU) @ 167660.04/s (n=1753724)

As you can see, it's over twice the speed. On longer strings, the difference would be even greater. Also, your regex is wrong. Read through perlre (specifically, the section marked '`Warning on \1 vs $1`') to discover why.
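For readers who don't have perlre to hand, here is a minimal sketch of the `\1` versus `$1` point. The sample string and variable names are made up for illustration. As far as I recall the docs, `\1` on the replacement side is only tolerated as a sed-style habit, and under `-w` Perl warns that it is "better written as $1".

```perl
use strict;
use warnings;

my $line = 'item 42';    # hypothetical sample data

# On the replacement side, $1 is the documented way to refer to a capture.
( my $plain = $line ) =~ s/(\d+)/<$1>/;
print "$plain\n";

# ${1} keeps the capture separate from literal digits that follow it.
( my $padded = $line ) =~ s/(\d+)/${1}000/;
print "$padded\n";

# With /e the replacement is evaluated as Perl code, so it has to be $1.
( my $next = $line ) =~ s/(\d+)/$1 + 1/e;
print "$next\n";
```

The last two cases are where a backslash reference stops being a matter of style: `\1000` is ambiguous, and a backslash reference inside an `/e` replacement is not a capture variable at all.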
Re: parsing question by Zaxo (Archbishop) on May 28, 2003 at 12:54 UTC
Concentrating on "basically i want to search and trim (0+ spaces);(0+chars) AFTER the last >" as the actual requirement: `substr( $line, rindex( $line, '>')) =~ s/\s*;\w*//; # typo corrected, s/:/;/ in the regex` That has a certain amount of magic in it that I should explain. The substr function can be used as an lvalue, meaning that the string in its first argument is modifiable through it. The rindex function finds the last '>' in $line, making substr deal with only the portion of $line from that position onward. Effectively, the substitution is restricted to the part of $line that you specified.
After Compline, Zaxo
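A self-contained sketch of the same idea, using the sample line from the benchmark posts in this thread. The guard against a missing '>' is my own addition, not part of the one-liner above: `rindex` returns -1 when nothing is found, and `substr( $line, -1 )` would then address the last character instead.

```perl
use strict;
use warnings;

# Sample line borrowed from the benchmark posts in this thread.
my $line = '<<HTML>;nbsp dont_strip_me</HTML>> <xyzfdgfghgf> ;strip_me';

# Offset of the last '>' in the string, or -1 if there is none.
my $pos = rindex( $line, '>' );

# substr( $line, $pos ) is an lvalue covering everything from $pos onward,
# so the substitution can only touch the tail of $line.
substr( $line, $pos ) =~ s/\s*;\w*// if $pos >= 0;

print "$line\n";
```

The earlier `;nbsp` is left alone because it sits before the last '>', outside the window that `substr` exposes.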
Re: parsing question by TomDLux (Vicar) on May 28, 2003 at 12:58 UTC
What is generating this data? The first has a single HTML tag, with no closing tag, while the second has opening and closing tags. The first has a non-standard tag, while the second has valid HTML tags. The second has the chunk enclosed in angle brackets. Is that what makes it acceptable? Does it matter that your tags are only acceptable in HTML 4? XHTML requires lower case tags. Since HTML documents are not line-oriented, breaks can occur anywhere, or many components can be on one line. Is that relevant to your document?
Re: parsing question by Wonko the sane (Deacon) on May 28, 2003 at 13:49 UTC
I like kilinrax's use of reverse; I have never seen that trick before. Without knowing that, I would have suggested a capturing regex, sort of a modification of the greedy suggestion. It benchmarks the fastest of the three.

    #!/usr/local/bin/perl
    use strict;
    use Benchmark;

    my $string = "<<HTML>;nbsp dont_strip_me</HTML>> <xyzfdgfghgf> ;strip_me";

    sub reversed {
        my $reverse = reverse(shift);
        $reverse =~ s| \w* ; \s* > |>|x;
        return scalar reverse $reverse;
    }

    sub greedy {
        my $line = shift;
        $line =~ s|^ (.*>) \s* ; \w* |$1|x;
        return $line;
    }

    sub capture {
        my $line = shift;
        return $line =~ /^(.+>)/;
    }

    print "Reversed: ", reversed($string), "\n";
    print "Greedy: ", greedy($string), "\n";
    print "Capture: ", capture($string), "\n";

    timethese( -10, {
        reversed => sub { reversed( $string ) },
        greedy   => sub { greedy( $string ) },
        capture  => sub { capture( $string ) },
    } );

Output:

    :!./test.pl
    Reversed: <<HTML>;nbsp dont_strip_me</HTML>> <xyzfdgfghgf>
    Greedy: <<HTML>;nbsp dont_strip_me</HTML>> <xyzfdgfghgf>
    Capture: <<HTML>;nbsp dont_strip_me</HTML>> <xyzfdgfghgf>
    Benchmark: running capture, greedy, reversed, each for at least 10 CPU seconds...
       capture: 10 wallclock secs (10.40 usr + 0.01 sys = 10.41 CPU) @ 53160.52/s (n=553401)
        greedy: 10 wallclock secs (10.52 usr + 0.00 sys = 10.52 CPU) @ 21887.07/s (n=230252)
      reversed: 11 wallclock secs (10.54 usr + 0.01 sys = 10.55 CPU) @ 36366.92/s (n=383671)

Wonko
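A small sketch of what the capturing match returns, in case the context dependence is not obvious. The variable names are mine, and the input is the same sample string as in the benchmark.

```perl
use strict;
use warnings;

my $line = '<<HTML>;nbsp dont_strip_me</HTML>> <xyzfdgfghgf> ;strip_me';

# Greedy .+ runs to the end of the string, then backtracks to the last '>'.
# In list context the match returns the captured groups.
my ($kept) = $line =~ /^(.+>)/;
print "$kept\n";

# In scalar context the same match only reports success or failure.
my $matched = $line =~ /^(.+>)/;
print "matched\n" if $matched;
```

Note that this keeps everything up to the last '>' and drops whatever follows it, so it matches the two substitutions only when the tail is nothing but the part to be stripped.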
