jkeenan1 has asked for the wisdom of the Perl Monks concerning the following question:
I have a solution to a pattern matching problem, but I seek the assistance of the monks in finding a more elegant solution.
The input is a string consisting of substrings which are URLs (starting with either 'http' or 'https'), delimited, more or less, by commas. I say "more or less" because some of the URLs may themselves contain commas -- a business requirement I don't pretend to defend. Hence, it would be neither sufficient nor correct to split the input string on every comma. We only want to split on commas which precede 'http'.
$input = q|http://abc.org,http://de,f.org,https://ghi.org|;
We should be able to capture 3 URLs from the above input string:
@captures = ( 'http://abc.org', 'http://de,f.org', 'https://ghi.org', );
To further complicate the problem, we want to set a maximum number of URLs to be captured from the input string. Any URLs in excess of the maximum (which should be configurable) should not be captured and may be ignored.
Suppose we are to capture only the first two of the URLs in the input string above. In that case, our results would be:
@captures = ( 'http://abc.org', 'http://de,f.org', );
I spent several hours on this problem today. I was hoping that I could write one regular expression which would be applied just once to the input string and return the correct URLs. I did not succeed in that, but I managed to write a pattern which, repeatedly applied to the input string within a subroutine, gave me the intended results. The test file which follows works for me -- but can anyone suggest a simpler solution?
use strict;
use warnings;
use 5.010_001;
use Test::More;

my @raw_inputs = (
    'http://abc.org',
    'http://de,f.org',
    'https://ghi.org',
    'http://jkl.org',
);
my @inputs = ( $raw_inputs[0] );
for my $q (1..3) {
    push @inputs, join(',' => @raw_inputs[0..$q]);
}

is_deeply(
    _recognize_limited_urls($inputs[0], 3),
    [ $raw_inputs[0] ],
    "1 URL",
);
is_deeply(
    _recognize_limited_urls($inputs[1], 3),
    [ @raw_inputs[0..1] ],
    "2 URLs (one containing a comma)",
);
is_deeply(
    _recognize_limited_urls($inputs[2], 3),
    [ @raw_inputs[0..2] ],
    "3 URLs (one containing a comma)",
);
is_deeply(
    _recognize_limited_urls($inputs[3], 3),
    [ @raw_inputs[0..2] ],
    "Still only 3 URLs (one containing a comma); reject those over max",
);
done_testing();

sub _recognize_limited_urls {
    my ($input, $max) = @_;
    my $str = $input;
    # Capture the first URL in $1 and everything after the next ',http' in $2.
    my $pattern = qr/^(http.*?)(?:,(http.*?))?$/;
    my $count = 0;
    my @captures = ();
    LOOP: while ($count < $max) {
        my ($capture, $balance);
        if ($str and $str =~ m/$pattern/) {
            ($capture, $balance) = ($1, $2);
            push @captures, $capture;
            $str = $balance;
            $count++;
        }
        else {
            last LOOP;
        }
    }
    return \@captures;
}
Thank you very much.
Jim Keenan
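For comparison, the rule stated above (split only on commas which precede 'http', then keep at most a configurable number of URLs) might be sketched with a lookahead in split. This is a minimal sketch, not a tested replacement for the subroutine above. The name split_limited_urls is hypothetical, and the sketch assumes that a comma inside a URL is never immediately followed by 'http://' or 'https://'.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original test file).
# Assumption: a comma that belongs to a URL is never directly
# followed by 'http://' or 'https://'.
sub split_limited_urls {
    my ($input, $max) = @_;
    return [] unless defined $input && length $input && $max > 0;

    # The lookahead (?=...) makes split consume only the comma,
    # leaving the 'http' that follows at the start of the next piece.
    my @urls = split m{,(?=https?://)}, $input;

    # Discard anything beyond the configured maximum.
    splice(@urls, $max) if @urls > $max;
    return \@urls;
}

my $input = q|http://abc.org,http://de,f.org,https://ghi.org|;
print "$_\n" for @{ split_limited_urls($input, 2) };
# prints:
# http://abc.org
# http://de,f.org
```

Unlike the looping pattern above, this version does not require the input to begin with 'http', so whether it is an acceptable substitute depends on how malformed input should be handled.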