Monks:

I have a solution to a pattern matching problem, but I seek the assistance of the monks in finding a more elegant solution.

The input is a string consisting of substrings which are URLs (starting with either 'http' or 'https') delimited, more or less, by commas. I say "more or less" because some of the URLs may themselves contain commas -- a business requirement I don't pretend to defend. Hence, it would be neither sufficient nor correct to split the input string on every comma. We only want to split on commas which precede 'http'.

$input = q|http://abc.org,http://de,f.org,https://ghi.org|;

We should be able to capture 3 URLs from the above input string:

@captures = ( 'http://abc.org', 'http://de,f.org', 'https://ghi.org', );
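
One way to express "split only on commas which precede 'http'" is a zero-width lookahead inside split. This is a minimal sketch using the sample input above, and it assumes that a comma followed by 'http' never occurs inside a single URL:

```perl
use strict;
use warnings;

my $input = q|http://abc.org,http://de,f.org,https://ghi.org|;

# Split on a comma only when it is immediately followed by 'http'.
# The lookahead (?=http) checks for 'http' without consuming it,
# so each URL keeps its own scheme.
my @captures = split /,(?=http)/, $input;

print "$_\n" for @captures;
```

The comma inside 'http://de,f.org' is followed by 'f', not 'http', so it is left alone.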

To further complicate the problem, we want to set a maximum number of URLs to be captured from the input string. Any URLs in excess of the maximum (which should be configurable) should not be captured and may be ignored.

Suppose we are to capture only the first two of the URLs in the input string above. In that case, our results would be:

@captures = ( 'http://abc.org', 'http://de,f.org', );
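
The maximum could be handled with the LIMIT argument of split. The sketch below uses a hypothetical helper name, first_urls, and again assumes that a comma followed by 'http' always marks a boundary between URLs:

```perl
use strict;
use warnings;

# Hypothetical helper: returns a reference to an array holding
# at most $max URLs taken from the front of $input.
sub first_urls {
    my ($input, $max) = @_;

    # A LIMIT of $max + 1 leaves any excess URLs unsplit in the final field.
    my @urls = split /,(?=http)/, $input, $max + 1;

    # Discard that final field when there were more than $max URLs.
    $#urls = $max - 1 if @urls > $max;

    return \@urls;
}

my $input = q|http://abc.org,http://de,f.org,https://ghi.org|;
print "$_\n" for @{ first_urls($input, 2) };
```

Asking split for one field more than the maximum means the excess URLs are never split individually; they sit together in the last field, which is then dropped.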

I spent several hours on this problem today. I was hoping that I could write one regular expression which would be applied just once to the input string and return the correct URLs. I did not succeed in that, but I managed to write a pattern which, repeatedly applied to the input string within a subroutine, gave me the intended results. The test file which follows works for me -- but can anyone suggest a simpler solution?

use strict;
use warnings;
use 5.010_001;
use Test::More;

my @raw_inputs = (
    'http://abc.org',
    'http://de,f.org',
    'https://ghi.org',
    'http://jkl.org',
);

# Build inputs holding 1, 2, 3 and 4 comma-joined URLs.
my @inputs = ( $raw_inputs[0] );
for my $q (1..3) {
    push @inputs, join(',' => @raw_inputs[0..$q]);
}

is_deeply(
    _recognize_limited_urls($inputs[0], 3),
    [ $raw_inputs[0] ],
    "1 URL",
);
is_deeply(
    _recognize_limited_urls($inputs[1], 3),
    [ @raw_inputs[0..1] ],
    "2 URLs (one containing a comma)",
);
is_deeply(
    _recognize_limited_urls($inputs[2], 3),
    [ @raw_inputs[0..2] ],
    "3 URLs (one containing a comma)",
);
is_deeply(
    _recognize_limited_urls($inputs[3], 3),
    [ @raw_inputs[0..2] ],
    "Still only 3 URLs (one containing a comma); reject those over max",
);

done_testing();

sub _recognize_limited_urls {
    my ($input, $max) = @_;
    my $str = $input;
    # $1 is the leading URL; $2 is everything after the next ',http' boundary.
    my $pattern = qr/^(http.*?)(?:,(http.*?))?$/;
    my $count = 0;
    my @captures = ();
    LOOP: while ($count < $max) {
        my ($capture, $balance);
        if ($str and $str =~ m/$pattern/) {
            ($capture, $balance) = ($1, $2);
            push @captures, $capture;
            $str = $balance;    # undef once the last URL has been consumed
            $count++;
        }
        else {
            last LOOP;
        }
    }
    return \@captures;
}

Thank you very much.

Jim Keenan


In reply to Capturing substrings with complex delimiter, up to a maximum by jkeenan1
