jkeenan1 has asked for the wisdom of the Perl Monks concerning the following question:

Monks:

I have a solution to a pattern matching problem, but I seek the assistance of the monks in finding a more elegant solution.

The input is a string consisting of substrings which are URLs (starting with either 'http' or 'https'), delimited, more or less, by commas. I say "more or less" because some of the URLs may themselves contain commas -- a business requirement I don't pretend to defend. Hence, it would be neither sufficient nor correct to split the input string on every comma. We only want to split on commas which immediately precede 'http'.

$input = q|http://abc.org,http://de,f.org,https://ghi.org|;

We should be able to capture 3 URLs from the above input string:

@captures = ( 'http://abc.org', 'http://de,f.org', 'https://ghi.org', );

To further complicate the problem, we want to set a maximum number of URLs to be captured from the input string. Any URLs in excess of the maximum (which should be configurable) should not be captured and may be ignored.

Suppose we are to capture only the first two of the URLs in the input string above. In that case, our results would be:

@captures = ( 'http://abc.org', 'http://de,f.org', );

I spent several hours on this problem today. I was hoping that I could write one regular expression which would be applied just once to the input string and return the correct URLs. I did not succeed in that, but I managed to write a pattern which, repeatedly applied to the input string within a subroutine, gave me the intended results. The test file which follows works for me -- but can anyone suggest a simpler solution?

use strict;
use warnings;
use 5.010_001;
use Test::More;

my @raw_inputs = (
    'http://abc.org',
    'http://de,f.org',
    'https://ghi.org',
    'http://jkl.org',
);
my @inputs = ( $raw_inputs[0] );
for my $q (1..3) {
    push @inputs, join(',' => @raw_inputs[0..$q]);
}

is_deeply( _recognize_limited_urls($inputs[0], 3),
    [ $raw_inputs[0] ],
    "1 URL",
);
is_deeply( _recognize_limited_urls($inputs[1], 3),
    [ @raw_inputs[0..1] ],
    "2 URLs (one containing a comma)",
);
is_deeply( _recognize_limited_urls($inputs[2], 3),
    [ @raw_inputs[0..2] ],
    "3 URLs (one containing a comma)",
);
is_deeply( _recognize_limited_urls($inputs[3], 3),
    [ @raw_inputs[0..2] ],
    "Still only 3 URLs (one containing a comma); reject those over max",
);
done_testing();

sub _recognize_limited_urls {
    my ($input, $max) = @_;
    my $str = $input;
    my $pattern = qr/^(http.*?)(?:,(http.*?))?$/;
    my $count = 0;
    my @captures = ();
    LOOP: while ($count < $max) {
        my ($capture, $balance);
        if ($str and $str =~ m/$pattern/) {
            ($capture, $balance) = ($1, $2);
            push @captures, $capture;
            $str = $balance;
            $count++;
        }
        else {
            last LOOP;
        }
    }
    return \@captures;
}

Thank you very much.

Jim Keenan

Re: Capturing substrings with complex delimiter, up to a maximum
by BrowserUk (Patriarch) on Oct 29, 2013 at 23:49 UTC

    Unless you have a particular unstated reason for not doing so, it is probably more efficient to simply parse out all the urls and then discard those you don't want.

    Try this:

    C:\test>p1
    sub getNurls {
        my( $s, $n ) = @_;
        my @urls = $s =~ m[(https?://.+?)(?:,(?=http)|$)]g;
        return @urls[ 0 .. $n-1 ];
    };;

    @raw_inputs = (
        'http://abc.org', 'http://de,f.org', 'https://ghi.org', 'http://jkl.org',
    );;

    $s = join ',', @raw_inputs;;

    print for getNurls( $s, 4 );;
    http://abc.org
    http://de,f.org
    https://ghi.org
    http://jkl.org

    print for getNurls( $s, 3 );;
    http://abc.org
    http://de,f.org
    https://ghi.org

    print for getNurls( $s, 2 );;
    http://abc.org
    http://de,f.org

    print for getNurls( $s, 1 );;
    http://abc.org

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Thanks for your rapid response. I tried to adapt your suggestion to a subroutine which I could easily drop into my test file. I found that at first it did not pass all tests.

      sub _browseruk_recognize_limited_urls {
          my ($input, $max) = @_;
          my @captures = $input =~ m[(https?://.+?)(?:,(?=http)|$)]g;
          return [ @captures[ 0 .. $max-1 ] ];
      }
      Results:
      capture.t ..
      not ok 1 - 1 URL
      #   Failed test '1 URL'
      #   at capture.t line 18.
      not ok 2 - 2 URLs (one containing a comma)
      ok 3 - 3 URLs (one containing a comma)
      ok 4 - Still only 3 URLs (one containing a comma); reject those over max
      1..4
      # Structures begin differing at:
      #     $got->[1] = undef
      #     $expected->[1] = Does not exist
      #   Failed test '2 URLs (one containing a comma)'
      #   at capture.t line 23.
      # Structures begin differing at:
      #     $got->[2] = undef
      #     $expected->[2] = Does not exist
      # Looks like you failed 2 tests of 4.
      Dubious, test returned 2 (wstat 512, 0x200)
      Failed 2/4 subtests

      Test Summary Report
      -------------------
      capture.t (Wstat: 512 Tests: 4 Failed: 2)
        Failed tests:  1-2
        Non-zero exit status: 2
      Files=1, Tests=4,  1 wallclock secs ( 0.13 usr  0.04 sys +  0.11 cusr  0.05 csys =  0.33 CPU)
      Result: FAIL
      shell returned 1

      However, when I grepped for definedness ...

      sub _browseruk_recognize_limited_urls {
          my ($input, $max) = @_;
          my @captures = $input =~ m[(https?://.+?)(?:,(?=http)|$)]g;
          return [ grep { defined($_) } @captures[ 0 .. $max-1 ] ];
      }
      ... all tests passed.

      Thank you very much.

      Jim Keenan

        The regex will (can only) return valid matches. It cannot return undef. You should not have to use grep.

        Thus, it is either your adaptation of the code, or your test, that is wrong.

        The only way you can generate undefs with your first implementation is if $max is greater than the number of URLs within the string, in which case the slice @captures[ 0 .. $max-1 ] will generate undefs.
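        To illustrate (untested sketch; the values are made up):

        # Slicing past the end of an array yields undef for the missing
        # positions -- which is exactly where the failing tests' undefs
        # came from.
        my @captures = ( 'http://abc.org' );         # only 1 match found
        my $max      = 3;
        my @sliced   = @captures[ 0 .. $max - 1 ];   # slice runs past the end
        print defined $_ ? $_ : '(undef)', "\n" for @sliced;
        # http://abc.org
        # (undef)
        # (undef)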

        Instead of grepping out the extraneous undefs, don't generate them in the first place:

        sub _browseruk_recognize_limited_urls {
            my ($input, $max) = @_;
            my @captures = $input =~ m[(https?://.+?)(?:,(?=http)|$)]g;
            ## Parenthesize the ternary: '..' binds tighter than '?:'
            return [ @captures[ 0 .. (@captures < $max ? $#captures : $max-1) ] ];
        }

        But even that is rather silly. You have an array but you want to return an array ref (for some reason?), so you slice the array into a list, wrap it in another (anonymous) array and return a reference to that.

        If the reason for returning a reference is "efficiency", you completely blew any potential gain by slicing and listing (never mind the redundant grepping). Better to simply return the list and assign it to an array in the caller.
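        In other words, something like this (untested sketch):

        # Return a plain list; the caller decides how to store it.
        sub getNurls {
            my ($input, $max) = @_;
            my @captures = $input =~ m[(https?://.+?)(?:,(?=http)|$)]g;
            $#captures = $max - 1 if @captures > $max;   # trim in place
            return @captures;                            # list, not a ref
        }

        my @urls = getNurls( $input, 2 );   # no slice, no grep, no reference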

        But, if you really need a reference, then adjust the size of the array you already have and then return a reference to that:

        sub _browseruk_recognize_limited_urls {
            my ($input, $max) = @_;
            my @captures = $input =~ m[(https?://.+?)(?:,(?=http)|$)]g;
            $#captures = $max-1 if @captures >= $max;   ## Adjust size if necessary
            return \@captures;
        }

Re: Capturing substrings with complex delimiter, up to a maximum
by LanX (Saint) on Oct 30, 2013 at 00:22 UTC
    Combine split with a limit and a lookahead assertion!

    DB<107> $str = join ",", @captures
    => "http://abc.org,http://de,f.org,https://ghi.org"

    DB<108> split /,(?=https?:)/, $str, 1
    => "http://abc.org,http://de,f.org,https://ghi.org"

    DB<109> split /,(?=https?:)/, $str, 2
    => ("http://abc.org", "http://de,f.org,https://ghi.org")

    DB<110> split /,(?=https?:)/, $str, 3
    => ("http://abc.org", "http://de,f.org", "https://ghi.org")

    you just need to get rid of the unsplit rest (if any)

    DB<122> $n=1
    => 1

    DB<123> @array[0..$n-1] = split /,(?=https?:)/ , $str , $n+1
    => "http://abc.org"

    DB<124> $n=2
    => 2

    DB<125> @array[0..$n-1] = split /,(?=https?:)/ , $str , $n+1
    => ("http://abc.org", "http://de,f.org")

    Cheers Rolf

    ( addicted to the Perl Programming Language)

      Thank you very much for your rapid response. As was the case with BrowserUK's response, I tried to adapt your suggestion to a subroutine I could easily drop into my test program.

      sub _rolf_recognize_limited_urls {
          my ($input, $max) = @_;
          my @captures = ();
          @captures[0..$max-1] = split /,(?=https?:)/, $input, $max+1;
          return [ @captures ];
      }

      As was the case with BrowserUK's formulation, the first two of four tests failed. I only got all four tests to pass when I grepped for definedness:

      sub _rolf_recognize_limited_urls {
          my ($input, $max) = @_;
          my @captures = ();
          @captures[0..$max-1] = split /,(?=https?:)/, $input, $max+1;
          return [ grep { defined($_) } @captures ];
      }

      Thank you very much.

      Jim Keenan

Re: Capturing substrings with complex delimiter, up to a maximum
by BillKSmith (Monsignor) on Oct 30, 2013 at 03:46 UTC
    Remember that split takes a regex as its first argument. Use a lookahead assertion after the comma.
    use strict;
    use warnings;
    use Data::Dumper;

    my $input = q|http://abc.org,http://de,f.org,https://ghi.org,http://jkl.org|;
    my @urls = (split /,(?=http)/, $input, 4)[0..2];
    print Dumper \@urls;
    Output:
    $VAR1 = [
              'http://abc.org',
              'http://de,f.org',
              'https://ghi.org'
            ];
    Bill
Re: Capturing substrings with complex delimiter, up to a maximum (fafafa)
by Anonymous Monk on Oct 30, 2013 at 00:10 UTC
    :)
    #!/usr/bin/perl --
    use strict;
    use warnings;
    use Data::Dump qw/ dd /;

    my $better = q{fa fa fa fa fa fa fa fa fa fa better};
    my $max = 3;
    my @fa = grep { length } split qr{(fa\S*?\s)}, $better, $max;
    dd( \@fa );

    __END__
    ["fa ", "fa ", "fa fa fa fa fa fa fa fa better"]
Re: Capturing substrings with complex delimiter, up to a maximum
by hdb (Monsignor) on Oct 30, 2013 at 07:39 UTC

    My proposal is to first split and capture on http:// or https:// (with an optional comma beforehand) and then put the pieces back together:

    sub getNurls {
        my( $s, $n ) = @_;
        my @urls = split /,?(https?:\/\/)/, $s;
        return map { defined $urls[2*$_-1] ? $urls[2*$_-1].$urls[2*$_] : () } (1..$n);
    }

    my $max = 2;
    my $input = q|http://abc.org,http://de,f.org,https://ghi.org|;
    print "$_\n" for getNurls( $input, $max );

    (borrowing notation from prior posts)
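    To see why the map pairs up indices 2*$_-1 and 2*$_: split with a capturing group returns the captured separators interleaved with the fields, plus a leading empty field here (untested illustration):

    use Data::Dumper;
    # The captured 'http://' separators land between the fields.
    my @urls = split /,?(https?:\/\/)/, 'http://abc.org,http://de,f.org';
    print Dumper \@urls;
    # $VAR1 = [ '', 'http://', 'abc.org', 'http://', 'de,f.org' ];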

      But what if one of the URLs contains "http://"? For example...

      http://translator.example.com/?url=http://www.example.fr/

      Rather than an optional comma, an improvement would be to look for a comma or \A:

      /(?:\A|,)(https?:\/\/)/
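      Dropped into the sub above, that becomes (untested sketch, reusing hdb's reassembly):

      sub getNurls {
          my( $s, $n ) = @_;
          # Only split where the scheme follows start-of-string or a comma,
          # so an embedded "http://" does not start a new field.
          my @urls = split /(?:\A|,)(https?:\/\/)/, $s;
          return map { defined $urls[2*$_-1] ? $urls[2*$_-1].$urls[2*$_] : () } (1..$n);
      }

      my $input = q|http://translator.example.com/?url=http://www.example.fr/,http://abc.org|;
      print "$_\n" for getNurls( $input, 2 );
      # http://translator.example.com/?url=http://www.example.fr/
      # http://abc.org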
      use Moops; class Cow :rw { has name => (default => 'Ermintrude') }; say Cow->new->name

        That would be a problem. But can we rule out that a URL contains ",http://"?
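        If we can't, even the lookahead split mis-fires (untested; hypothetical URL):

        # A URL carrying another URL after a comma in its query string
        # defeats every comma-based rule, lookahead or not.
        my $tricky = 'http://agg.example.com/?urls=http://a.example,http://b.example';
        my @urls = split /,(?=https?:)/, $tricky;   # intended: ONE url
        print "$_\n" for @urls;
        # http://agg.example.com/?urls=http://a.example
        # http://b.example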