Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

http://www.chi.com/business/world/us-microsoft-reorganization.story
http://www.chi.com/business/sns-rt-us-markets-stocks.story
http://www.chi.com/business/sns-rt-us-bank-capital-us.story
http://www.chi.com/business/sns-rt-us-bank-capital.story

Please tell me how to select only the first two path levels from the URLs above. My attempt so far:

^https?://www.chi.com[^/]*/([A-z0-9]*/?[A-z0-9]*).*

It should capture business/world or business, and it should not match any path segment that contains "-".

Replies are listed 'Best First'.
Re: Regex help
by daxim (Curate) on Jul 12, 2013 at 10:35 UTC
    Don't use regex when a specialised parsing module exists.
    use 5.010;
    use URI;
    use List::MoreUtils qw(first_index);

    for my $uri (qw(
        http://www.chi.com/business/world/us-microsoft-reorganization.story
        http://www.chi.com/business/sns-rt-us-markets-stocks.story
        http://www.chi.com/business/sns-rt-us-bank-capital-us.story
        http://www.chi.com/business/sns-rt-us-bank-capital.story
    )) {
        say $uri;
        # path_segments yields an empty leading segment for absolute paths; grep drops it
        my @segments = grep { $_ } URI->new($uri)->path_segments;
        # print every segment before the first one containing '-'
        # (first_index returns -1 when nothing matches, so this assumes such a segment exists)
        say for splice @segments, 0, (first_index { /-/ } @segments);
    }
Re: Regex help
by rjt (Curate) on Jul 12, 2013 at 10:32 UTC
    Please tell me how to select only two levels from the above url's

    Be careful with this; many sites' terms of service prohibit web scraping. In short, though, there's no need to try to write a regexp to conform to the (rather more complicated than you might think) URI specification. Just use URI.

Re: Regex help
by Monk::Thomas (Friar) on Jul 12, 2013 at 15:10 UTC

    Disclaimer: I don't recommend manual URL parsing at all, since URLs can get quite hairy.

    Don't use a regexp at all? 'split' may or may not be more appropriate. It depends on how you want to process the URL parts later on.

    my @url = split / \/+ /xms, $url;
    # For the first URL above, this gives:
    #   $url[0] eq 'http:'
    #   $url[1] eq 'www.chi.com'
    #   $url[2] eq 'business'
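
    To tie the split approach back to the original question, here is a minimal sketch that keeps at most two leading path segments and stops at the first one containing "-". The helper name leading_segments and its $max parameter are made up for illustration, and the sketch assumes plain http URLs like the ones above, with no query string or fragment. For anything messier, URI is still the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the leading path
# segments of a URL, stopping before the first segment that contains '-'
# and keeping at most $max segments.
sub leading_segments {
    my ($url, $max) = @_;
    my @parts = split m{/+}, $url;   # 'http:', host, then the path segments
    splice @parts, 0, 2;             # drop the scheme and host entries
    my @keep;
    for my $part (@parts) {
        last if $part =~ /-/ || @keep >= $max;
        push @keep, $part;
    }
    return @keep;
}

for my $url (qw(
    http://www.chi.com/business/world/us-microsoft-reorganization.story
    http://www.chi.com/business/sns-rt-us-markets-stocks.story
)) {
    print join('/', leading_segments($url, 2)), "\n";
}
```

    For the two sample URLs this prints business/world and business.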