Regex help

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

http://www.chi.com/business/world/us-microsoft-reorganization.story
http://www.chi.com/business/sns-rt-us-markets-stocks.story
http://www.chi.com/business/sns-rt-us-bank-capital-us.story
http://www.chi.com/business/sns-rt-us-bank-capital.story

Please tell me how to select only the first two path levels from the URLs above. My attempt so far: `^https?://www.chi.com[^/]*/([A-z0-9]*/?[A-z0-9]*).*` It should capture business/world or business, and it should not match any segment that contains "-".
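One problem with the attempt above is that `[A-z]` covers every ASCII character from `A` to `z`, which also includes `[`, `\`, `]`, `^`, `_` and the backtick. The following is only a sketch of a stricter pattern, assuming the host is always www.chi.com and that a valid level consists solely of ASCII letters and digits. The names `$two_levels` and `leading_levels` are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical pattern: capture one or two leading path segments that
# consist only of letters and digits.
my $two_levels = qr{
    ^https?://www\.chi\.com/     # scheme and host, with the dots escaped
    ( [A-Za-z0-9]+               # first path segment
      (?: / [A-Za-z0-9]+ )?      # optional second segment
    )
    (?= / | \z )                 # the last captured segment must end here
}x;

# Hypothetical helper: return the captured levels, or undef if no match.
sub leading_levels {
    my ($url) = @_;
    return $url =~ $two_levels ? $1 : undef;
}

for my $url (qw(
    http://www.chi.com/business/world/us-microsoft-reorganization.story
    http://www.chi.com/business/sns-rt-us-markets-stocks.story
    http://www.chi.com/business/sns-rt-us-bank-capital-us.story
    http://www.chi.com/business/sns-rt-us-bank-capital.story
)) {
    print leading_levels($url) // '(no match)', "\n";
}
```

The lookahead is what keeps a hyphenated segment out of the capture: without it, the optional second group would happily match the `sns` in `sns-rt-us-markets-stocks.story`.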


Replies are listed 'Best First'.
Re: Regex help by daxim (Curate) on Jul 12, 2013 at 10:35 UTC
Don't use regex when a specialised parsing module exists.
`use 5.010;
use URI;
use List::MoreUtils qw(first_index);
for my $uri (qw(
    http://www.chi.com/business/world/us-microsoft-reorganization.story
    http://www.chi.com/business/sns-rt-us-markets-stocks.story
    http://www.chi.com/business/sns-rt-us-bank-capital-us.story
    http://www.chi.com/business/sns-rt-us-bank-capital.story
)) {
    say $uri;
    # drop empty segments (an absolute path yields a leading empty one)
    my @segments = grep { $_ } URI->new($uri)->path_segments;
    # print every segment before the first one that contains a hyphen;
    # note that first_index returns -1 when no segment contains one
    say for splice @segments, 0, (first_index { /-/ } @segments);
}`
Re: Regex help by rjt (Curate) on Jul 12, 2013 at 10:32 UTC
"Please tell me how to select only two levels from the above url's"
Be careful with this; many sites' terms of service prohibit web scraping. In short, though, there's no need to write a regexp that conforms to the URI specification, which is rather more complicated than you might think. Just use the URI module.
Re: Regex help by Monk::Thomas (Friar) on Jul 12, 2013 at 15:10 UTC
Disclaimer: I don't recommend manual URL parsing at all, since URLs can get quite hairy. That said, why use a regexp at all? 'split' may be more appropriate, depending on how you want to process the URL parts later on. `my @url = split / \/+ /xms, $url;` This gives:
$url[0] = http:
$url[1] = www.chi.com
$url[2] = business
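To get from the split result to what the question asks for, one could drop the scheme and host and then collect segments until a hyphenated one turns up. This is a rough sketch only, assuming every URL has a scheme and a host and carries no query string or fragment. The helper name `leading_segments` and its `$max` parameter are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the leading path segments of a URL, stopping
# at the first segment that contains a hyphen, and keeping at most $max.
sub leading_segments {
    my ($url, $max) = @_;
    my @parts = split m{/+}, $url;   # "http:", host, then the path segments
    splice @parts, 0, 2;             # drop the scheme and the host
    my @keep;
    for my $part (@parts) {
        last if @keep >= $max || $part =~ /-/;
        push @keep, $part;
    }
    return @keep;
}

for my $url (qw(
    http://www.chi.com/business/world/us-microsoft-reorganization.story
    http://www.chi.com/business/sns-rt-us-markets-stocks.story
)) {
    print join('/', leading_segments($url, 2)), "\n";
}
```

For the two sample URLs this prints business/world and business. Anything less regular than these URLs is better handed to the URI module.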