It depends how important it is that you get every link, I suppose. Using the parser has the great advantage that you don't have to worry about anything to do with spacing, attribute order or case, so you won't miss any links. It's very efficient, too, and most of all it lets you make decisions in Perl rather than in (?:^\*.^\)
In your regex, for example, there's nothing to detect a </noframes> tag, and it's only convention that puts the noframes content at the end of the page. Adding the test won't be pretty.
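If it helps, here's the sort of thing I mean. It's only a rough sketch with HTML::TokeParser, assuming the document is already sitting in $page (as in your code); the variable names are just for illustration:

use strict;
use warnings;
use HTML::TokeParser;

my $p = HTML::TokeParser->new(\$page) or die "Can't parse page";

my @links;
my $in_noframes = 0;

# Walk only the tags we care about: <noframes>, </noframes> and <a>.
while (my $tag = $p->get_tag('noframes', '/noframes', 'a')) {
    if    ($tag->[0] eq 'noframes')  { $in_noframes = 1 }
    elsif ($tag->[0] eq '/noframes') { last }             # stop at </noframes>, wherever it is
    elsif ($in_noframes) {                                # an <a> inside <noframes>
        my $href = $tag->[1]{href} or next;
        next unless $href =~ m{^http://([^/]+)}i;
        my $host = $1;
        # The "decide it in Perl" part: drop unwanted hosts at any subdomain depth.
        next if $host =~ /(?:^|\.)(?:netscape|microsoft)\.com$/i;
        push @links, $href;
    }
}

print "$_\n" for @links;

Spacing, attribute order and case all come out in the wash, and the host check is plain Perl you can extend however you like.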
But if you want something quick and dirty that gets most of the links and mostly the right ones, then I suppose you could use regexes. Some pointers:
[^netscape\.com|^microsoft\.com]
should probably be done with zero-width negative lookaheads:
(?!netscape\.com)(?!microsoft\.com)
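Each lookahead peeks at the text ahead without consuming anything, so both checks apply at the same spot. A throwaway test, with made-up host names:

for my $host (qw(netscape.com microsoft.com example.com)) {
    print "$host: ",
        $host =~ /^(?!netscape\.com)(?!microsoft\.com)/ ? "kept" : "skipped",
        "\n";
}
# netscape.com: skipped
# microsoft.com: skipped
# example.com: kept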
And this:
http://(?:.*?)
apart from being rather inefficient (consider (?:\w+\.)? instead), is no respecter of levels: in the case of http://microsoft.com the .*? eats the microsoft part before you get a chance to check whether it's there. I'm not sure there's much you can do about that inside the pattern, except to assume that subdomains have short names, or to repeat the test after the subdomain part, which is what this does, in a slapdash way:
$page =~ m%<noframes>.*?<a href *= *"(http://(?!netscape\.com)(?!microsoft\.com)(?:\w+\.)(?!netscape\.com)(?!microsoft\.com)[^"]+)"%is;
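For what it's worth, it does fish out the right link from a made-up fragment like this one (the URLs are invented for the test):

my $page = <<'HTML';
<frameset cols="*"> ... </frameset>
<noframes>
<p>This site needs frames. Get
<a href="http://www.microsoft.com/ie/">IE</a> or
<a href="http://home.netscape.com/">Navigator</a>,
or start <a href="http://www.example.com/start">here</a> instead.</p>
</noframes>
HTML

print "$1\n"
    if $page =~ m%<noframes>.*?<a href *= *"(http://(?!netscape\.com)(?!microsoft\.com)(?:\w+\.)(?!netscape\.com)(?!microsoft\.com)[^"]+)"%is;
# prints http://www.example.com/start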
But I dread to think how much effort it puts in before discarding the first match, and it's not anything I'd want to try to read. I don't think I would try to pack all the logic into one line.
And I'd still urge you to try the parser. That code can be made more compact, if you prefer, and it'll save you a lot of grief.