Re: //s modifier

In the example you give, a regular expression will probably do what you want, because it is very unlikely that a document will contain two TITLE elements. In a slightly different case, though, say you were looking for certain text inside a CAPTION element, the regular expression that works for your example can produce a false match: if the text occurs between two CAPTION elements but not within either of them, the pattern can match from the opening tag of the first to the closing tag of the second. It is possible to work around that with a much more complicated regular expression, but it's hairy, and it will still fail if the element can be nested within itself, either directly or indirectly. In such cases, you really need a module that parses the markup and hands you a tree (a DOM or something like it). HTML::TreeBuilder and XML::Twig make this sort of thing easy for HTML and XML respectively, and there are various alternatives to them as well. I don't know as much about SGML modules, since I've never worked much with SGML (except for legacy versions of HTML that were SGML-based), but you might check the CPAN.
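
To make the CAPTION case concrete, here is a rough sketch using only core Perl. The sample markup, the search text "foo", and the sub names are all made up for illustration; they are not from the original question. The first sub uses the obvious //s pattern; the second is one possible version of the "much more complicated" workaround, which refuses to let the match step over any CAPTION tag. As noted above, it still does not cope with nested elements.

```perl
use strict;
use warnings;

# Hypothetical input: "foo" sits between two CAPTION elements, inside neither.
my $html = <<'END';
<table><caption>First table</caption></table>
<p>foo appears here, outside any caption</p>
<table><caption>Second table</caption></table>
END

# Naive pattern: with /s, .*? is free to run past </caption>
# and on into the next CAPTION element.
sub naive_match {
    my ($text) = @_;
    return $text =~ m{<caption>.*?foo.*?</caption>}is ? 1 : 0;
}

# Tempered pattern: every character consumed must not be the start of an
# opening or closing CAPTION tag, so the match stays inside one element.
sub tempered_match {
    my ($text) = @_;
    my $inside = qr{(?:(?!</?caption\b).)*?}is;
    return $text =~ m{<caption\b[^>]*>${inside}foo${inside}</caption>}is ? 1 : 0;
}

print "naive: ", naive_match($html), ", tempered: ", tempered_match($html), "\n";
```

On this input the naive pattern reports a match even though "foo" is in neither caption, while the tempered one does not. Once nesting or attributes containing ">" come into play, a parser is the safer route.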

Of course, if the example you gave is really all you want to do, then you may not need a parser, since the regex will probably be good enough.

Sanity? Oh, yeah, I've got all kinds of sanity. In fact, I've developed whole new kinds of sanity. Why, I've got so much sanity it's driving me crazy.

Re^2: //s modifier by kettle (Beadle) on Mar 22, 2006 at 04:59 UTC
The problem is actually considerably more complex than the example I gave. I decided I'll have to use an SGML parser, as you and the previous poster suggested. Thanks for the regex help and the SGML suggestions! joe