
Well, as osunderdog already stated:

Every feature impedes performance

I would also be very keen on features, especially if you have the expertise to apply them and can write more intelligent regexes that take longer but are more accurate.
As we will use this filter in a time-critical application and portability is not an issue, the developer focused on performance.

Updating the filter with a new or additional regex library would cost time and money, so we have to weigh the pros and cons first. But I will at least propose keeping PCRE in mind, and if we hit the limits of the current library we should switch to PCRE or add it.

Out of curiosity, which compilers/platforms are you currently supporting with boost?

Unfortunately I don't know that directly. But I can say that it will run on several x86 machines running SUSE Linux. The communication is implemented in CORBA.

I just reread the documentation and am really surprised, because my version doesn't allow this. Unfortunately I can't ask the developer why this is the case because he is on vacation.

I just checked, and the following expressions produce an error for me:

  • ba{2}
  • (?:abc)
  • (?=abc)
  • (?!abc)
  • (?>expression)
  • .*? and any variation of the quantifier

I think I should first check this out and bug you again once I know more. Sorry for that, as this seems to be an error in our software and has nothing to do with Boost.
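
One way to narrow down which of the constructs listed above an engine really rejects is to try compiling each one in isolation and record the outcome. The sketch below does that with Python's re module, which is only a stand-in here (an assumption for illustration; the filter in question uses Boost.Regex), and the names PATTERNS, compiles and probe are hypothetical. Results from one engine say nothing definite about another, so the output is printed rather than assumed.

```python
import re

# The constructs reported as failing in the post above.
PATTERNS = [
    r"ba{2}",
    r"(?:abc)",
    r"(?=abc)",
    r"(?!abc)",
    r"(?>expression)",  # atomic group; older Python versions reject this one
    r".*?",
]


def compiles(pattern):
    """Return True if this engine accepts the pattern, False if it rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def probe(patterns):
    """Map each pattern to whether the engine accepted it."""
    return {p: compiles(p) for p in patterns}


if __name__ == "__main__":
    for pattern, ok in probe(PATTERNS).items():
        print(f"{pattern!r}: {'accepted' if ok else 'rejected'}")
```

Running the same kind of probe against the library build that the application actually links would show whether the failures come from the library or from the code that wraps it.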

    Re^3: Regex libraries
    by Anonymous Monk on Dec 30, 2004 at 14:49 UTC
      According to the Boost.Regex Regular Expression Syntax, all the constructs you checked are supported. From Boost.Regex Standards Conformance, we learn that the unsupported Perl features are:
      1. \N{name} - but one can use [:name:]
      2. \pP and \PP
      3. (?imsx-imsx)
      4. Lookbehind
      5. (?{ }) and (??{ })
      6. (?(condition)yes-pattern) and (?(condition)yes-pattern|no-pattern)
      That in my opinion isn't too bad. For point 1, there is an alternative, so no loss of functionality. Also note that one reason \N isn't supported is that \N isn't a regex construct - it's a string interpolation thing. PCRE doesn't support \N either.

      I've written a lot of regexes, and seen even more, but I've never had the need to use the constructs of point 2, and I've never seen them used either. In PCRE, support for \p and \P is limited, and only available if PCRE is specially built with Unicode character property support. (PCRE does not have full UTF-8 support).

      Point 3 might be a nuisance, but personally I've never used them to set flags for parts of the expressions - I only use them implicitly when interpolating a qr construct - a feature that cannot be handled by a library. And with Boost, you can set many flags when matching, even more flags than with Perl. Specifically, one can set flags that mimic /m and /s. (Or rather, one needs to set flags to turn the /m and /s behaviours off, if I understand the page correctly). /i can be achieved by first lowercasing what you are matching against. There doesn't seem to be an equivalent to /x, but we were happy without it for years in Perl as well, and /x doesn't provide functionality - just readability.
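
      The remarks about /i and /x can be illustrated with a small sketch. Python's re module is used purely as a stand-in engine (an assumption for illustration, not Boost.Regex or PCRE), and the helper names are hypothetical. It compares a case-insensitive flag with lowercasing the subject first, and a compact pattern with its commented, whitespace-tolerant equivalent.

```python
import re


def match_ci_by_lowercasing(pattern, text):
    """Case-insensitive search without any flag.

    Assumes the pattern itself contains only lowercase literals.
    """
    return re.search(pattern, text.lower()) is not None


def match_ci_by_flag(pattern, text):
    """The same search using the engine's own case-insensitive flag."""
    return re.search(pattern, text, re.IGNORECASE) is not None


# A compact pattern and an extended-syntax version of it.
# The extended form adds whitespace and comments, not matching power.
COMPACT = r"\d{4}-\d{2}-\d{2}"
VERBOSE = re.compile(
    r"""
    \d{4}   # year
    -
    \d{2}   # month
    -
    \d{2}   # day
    """,
    re.VERBOSE,
)
```

      One caveat with the lowercasing approach: anything captured comes back lowercased too, which matters if the matched text is reused afterwards.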

      Lookbehind in Perl regexes is fairly limited anyway, as you can only match against fixed-width strings. Sure, it's a miss, but it's a miss of a limited thing.
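
      To show what the fixed-width restriction looks like in practice, here is a minimal sketch using Python's re module, which happens to impose a similar restriction on lookbehind. It is a stand-in only (an assumption for illustration, not Perl, Boost.Regex or PCRE), and the names FIXED, VARIABLE and lookbehind_accepted are hypothetical.

```python
import re

# Lookbehind over a fixed-width string: exactly "USD " before the digits.
FIXED = r"(?<=USD )\d+"

# Lookbehind whose width varies: "USD" followed by one or more spaces.
VARIABLE = r"(?<=USD +)\d+"


def lookbehind_accepted(pattern):
    """Return True if the engine compiles the pattern, False if it rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


if __name__ == "__main__":
    print("fixed width accepted:   ", lookbehind_accepted(FIXED))
    print("variable width accepted:", lookbehind_accepted(VARIABLE))
    print("match:", re.search(FIXED, "price: USD 42").group())
```

      Where lookbehind is missing altogether, a capture group around the wanted part, as in USD (\d+), often does the same job.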

      Not being able to execute code isn't a limitation of the library - it's a limitation because it's a library. Only if regexes are an integral part of the language is such a thing possible, as it requires access to the variables. PCRE doesn't support (?{...}) and (??{...}) either, although it does have some other features that might do what you want to do with those coding constructs.

      The last point is a miss because you can't use (?(?{...})yes|no), but then you are using code again, and that wouldn't be possible anyway due to the previous argument.

      Note also that the last two points are still marked as highly experimental, and p5p reserves the right to remove or change them without any notice.

      Note that I base this purely on what Boost and PCRE say about themselves on their web and manual pages. I've never used any of the libraries myself.

        Thanks

        That is very interesting to know. It's good to know that presumably it won't affect me. Although, reading the documentation, I couldn't get much out of it, as I didn't understand the regexes themselves.
        Thanks for translating these for me into understandable language :-)

        While testing our software I ran into the mentioned bugs and falsely blamed Boost for that.

        It's also interesting to see that Boost and PCRE have quite a lot in common. The author of Boost also made a statement about why he didn't stick to Perl5 regexes. He based his position on an article by Larry Wall, where Larry states that Perl6 will probably have quite big changes in the regex syntax, and so Perl5 doesn't in fact have a real standard. Sounds to me like Perl6 is reinventing regex!?

        I admit I haven't read the article completely yet, but do you guys know if there /will be|is/ a book about Perl6 regex syntax?

          The author of Boost is quite right. Perl doesn't have a standard, and regexes in Perl6 will be different (although there's a promise Perl5 style regexes will work as well).

          Read about the Perl6 regexes in Apocalypse 5: Pattern Matching.

          There's no book about Perl6 regexes yet - but no doubt O'Reilly or some other publisher is willing to publish one as soon as someone can convince them they could write a good book about them. (Perl publishers will be very happy with Perl6 - it gives them the opportunity to sell their books again).