Schuk has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow monks,

I'm in need of your wisdom. I am currently involved in a project that needs a regex filter.

I joined this project quite late, and it is already implemented in C++. I assume that is a matter of speed and of interfaces to other software. Anyway, as I don't have the expertise to make decisions there, I take it as it is.
I only have to use this thing.

The developers used Boost as the regex engine. While testing it I found some limitations:

  • Lookaheads
  • controlling greedy quantifier
  • ... and probably more

    My problem is whether it is reasonable to convince my boss to implement a full Perl-compatible regex library.

    Perhaps there is a way to simulate the effect of lookaheads, because I would need good arguments to use PCRE instead of Boost.

    What arguments could I use for preferring a Perl-style regex?
    I haven't found a lot of uses for lookaheads yet, but I am afraid that in 2 months I will be sitting here with Boost, unable to use a regex that would have been easy with Perl.

    Our developer told me that he already compared PCRE to Boost, and Boost won because of its speed.

    Hope you can help me

    Schuk

    Re: Regex libraries
    by Anonymous Monk on Dec 29, 2004 at 15:53 UTC

      My problem is whether it is reasonable to convince my boss to implement a full Perl-compatible regex library.

      Maybe, maybe not. But how should we know? It would entirely depend on what kind of regexes you need. It's kind of hard to convince your boss to assign resources to adding PCRE to your product because "it has lookahead" when you yourself haven't found a lot of uses for it.

      The last thing you should do to get Perl accepted in the workplace is to act like a zealot. If Perl is needed, it will come. But if it can be done with Boost, and Boost is faster, then I wouldn't push for anything else - unless I had arguments of my own.

        You're right.
        If I don't have an important feature in mind that I would miss in the future, it's quite useless to whine about something I don't have and actually don't know how to apply correctly.
        Perhaps I expressed myself unclearly (sorry for that). I do have one or two cases where lookaheads are useful, but I still think I could also write them in Boost. Unfortunately I haven't found out how yet:

        Perl goes: /(?=.*\.)/

        Boost goes: /[^.]*/

        Actually I just realised that the second regex, without lookahead, is much more useful because it's more flexible. I simply didn't think hard enough about my problem.

        I should take more time to think about problems. One week still isn't enough ;-) I should probably drop my ideas if I haven't solved them in one day.
        The obvious often hides behind the complicated.

        Anyhow. I just made the decision to use the library which is currently implemented. I was mostly frightened of ending up with something I wouldn't be able to work with. And I needed to see this written down somewhere *g

        Thanks for pointing me in the right direction

          /(?=.*\.)/ and /[^.]*/ are not the same thing at all.

          'booga'  =~ /[^.]*/     matches.
          'booga.' =~ /[^.]*/     matches.
          'booga'  =~ /(?=.*\.)/  does not match.
          'booga.' =~ /(?=.*\.)/  matches.
          '4.6.8'  =~ /[^.]*\.(\d)/         returns 6 in $1.
          '4.6.8'  =~ /(?=.*\.)(\d)/        returns 4 in $1.
          '4.6.8'  =~ /(?=(.*\.))\1(\d)/    returns 8 in $2. .* matched "4.6".
          '4.6.8'  =~ /.*\.(\d)/            returns 8 in $1. Lookahead not needed.
          '4.6.8'  =~ /(?:(?!\.).)*\.(\d)/  returns 6 in $1. You want negative lookahead.

          /(?=.*\.)/ requires a "." in the string. /[^.]*/ does not require a "." in the string. /[^.]*/ is equivalent to the negative lookahead /(?:(?!\.).)*/. The advantage of the negative lookahead version is that you can negatively match more than one character: /(?:(?!$re).)*/. If I only wanted to negatively match one character, I'd use [^...], even in Perl.
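The distinction above can be checked mechanically. The sketch below uses Python's re module purely as a stand-in engine (an assumption; the thread itself concerns Perl and Boost), since it treats these particular patterns the same way. The helper name run_before is hypothetical and only illustrates the multi-character form /(?:(?!$re).)*/.

```python
import re

# Positive lookahead: asserts that a "." exists somewhere ahead, consumes nothing.
needs_dot = re.compile(r"(?=.*\.)")

# Negated character class: a (possibly empty) run of non-dot characters.
no_dots = re.compile(r"[^.]*")

# The same run spelled with a negative lookahead, one character at a time.
tempered = re.compile(r"(?:(?!\.).)*")


def run_before(stop: str, text: str) -> str:
    """Return the leading part of text that comes before the first `stop`.

    `stop` may be longer than one character, which a negated
    character class cannot express.
    """
    pattern = re.compile(r"(?:(?!%s).)*" % re.escape(stop))
    return pattern.match(text).group(0)


print(needs_dot.search("booga") is not None)   # the lookahead needs a "."
print(no_dots.match("4.6.8").group(0))         # run of non-dot characters
print(tempered.match("4.6.8").group(0))        # same run via negative lookahead
print(run_before("END", "abcENDdef"))          # multi-character stop string
```

For a single excluded character the negated class stays the simpler choice; the lookahead form only pays off once the thing to exclude is a longer string or a sub-pattern.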

          Side thought: Hmm, it would be nice if /(?:(?!$re).)*/ could be shortened to /(?^$re)/, since that's the typical use of a negative lookahead.

    Re: Regex libraries
    by osunderdog (Deacon) on Dec 29, 2004 at 15:53 UTC

      Doesn't sound like you have a great case. You might have a better case in 2 months when you need a feature of PCRE that isn't available in Boost. :) The only suggestion I would make (if you think it's the right thing to do) is to propose a thin abstraction layer over Boost/PCRE that would allow you to switch between Boost and PCRE where needed.
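A minimal sketch of what such a thin layer might look like, written in Python for brevity rather than in the project's C++. Every name here (RegexBackend, PlainBackend, CachingBackend, RegexFilter) is hypothetical, and both backends wrap Python's re module as stand-ins for the two real engines; the point is only that calling code talks to one small interface.

```python
import re
from typing import Dict, Optional, Protocol


class RegexBackend(Protocol):
    """The only operations the filter needs from an engine (hypothetical)."""

    def first_match(self, pattern: str, text: str) -> Optional[str]:
        ...


class PlainBackend:
    """Stand-in for the engine already in place: compiles on every call."""

    def first_match(self, pattern: str, text: str) -> Optional[str]:
        found = re.search(pattern, text)
        return found.group(0) if found else None


class CachingBackend:
    """Stand-in for a second engine: same interface, compiled patterns cached."""

    def __init__(self) -> None:
        self._compiled: Dict[str, re.Pattern] = {}

    def first_match(self, pattern: str, text: str) -> Optional[str]:
        if pattern not in self._compiled:
            self._compiled[pattern] = re.compile(pattern)
        found = self._compiled[pattern].search(text)
        return found.group(0) if found else None


class RegexFilter:
    """Depends only on the interface, so the backend can be swapped later."""

    def __init__(self, backend: RegexBackend) -> None:
        self._backend = backend

    def accepts(self, pattern: str, text: str) -> bool:
        return self._backend.first_match(pattern, text) is not None
```

Swapping engines then means constructing RegexFilter with a different backend object; the code that calls accepts() does not change.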

      Every feature impedes performance.


      "Look, Shiny Things!" is not a better business strategy than compatibility and reuse.


      OSUnderdog

      Considered by osunderdog: "Slightly OT. Move to Meditations".
      Unconsidered by davido: This followup would be out of context if it got moved to Meditations without the rest of the thread. Final vote (keep/edit/delete): 5/1/0.

    Re: Regex libraries
    by eyepopslikeamosquito (Archbishop) on Dec 29, 2004 at 22:38 UTC

      It depends on your requirements. You didn't state them explicitly, but presumably performance is more important to you than portability and features. PCRE is much more portable than Boost and is very widely used in high-profile products (PHP, Python, Apache, ...); indeed, it's a de facto standard.

      One possibility is to support both Boost and PCRE. As for the resources required to add PCRE, I added PCRE to our cross-platform library (11 different platforms) in, oh, a day or so. It is ANSI C and built and ran fine on all 11 platforms with no problems. Compare that to STLport, which has been a portability nightmare for us, taking months of effort. Now, Boost is a fine library, written by ANSI C++ experts ... but are they regex experts?

      Out of curiosity, which compilers/platforms are you currently supporting with boost?

      Update:

      As I was testing this thing I found some limitations: Lookaheads, controlling greedy quantifier, ...

      I've never used the Boost regex library but am keenly interested in it and just took a look at its doco, which states that it supports both Perl 5-style lookahead assertions (see the "Forward Lookahead Asserts" section) and non-greedy quantifiers (see the "Non-greedy repeats" section). Can you give more details of the problems you were having with these?

        Well, as osunderdog already stated:

        Every feature impedes performance

        I would also be very keen on features, especially if you have the expertise to apply them and can write more intelligent regexes that take longer but are more accurate.
        As we will use this filter in a time-critical application and portability is not an issue, the developer focused on performance.

        The whole process of updating the filter with a new or an additional regex library would cost time and money, so we have to weigh the pros and cons first. But I will at least propose keeping PCRE in mind, and if we hit the limits of the current library we should switch to PCRE or add it.

        Out of curiosity, which compilers/platforms are you currently supporting with boost?

        Unfortunately I don't know that directly. But I can say that it will run on several x86 machines running SUSE Linux. The communication is implemented in CORBA.

        I just reread the documentation and am really surprised, because my version doesn't allow this. Unfortunately I can't ask the developer why this is the case because he is on vacation.

        I just checked, and the following expressions produce an error for me:

      • ba{2}
      • (?:abc)
      • (?=abc)
      • (?!abc)
      • (?>expression)
      • .*? and any variation of quantifier

        I think I should first check this out and bug you again once I know more. Sorry for that, as this seems to be a software error and has nothing to do with Boost.
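One way to narrow such a problem down is to feed each construct to the engine on its own and record which ones it rejects. The sketch below does that against Python's re module, used only as a stand-in (an assumption; the filter in question wraps Boost), and the function name probe is hypothetical. The same loop could be pointed at whatever compile call the filter actually exposes.

```python
import re
from typing import Dict, Iterable


def probe(patterns: Iterable[str]) -> Dict[str, bool]:
    """Map each pattern to True if the engine compiles it, False otherwise."""
    results = {}
    for pattern in patterns:
        try:
            re.compile(pattern)
            results[pattern] = True
        except re.error:
            results[pattern] = False
    return results


# The constructs from the list above.
CONSTRUCTS = ["ba{2}", "(?:abc)", "(?=abc)", "(?!abc)", "(?>expression)", ".*?"]

for pattern, ok in probe(CONSTRUCTS).items():
    print(f"{pattern:<16} {'ok' if ok else 'rejected'}")
```

Note that in this stand-in engine the result for (?>expression) depends on the Python version, since atomic groups were only added to re in Python 3.11. If even ba{2} is rejected by the filter, the likelier culprit is how the pattern reaches the engine (escaping, or a different syntax mode) rather than the engine's feature set.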

          According to the Boost.Regex Regular Expression Syntax, all the constructs you checked are supported. From Boost.Regex Standards Conformance, we learn that the unsupported Perl features are:
          1. \N{name} - but one can use [:name:]
          2. \pP and \PP
          3. (?imsx-imsx)
          4. Lookbehind
          5. (?{ }) and (??{ })
          6. (?(condition)yes-pattern) and (?(condition)yes-pattern|no-pattern)
          That in my opinion isn't too bad. For point 1, there is an alternative, so no loss of functionality. Also note that one reason \N isn't supported is that \N isn't a regex construct - it's a string interpolation thing. PCRE doesn't support \N either.

          I've written a lot of regexes, and seen even more, but I've never had the need to use the constructs of point 2, and I've never seen them used either. In PCRE, support for \p and \P is limited, and only available if specially built with Unicode character property support. (PCRE does not have full UTF-8 support).

          Point 3 might be a nuisance, but personally I've never used them to set flags for parts of the expressions - I only use them implicitly when interpolating a qr construct - a feature that cannot be handled by a library. And with Boost, you can set many flags when matching, even more flags than with Perl. Specifically, one can set flags that mimic /m and /s. (Or rather, one needs to set flags to turn the /m and /s behaviours off, if I understand the page correctly). /i can be achieved by first lowercasing what you are matching against, provided the pattern itself is lowercase. There doesn't seem to be an equivalent to /x, but we were happy without it for years in Perl as well, and /x doesn't provide functionality - just readability.
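To make the flag behaviours concrete, here is a small sketch using Python's re module as a stand-in (an assumption; Boost's own match flags are named differently and are not shown here). It compares the lowercasing trick with a real case-insensitive flag and shows what the /m and /s equivalents change.

```python
import re

text = "First line\nSECOND line"

# /i: lowercasing the subject works only because the pattern is lowercase too.
by_lowercasing = re.search(r"second", text.lower()) is not None
by_flag = re.search(r"second", text, re.IGNORECASE) is not None

# /m: with the flag, ^ also matches right after a newline inside the string.
anchor_default = re.search(r"^SECOND", text) is not None
anchor_multiline = re.search(r"^SECOND", text, re.MULTILINE) is not None

# /s: with the flag, . also matches a newline.
dot_default = re.search(r"line.SECOND", text) is not None
dot_dotall = re.search(r"line.SECOND", text, re.DOTALL) is not None

# The lowercasing trick changes what gets returned, not just whether it matches.
captured_lowercased = re.search(r"second", text.lower()).group(0)
captured_flag = re.search(r"second", text, re.IGNORECASE).group(0)
```

The last two lines show the cost of the lowercasing workaround: the match succeeds either way, but the captured text no longer has the original case.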

          Lookbehind in Perl regexes is fairly limited anyway, as you can only match against fixed width strings. Sure, it's a miss, but it's a miss of a limited thing.
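Where lookbehind is missing, a common workaround is to match the preceding text for real and capture only the part of interest. The sketch below shows both forms side by side in Python's re module, used as a stand-in (an assumption); it happens to share the fixed-width restriction on lookbehind mentioned above. The function names are hypothetical.

```python
import re
from typing import List


def amounts_with_lookbehind(text: str) -> List[str]:
    """Digits preceded by "$", found with a fixed-width lookbehind."""
    return re.findall(r"(?<=\$)\d+", text)


def amounts_with_capture(text: str) -> List[str]:
    """Same result without lookbehind: consume the "$", capture what follows."""
    return re.findall(r"\$(\d+)", text)


def compiles(pattern: str) -> bool:
    """True if this engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


sample = "pay $15 or $200, not 30"
print(amounts_with_lookbehind(sample))
print(amounts_with_capture(sample))
print(compiles(r"(?<=\$+)\d+"))  # variable-width lookbehind
```

The capture version differs in one respect: the overall match now includes the "$", which matters if the match position or a substitution depends on it.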

          Not being able to execute code isn't a limitation of the library - it's a limitation because it's a library. Only if regexes are an integral part of the language is such a thing possible, as it requires access to the variables. PCRE doesn't support (?{...}) and (??{...}) either, although it does have some other features that might do what you want to do with those coding constructs.

          The last point is a miss because you can't use (?(?{...})yes|no), but then you are using code again, and that wouldn't be possible anyway due to the previous argument.

          Note also that the last two points are still marked as highly experimental, and p5p reserves the right to remove or change them without any notice.

          Note that I base this purely on what Boost and PCRE say about themselves on their web and manual pages. I've never used any of the libraries myself.