in reply to Quick Question about checking what the regex does

Perl Regex Tester care of davido

Re^2: Quick Question about checking what the regex does
by davido (Cardinal) on Aug 20, 2012 at 09:23 UTC

    Unfortunately one of the countermeasures against abuse is that the regex tester limits the size of the regular expression and of the test data. I think the limit is set somewhere around 1k each. The OP's regular expression (and probably his data set) will exceed that limitation.

    There is a GitHub repo where one could fetch the code and make a simple modification to SafeMatchStats.pm (lines 25 and 26) to set an arbitrarily large limit. Then ensure that all dependencies listed in Makefile.PL are installed (skip Plack, as it's only used by the cloud service and isn't required to run locally). Finally, run it as ./retester daemon. The repo is at https://github.com/daoswald/retester.

    If you have the data that you intend to run against this regular expression, you can iterate over the matches using ${^MATCH} to tell you what matched each time.
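    A rough sketch of that idea, not the OP's code: the pattern below is a short stand-in for the full regex, and both the sample text and the helper name list_matches are made up for illustration. The /p modifier is what makes ${^MATCH} available on perls older than 5.20; newer perls ignore the modifier and always set the variable.

```perl
use strict;
use warnings;

# Return every substring matched in $text, in order of appearance.
# The pattern is a hypothetical stand-in for the OP's much longer regex.
sub list_matches {
    my ($text) = @_;
    my @found;
    while (
        $text =~ m/
            (?: authoritative | regulatory | SEC ) \s+ guidance
            | guidance \s+ (?: system | software )
        /xipg
      )
    {
        # ${^MATCH} holds exactly the text consumed by this iteration.
        push @found, ${^MATCH};
    }
    return @found;
}

# Made-up sample text; substitute the real data set here.
my $sample = "New SEC  guidance arrived; the guidance software\n"
           . "ignored regulatory guidance.";

print "[$_]\n" for list_matches($sample);
```

    Printing each match inside brackets makes it easy to see how much whitespace each "\s+" actually consumed.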

    However, I was curious, and in the absence of the original data, I took a shot at unwinding the OP's regex by backing out the possible alternation paths:

    use strict;
    use warnings;

    my $fcount;
    my $data;
    $data = do { local $/ = undef; <DATA>; };

    $fcount = () = $data =~ m/
        outlook\s+for\s+any\s+rating
        |
        (?:
            rating
            | if\s+on\s+negative
            | Microsoft
            | suggesting\s+an
            | may\s+contain\s+statements\s+about\s+future\s+events\,
            | business\s+conditions\s+and\s+the
        )
        \s+outlook|guidance\s+
        (?:to\s+approve|facility)
        |
        (?:
            authoritative
            | revenue\s+recognition
            | invaluable\s+practical
            | valuable
            | regulatory
            | technical
            | under\s+the
            | staff\'s
            | judicial
            | SEC
            | FDA
            | Treasury (?:\s+Department)?
            | specific
            | implementation
            | their
            | government
            | any\s+ruling
            | college
            | absent
            | \s+his
            | interim
            | intrepretive
            | transition
            | administrative
            | procedural
            | related
            | applicable
            | accounting
            | definitive
            | superceding
            | IRS
            | Internal\s+Revenue\s+Service
            | valued
            | EITF\s+accounting
        )
        \s+guidance
        |
        guidance\s+
        (?:
            and\s+rules
            | promulgated(?:\s+thereunder)?
            | in\s+SFAS
        )
        |
        (?:provided|issued)
        \s+by\s+
        (?:the\s+)?
        (?:
            SEC
            | Securities\s+and\s+Exchange\s+Commission
            | Internal\s+Revenue\s+Service
            | Secretary
            | United\s+States
            | Financial\s+Accounting
        )
        |
        (?:other|applicable)
        \s+guidance\s+issued
        |
        according\s+to\s+the\s+guidance\s+contained
        |
        provide\s+guidance\s+to\s+directors
        |
        receiving\s+guidance
        |
        (?:current|other)\s+guidance\s+(?:under|from)
        |
        assumes\s+guidance\s+of\s+
        (?:the|a)\s+
        (?:
            company
            | board
            | talented\s+team
            | compensation
        )
        |
        guidance\s+(?:system|software|technology)
    /xig;

    print $fcount, "\n";

    __DATA__
    outlook for any rating
    rating
    if on negative
    Microsoft
    suggesting an
    may contain statements about future events,
    business conditions and the
    rating outlook to approve
    if on negative outlook to approve
    Microsoft outlook to approve
    suggesting an outlook to approve
    may contain statements about future events, outlook to approve
    business conditions and the outlook to approve
    rating guidance to approve
    if on negative guidance to approve
    Microsoft guidance to approve
    suggesting an guidance to approve
    may contain statements about future events, guidance to approve
    business conditions and the guidance to approve
    rating outlook facility
    if on negative outlook facility
    Microsoft outlook facility
    suggesting an outlook facility
    may contain statements about future events, outlook facility
    business conditions and the outlook facility
    rating guidance facility
    if on negative guidance facility
    Microsoft guidance facility
    suggesting an guidance facility
    may contain statements about future events, guidance facility
    business conditions and the guidance facility
    authoritative guidance
    revenue recognition guidance
    invaluable practical guidance
    valuable guidance
    regulatory guidance
    technical guidance
    under the guidance
    staff's guidance
    judicial guidance
    SEC guidance
    FDA guidance
    Treasury Department guidance
    Treasury guidance
    specific guidance
    implementation guidance
    their guidance
    government guidance
    any ruling guidance
    college guidance
    absent guidance
    his guidance
    interim guidance
    intrepretive guidance
    transition guidance
    administrative guidance
    procedural guidance
    related guidance
    applicable guidance
    accounting guidance
    definitive guidance
    superceding guidance
    IRS guidance
    Internal Revenue Service guidance
    valued guidance
    EITF accounting guidance
    guidance and rules
    guidance promulgated thereunder
    guidance promulgated
    guidance in SFAS
    provided by SEC
    provided by the SEC
    issued by SEC
    issued by the SEC
    provided by Securities and Exchange Commission
    provided by the Securities and Exchange Commission
    issued by Securities and Exchange Commission
    issued by the Securities and Exchange Commission
    provided by Internal Revenue Service
    provided by the Internal Revenue Service
    issued by Internal Revenue Service
    issued by the Internal Revenue Service
    provided by Secretary
    provided by the Secretary
    issued by Secretary
    issued by the Secretary
    provided by United States
    provided by the United States
    issued by United States
    issued by the United States
    provided by Financial Accounting
    provided by the Financial Accounting
    issued by Financial Accounting
    issued by the Financial Accounting
    other guidance issued
    applicable guidance issued
    according to the guidance contained
    provide guidance to directors
    receiving\s+guidance
    current guidance under
    current guidance from
    other guidance under
    other guidance from
    assumes the guidance of the company
    assumes the guidance of a company
    assumes the guidance of the board
    assumes the guidance of a board
    assumes the guidance of the talented team
    assumes the guidance of a talented team
    assumes the guidance of the compensation
    assumes the guidance of a compensation
    guidance system
    guidance software
    guidance technology

    I must have gotten a few of the branches slightly wrong, because it's only matching 99 of the 113 strings that I listed here. Oh, and the regex would actually match far more than that: it uses "\s+" (meaning one or more whitespace characters), and in my input strings I replaced "one or more" with just one.
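    One way to see which candidates fall through is to test each string on its own instead of counting matches over the whole blob. The sketch below is only illustrative: the pattern is a trimmed excerpt of three branches from the regex above, the candidate list is a handful of strings in the same style, and the helper name partition_candidates is hypothetical.

```perl
use strict;
use warnings;

# Trimmed excerpt of the regex above (most alternatives omitted for brevity).
# Note that the top-level "|" separates "...\s+outlook" from "guidance\s+...".
my $re = qr/
    (?: rating | Microsoft ) \s+outlook
    | guidance\s+ (?: to\s+approve | facility )
    | assumes\s+guidance\s+of\s+ (?: the | a ) \s+ (?: company | board )
/xi;

# Split candidate strings into those the pattern matches and those it misses.
sub partition_candidates {
    my ( $pattern, @candidates ) = @_;
    my ( @hit, @miss );
    for my $candidate (@candidates) {
        if   ( $candidate =~ $pattern ) { push @hit,  $candidate }
        else                            { push @miss, $candidate }
    }
    return ( \@hit, \@miss );
}

my @candidates = (
    'rating outlook to approve',
    'Microsoft guidance facility',
    "rating \t\n outlook",                    # \s+ spans any run of whitespace
    'Microsoft',                              # prefix alone, nothing follows
    'assumes the guidance of the company',    # extra "the" is not in the regex
    'assumes guidance of the company',
);

my ( $hit, $miss ) = partition_candidates( $re, @candidates );
print "matched:   [$_]\n" for @$hit;
print "unmatched: [$_]\n" for @$miss;
```

    With this excerpt, a bare prefix such as "Microsoft" and the "assumes the guidance of ..." variants come back unmatched, which may account for part of the shortfall. The tab-and-newline candidate still matches, showing how "\s+" accepts more than a single space.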


    Dave