jayto has asked for the wisdom of the Perl Monks concerning the following question:

Evening Monks, I come to you with another regex problem.
I am trying to parse documents like this:

ICS-CERT ADVISORY ICSA-11-273-03A—ROCKWELL RSLOGIX DENIAL-OF-SERVICE VULNERABILITY
October 06, 2011
OVERVIEW
This Updated Advisory is a follow-up to the original Advisory titled “ICSA-11-273-03—Rockwell RSLogix denial-of-Service Vulnerability” that was published September 30, 2011 on the ICS-CERT web page. ICS-CERT is aware of a public report of a denial-of-service vulnerability in Rockwell Automation’s RSLogix application.
--------- Begin Update X Part 1 of 2 --------
Rockwell has produced a patch that mitigates this vulnerability for all affected versions of FactoryTalk Services Platform and RSLogix 5000.
--------- End Update X Part 1 of 2 ----------
AFFECTED PRODUCTS
According to Rockwell Automation, the following products are affected:
• RSLogix 5000 software Versions V17, V18, and V19
• All FactoryTalk-branded software of specific Versions CPR9 and CPR9-SR1 through SR4.
IMPACT
Successful exploitation of this vulnerability could result in a denial-of-service. Impact to individual organizations depends on many factors that are unique to each organization. ICS-CERT recommends that organizations evaluate the impact of this vulnerability based on their operational environment, architecture, and product implementation.
BACKGROUND
Rockwell Automation provides industrial automation control and information products worldwide, across a wide range of industries. RSLogix 5000 is a programming suite used to develop interfaces within the control system environment. The FactoryTalk Services Platform is a collection of production and performance management systems.
ICS-CERT Advisory ICSA-11-273-03A Page 1 of 3
VULNERABILITY CHARACTERIZATION
VULNERABILITY OVERVIEW
A Read Access violation can occur when a specially crafted packet is sent to open ports running the software. The open TCP ports are as follows:
• 1330 • 4242 • 6543 • 1331 • 4445 • 9111 • 1332 • 4446 • 60093 • 4241 • 5241 • 49281
a A CVE-2011-3489 has been assigned to this vulnerability in the National Vulnerability Database (NVD). CVSS base score of 5.0 has been assigned.
VULNERABILITY DETAILS
EXPLOITABILITY
This vulnerability is remotely exploitable.
EXISTENCE OF EXPLOIT
Public exploits are known to target this vulnerability.
DIFFICULTY
An attacker with a low skill level can create the denial-of-service.
MITIGATION
--------- Begin Update X Part 2 of 2 --------
Rockwell Automation recommends that concerned customers using FactoryTalk Services Platform Versions CPR9 and CPR9-SR1 through SR4 and customers using RSLogix versions V17, V18, and V19 b apply patch AID 458689. Customers using FactoryTalk Services Platform CPR7 and earlier, and RSLogix 5000 V16 and earlier, are not affected by this vulnerability. For full patching instructions and additional information, refer to Rockwell Automation Security Advisory KB 456144. http://rockwellautomation.custhelp.com/app/answers/detail/a_id/456144.
--------- End Update X Part 2 of 2 ----------
a . http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2011-3489, website last accessed October 06, 2011.
b. http://rockwellautomation.custhelp.com/app/answers/detail/a_id/458689, website last accessed October 06, 2011.
ICS-CERT Advisory ICSA-11-273-03A Page 2 of 3
ICS-CERT encourages asset owners to take additional defensive measures to protect against this and other cybersecurity risks.
• Minimize network exposure for all control system devices. Critical devices should not directly face the Internet.
• Locate control system networks and remote devices behind firewalls, and isolate them from the business network.
• When remote access is required, use secure methods such as Virtual Private Networks (VPNs), recognizing that VPN is only as secure as the connected devices.
The Control Systems Security Program (CSSP) also provides a section for control system security recommended practices on the CSSP web page. Several recommended practices are available for reading and download, including Improving Industrial Control Systems Cybersecurity with Defense-in-Depth c Strategies. ICS-CERT reminds organizations to perform proper impact analysis and risk assessment prior to taking defensive measures. Organizations observing any suspected malicious activity should follow their established internal procedures and report their findings to ICS-CERT for tracking and correlation against other incidents.
ICS-CERT CONTACT
For any questions related to this report, please contact ICS-CERT at: ics-cert@dhs.gov E-mail: Toll Free: 1-877-776-7585 For CSSP Information and Incident Reporting: www.ics-cert.org
DOCUMENT FAQ
What is an ICS-CERT Advisory? An ICS-CERT Advisory is intended to provide awareness or solicit feedback from critical infrastructure owners and operators concerning ongoing cyber events or activity with the potential to impact critical infrastructure computing networks
When is vulnerability attribution provided to researchers? Attribution for vulnerability discovery is provided when prior coordination has occurred with either the vendor, ICS-CERT, or other coordinating entity. ICS-CERT encourages researchers to coordinate vulnerability details before public release. The public release of vulnerability details prior to the development of proper mitigations may put industrial control systems (ICSs) and the public at avoidable risk.
c. CSSP Recommended Practices, http://www.us-cert.gov/control_systems/practices/Recommended_Practices.html, website last accessed October 06, 2011.
ICS-CERT Advisory ICSA-11-273-03A Page 3 of 3

I am trying to pull out all the data, at the top of the document, between "OVERVIEW" and "AFFECTED PRODUCTS".

This is my regular expression (assume $stdout holds the content above):

if( $stdout =~ /(?:OVERVIEW|SUMMARY)(.+)(?:AFFECTED\sPRODUCTS|BACKGROUND)/s ) {
    print "$1\n";
}

...and this is my output

This Updated Advisory is a follow-up to the original Advisory titled “ICSA-11-273-03—Rockwell RSLogix denial-of-Service Vulnerability” that was published September 30, 2011 on the ICS-CERT web page. ICS-CERT is aware of a public report of a denial-of-service vulnerability in Rockwell Automation’s RSLogix application.
--------- Begin Update X Part 1 of 2 --------
Rockwell has produced a patch that mitigates this vulnerability for all affected versions of FactoryTalk Services Platform and RSLogix 5000.
--------- End Update X Part 1 of 2 ----------
AFFECTED PRODUCTS
According to Rockwell Automation, the following products are affected:
• RSLogix 5000 software Versions V17, V18, and V19
• All FactoryTalk-branded software of specific Versions CPR9 and CPR9-SR1 through SR4.
IMPACT
Successful exploitation of this vulnerability could result in a denial-of-service. Impact to individual organizations depends on many factors that are unique to each organization. ICS-CERT recommends that organizations evaluate the impact of this vulnerability based on their operational environment, architecture, and product implementation.

I wanted the output to be:

This Updated Advisory is a follow-up to the original Advisory titled “ICSA-11-273-03—Rockwell RSLogix denial-of-Service Vulnerability” that was published September 30, 2011 on the ICS-CERT web page. ICS-CERT is aware of a public report of a denial-of-service vulnerability in Rockwell Automation’s RSLogix application.
--------- Begin Update X Part 1 of 2 --------
Rockwell has produced a patch that mitigates this vulnerability for all affected versions of FactoryTalk Services Platform and RSLogix 5000.
--------- End Update X Part 1 of 2 ----------

What am I doing wrong?

Replies are listed 'Best First'.
Re: Regex problem
by frozenwithjoy (Priest) on Jul 31, 2012 at 21:53 UTC
    Using the ol' flip-flop/Range Operators worked for me:
    while (<>) {
        if ( / ^ OVERVIEW \s $/x .. / ^ AFFECTED\ PRODUCTS \s $/x ) {
            print $_ unless /( ^ OVERVIEW \s $)|( ^ AFFECTED\ PRODUCTS \s $)/x;
        }
    }

    I added \s to each pattern because every line ends with a space. Here is the output:

    This Updated Advisory is a follow-up to the original Advisory titled “ICSA-11-273-03—Rockwell RSLogix denial-of-Service Vulnerability” that was published September 30, 2011 on the ICS-CERT web page. ICS-CERT is aware of a public report of a denial-of-service vulnerability in Rockwell Automation’s RSLogix application.
    --------- Begin Update X Part 1 of 2 --------
    Rockwell has produced a patch that mitigates this vulnerability for all affected versions of FactoryTalk Services Platform and RSLogix 5000.
    --------- End Update X Part 1 of 2 ----------

    EDIT -- a cleaner version:

    use 5.010;    # enables say

    while (<>) {
        if ( / ( ^ OVERVIEW \s $ ) /x .. / ( ^ AFFECTED\ PRODUCTS \s $ ) /x ) {
            chomp;
            say $_ unless $_ eq $1;
        }
    }
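A variation on the flip-flop approach, offered as a sketch rather than a drop-in replacement: in scalar context the range operator returns a sequence number, which is 1 on the line that opens the range and ends in "E0" on the line that closes it. That makes it possible to drop both heading lines without testing them a second time. The sample text, the `$sample` and `@overview` names, and the in-memory filehandle are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical miniature advisory; each heading sits on its own line
# and carries a trailing space, as in the real input.
my $sample = join '',
    "TITLE \n",
    "OVERVIEW \n",
    "First overview line. \n",
    "Second overview line. \n",
    "AFFECTED PRODUCTS \n",
    "Product list. \n",
    "BACKGROUND \n";

open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";

my @overview;
while ( my $line = <$fh> ) {
    # Scalar-context range operator: false outside the range, 1 on the
    # opening line, and a value ending in "E0" on the closing line.
    my $seq = ( $line =~ /^OVERVIEW\s*$/ ) .. ( $line =~ /^AFFECTED PRODUCTS\s*$/ );
    next unless $seq;                          # outside the range
    next if $seq == 1 || $seq =~ /E0$/;        # skip both heading lines
    chomp $line;
    push @overview, $line;
}
close $fh;

print "$_\n" for @overview;
```

With the sample above, only the two lines between the headings are collected.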
Re: Regex problem
by ww (Archbishop) on Jul 31, 2012 at 22:38 UTC

    Another way...

    #!/usr/bin/perl
    use 5.014;
    # 984660

    my $stdout;   # I hate this var name; too confuse-able... but it's your given
    my (@string) = <DATA>;
    for $_(@string) {
        $stdout .= $_;
    }

    if ( $stdout =~ /
                     (?:OVERVIEW)            # find start point (and KISS!)
                     (.+)                    # match pretty much anything
                     (?=AFFECTED\sPRODUCTS)  # up to AFFECTED PRODUCTS -- USING A LOOKAHEAD, (?=...)
                    /xs                      # extended notation, single line mode (ie, . matches newlines)
       )                                     # close the conditional - not part of regex
    {
        print "$1\n";
    }
    __DATA__
    ...

    Output:

    This Updated Advisory is a follow-up to the original Advisory titled “ICSA-11-273-03—Rockwell RSLogix denial-of-Service Vulnerability” that was published September 30, 2011 on the ICS-CERT web page. ICS-CERT is aware of a public report of a denial-of-service vulnerability in Rockwell Automation’s RSLogix application.
    --------- Begin Update X Part 1 of 2 --------
    Rockwell has produced a patch that mitigates this vulnerability for all affected versions of FactoryTalk Services Platform and RSLogix 5000.
    --------- End Update X Part 1 of 2 ----------
Re: Regex problem
by aaron_baugher (Curate) on Aug 01, 2012 at 00:20 UTC
    if( $stdout =~ /(?:OVERVIEW|SUMMARY)(.+)(?:AFFECTED\sPRODUCTS|BACKGROUND)/s ) {
        print "$1\n";
    }

    You said you want to pick out everything between "OVERVIEW" and "AFFECTED PRODUCTS", so why does your regex also look for "SUMMARY" and "BACKGROUND"?

    Here's what's happening: Your first grouping looks for OVERVIEW or SUMMARY. Your last grouping looks for AFFECTED PRODUCTS or BACKGROUND. Between those, your capture of (.+) is greedy, so it will try to match as much as possible. So your first grouping matches OVERVIEW, then your capture greedily matches all the way to the end of the string, then starts working backwards until it finds a point where your last grouping can match. Since BACKGROUND comes later in the string than AFFECTED PRODUCTS, BACKGROUND gets matched.

    To put it another way, by matching BACKGROUND instead of AFFECTED PRODUCTS, the regex is able to give the longest possible string to your greedy capture in the middle. To fix it, the short answer is to make that captured match non-greedy by changing it to (.+?). A better answer would require understanding why you have those other words in there if you don't need to match them.
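The difference can be seen on a much shorter string. This is a minimal sketch: `$text` is a made-up stand-in for the advisory in which both end markers occur, with BACKGROUND after AFFECTED PRODUCTS, and `$greedy` and `$lazy` are illustrative names.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the advisory: both end markers are present,
# and BACKGROUND comes after AFFECTED PRODUCTS.
my $text = "OVERVIEW aaa AFFECTED PRODUCTS bbb IMPACT ccc BACKGROUND ddd";

# Greedy capture: backtracks from the end of the string, so it stops at
# the last place the end alternation can match (BACKGROUND).
my ($greedy) = $text =~ /(?:OVERVIEW|SUMMARY)(.+)(?:AFFECTED\sPRODUCTS|BACKGROUND)/s;

# Non-greedy capture: grows one character at a time, so it stops at the
# first place the end alternation can match (AFFECTED PRODUCTS).
my ($lazy) = $text =~ /(?:OVERVIEW|SUMMARY)(.+?)(?:AFFECTED\sPRODUCTS|BACKGROUND)/s;

print "greedy: [$greedy]\n";   # greedy: [ aaa AFFECTED PRODUCTS bbb IMPACT ccc ]
print "lazy:   [$lazy]\n";     # lazy:   [ aaa ]
```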

    Aaron B.
    Available for small or large Perl jobs; see my home node.

      Because the documents that I am parsing have either an AFFECTED PRODUCTS section or a BACKGROUND section... I didn't realize that the ones with AFFECTED PRODUCTS also have BACKGROUND sections.

      Thank you everyone for your help.
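For documents that may carry either end heading, one possible shape is a small helper that combines the non-greedy capture with line anchors, so that the words are only treated as headings when they stand alone on a line. This is a sketch under assumptions: `extract_overview` is a hypothetical name, the two sample documents are invented, and headings are assumed to sit on their own lines, optionally followed by spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between the first OVERVIEW or
# SUMMARY heading and whichever end heading appears first after it,
# or undef when no such pair is found.
sub extract_overview {
    my ($doc) = @_;
    return $doc =~ /
        ^ (?:OVERVIEW|SUMMARY) [ ]* \n                # start heading on its own line
        (.+?)                                         # as little as possible
        ^ (?:AFFECTED[ ]PRODUCTS|BACKGROUND) [ ]* $   # first end heading
    /xms ? $1 : undef;
}

# Two invented documents: one with both end headings, one with only BACKGROUND.
my $with_both       = "TITLE \nOVERVIEW \nText A. \nAFFECTED PRODUCTS \nList. \nBACKGROUND \nMore. \n";
my $background_only = "TITLE \nSUMMARY \nText B. \nBACKGROUND \nMore. \n";

for my $doc ( $with_both, $background_only ) {
    my $overview = extract_overview($doc);
    print defined $overview ? $overview : "(no overview found)\n";
}
```

The /m flag lets ^ and $ match at line boundaries inside the string, while /s still lets the capture run across newlines.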
Re: Regex problem
by Anonymous Monk on Jul 31, 2012 at 21:35 UTC
Re: Regex problem
by davido (Cardinal) on Jul 31, 2012 at 21:39 UTC

    Update: It appears that this is not an accurate assessment as to what is going wrong. When will I learn to not get involved in regex discussions without testing my theory? ;) This time I'm going to use the excuse that having my backspace key fall off this afternoon has me a little discombobulated. ...off to the repair shop I guess. ;)

    Now sit tight a few minutes and watch for someone else to post a correct answer, below.....

    ?

    (I'm not asking a question.)

    Your regex contains the submatch construct (.+), which will greedily try to match as much as possible. That means that if the (?:AFFECTED......) submatch can possibly match in several places within the target text, the last one will be the one used, and everything between gets sucked into the greedy .+.

    A non-greedy version of .+ is .+?, which provides a cue to the little engine that could(n't) that it needs to match as little as possible instead of as much as possible. Non-greedy matching is not without its own difficulties, but (untested) in this case I think it will help you.

    /(?:OVERVIEW|SUMMARY)(.+?)(?:AFFECTED\sPRODUCTS|BACKGROUND)/s
    # The change is here----^

    See any of perlrequick, perlretut, or perlre for an explanation of greedy/non-greedy matching.


    Dave