Hello Thank you all so much for the helpful suggestions. I will need some time to fully digest them since I am still learning perl :)

Basically, my script identifies the number of false positive words related to management guidance. I need to do this so I don't have to go through all the financial filings. So through manual processing, I figured out words that seem to be related to guidance but do not have anything to do with the actual guidance. The regex code that I posted is that list that I compiled. By counting the number of false positive words I know that this filing is irrelevant and I will not have read it later for processing.

I have changed the code a bit and used File::Map to speed it up but I am not sure if I am doing it right. Also, someone asked if the regex worked. Yes, regex works but it is slow and I am trying to make it faster.

map_file my($data), $filing; $fcount=()=$data=~m/outlook\s+for\s+any\s+rating|(?:rating|if\s+on\ +s+negative|Microsoft|suggesting\s+an|may\s+contain\s+statements\s+abo +ut\s+future\s+events\,|business\s+conditions\s+and\s+the)\s+outlook|g +uidance\s+(?:to\s+approve|facility) |(?:authoritative|revenue\s+recognition|invaluable\s ++practical|valuable|regulatory|technical|under\s+the|staff\'s|judicia +l|SEC|FDA|Treasury(?:\s+Department)?|specific|implementation|their|go +vernment|any\s+ruling|college|absent|\s+his|interim|intrepretive|tran +sition|administrative|procedural|related|applicable|accounting|defini +tive|superceding|IRS|Internal\s+Revenue\s+Service|valued|EITF\s+accou +nting)\s+guidance |guidance\s+(?:and\s+rules|promulgated(?:\s+thereund +er)?|in\s+SFAS)|(?:provided|issued)\s+by\s+(?:the\s+)?(?:SEC|Securiti +es\s+and\s+Exchange\s+Commission|Internal\s+Revenue\s+Service|Secreta +ry|United\s+States|Financial\s+Accounting) |(?:other|applicable)\s+guidance\s+issued|according\ +s+to\s+the\s+guidance\s+contained|provide\s+guidance\s+to\s+directors +|receiving\s+guidance |(?:current|other)\s+guidance\s+(?:under|from)|assum +es\s+guidance\s+of\s+(?:the|a)\s+(?:company|board|talented\s+team|com +pensation)|guidance\s+(?:system|software|technology) /xig;

I am also attaching some sample text

http://sec.gov/Archives/edgar/data/1011737/0001193125-06-122041.txt

http://sec.gov/Archives/edgar/data/1012270/0001104659-07-059430.txt

http://sec.gov/Archives/edgar/data/1016281/0001104659-03-016871.txt

http://sec.gov/Archives/edgar/data/1166036/0001104659-09-021080.txt

http://sec.gov/Archives/edgar/data/1019361/0001019361-04-000007.txt

http://sec.gov/Archives/edgar/data/1013934/0000950136-04-003588.txt

Thank you all again for everything!


In reply to Re: Help with speeding up regex by eversuhoshin
in thread Help with speeding up regex by eversuhoshin

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.