in reply to Help with speeding up regex
Hello. Thank you all so much for the helpful suggestions. I will need some time to fully digest them since I am still learning Perl :)
Basically, my script counts false-positive phrases related to management guidance, so that I don't have to go through all the financial filings. Through manual processing, I identified phrases that seem to be related to guidance but have nothing to do with actual management guidance. The regex I posted is the list I compiled. By counting the false-positive phrases in a filing, I can tell that the filing is irrelevant and that I will not have to read it later for processing.
I have changed the code a bit and used File::Map to speed it up, but I am not sure if I am doing it right. Also, someone asked whether the regex works. Yes, the regex works, but it is slow and I am trying to make it faster.
use File::Map qw(map_file);   # map_file is exported on request

# Map the filing into memory instead of reading it into a string.
map_file my($data), $filing;

# Count every match; /x lets the alternatives be laid out one per line.
$fcount = () = $data =~ m/
    outlook\s+for\s+any\s+rating
  | (?:rating|if\s+on\s+negative|Microsoft|suggesting\s+an
      |may\s+contain\s+statements\s+about\s+future\s+events\,
      |business\s+conditions\s+and\s+the)\s+outlook
  | guidance\s+(?:to\s+approve|facility)
  | (?:authoritative|revenue\s+recognition|invaluable\s+practical|valuable
      |regulatory|technical|under\s+the|staff\'s|judicial|SEC|FDA
      |Treasury(?:\s+Department)?|specific|implementation|their|government
      |any\s+ruling|college|absent|\s+his|interim|intrepretive|transition
      |administrative|procedural|related|applicable|accounting|definitive
      |superceding|IRS|Internal\s+Revenue\s+Service|valued
      |EITF\s+accounting)\s+guidance
  | guidance\s+(?:and\s+rules|promulgated(?:\s+thereunder)?|in\s+SFAS)
  | (?:provided|issued)\s+by\s+(?:the\s+)?
      (?:SEC|Securities\s+and\s+Exchange\s+Commission
        |Internal\s+Revenue\s+Service|Secretary|United\s+States
        |Financial\s+Accounting)
  | (?:other|applicable)\s+guidance\s+issued
  | according\s+to\s+the\s+guidance\s+contained
  | provide\s+guidance\s+to\s+directors
  | receiving\s+guidance
  | (?:current|other)\s+guidance\s+(?:under|from)
  | assumes\s+guidance\s+of\s+(?:the|a)\s+(?:company|board|talented\s+team|compensation)
  | guidance\s+(?:system|software|technology)
/xig;
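For anyone who wants to try the counting idiom without a filing on disk, here is a minimal sketch. It assumes a shortened pattern (three alternatives picked from the list above), a made-up sample string, and a hypothetical helper named `count_false_positives`; none of these are from my actual script. The pattern is compiled once with `qr//` and the text is passed by reference so that a large string is not copied.

```perl
use strict;
use warnings;

# Shortened, illustrative pattern: a few alternatives from the full list.
# Compiled once with qr// so it can be reused for every filing.
my $false_positive = qr/
    (?:regulatory|accounting|SEC)\s+guidance
  | guidance\s+(?:system|software|technology)
  | rating\s+outlook
/xi;

# Hypothetical helper: count all non-overlapping matches.
# Takes a reference so the (possibly huge) text is not copied.
sub count_false_positives {
    my ($text_ref) = @_;
    # The empty list forces list context; assigning it to a scalar
    # yields the number of matches.
    my $count = () = $$text_ref =~ /$false_positive/g;
    return $count;
}

# Made-up sample text, not taken from a real filing.
my $sample = "We follow SEC guidance and accounting\n"
           . "guidance; the guidance software was upgraded.";

print count_false_positives(\$sample), "\n";   # prints 3
```

The three hits are "SEC guidance", "accounting guidance" (the `\s+` also matches the line break), and "guidance software".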
I am also attaching some sample text:
http://sec.gov/Archives/edgar/data/1011737/0001193125-06-122041.txt
http://sec.gov/Archives/edgar/data/1012270/0001104659-07-059430.txt
http://sec.gov/Archives/edgar/data/1016281/0001104659-03-016871.txt
http://sec.gov/Archives/edgar/data/1166036/0001104659-09-021080.txt
http://sec.gov/Archives/edgar/data/1019361/0001019361-04-000007.txt
http://sec.gov/Archives/edgar/data/1013934/0000950136-04-003588.txt
Thank you all again for everything!
Re^2: Help with speeding up regex
by BrowserUk (Patriarch) on Aug 13, 2012 at 00:08 UTC
by eversuhoshin (Sexton) on Aug 14, 2012 at 02:54 UTC
by BrowserUk (Patriarch) on Aug 14, 2012 at 03:04 UTC