Grabbing the words

agynr has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I have a document of around 10 KB. Given a span inside it (say, from the 1016th to the 1040th character), I want to take characters 1 to 1015 as the left string and characters 1041 to the end of the document as the right string. I then want to turn those strings into regexp patterns, i.e. convert the spaces to \s* and so on. Can you please tell me how to grab the text word by word from the document? Example:

Suppose in the beginning we have
$pattern='(\s*TOTAL\s*FUND\s*OPERATING\s*)(\S.{0,21}?\S)(\s*\s*EXPENSES\s*F\s*)';

Because this pattern occurs many times in the document I am searching, I had to increase the size of the search pattern. After increasing its size, the pattern becomes:

$pattern='(ther\s*Expenses\s*0\s*2600\s*\s*TOTAL\s*FUND\s*OPERATING\s*)(\S.{0,21}?\S)(\s*\s*EXPENSES\s*Fund\s*as\s*a\s*shareholder\s*in\s*underlying\s*fund\s*indirectly\s*bears\s*pro\s*rata\s*)';

My problem is how to build the pattern by picking up one word at a time from the text to the left of the pattern and one word at a time from the text to the right of it. This will be applied only to the target document. Can you please help me with this?
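A minimal sketch of the first half of the question, splitting a document around a character range and turning each side into a whitespace-tolerant pattern. The helper names `split_around` and `text_to_pattern` and the sample text are invented for illustration; they are not from the original post. Words are escaped with `quotemeta` so that characters such as `.` or `%` in the document are matched literally.

```perl
use strict;
use warnings;

# Turn literal text into a whitespace-tolerant regex fragment:
# every word is escaped, and the words are joined with \s*.
# (Hypothetical helper, not from the original post.)
sub text_to_pattern {
    my ($text) = @_;
    my @words = split ' ', $text;    # split on runs of whitespace
    return join '\s*', map { quotemeta($_) } @words;
}

# Split a document around a character range given as a 0-based
# offset and a length. (Hypothetical helper.)
sub split_around {
    my ($doc, $start, $len) = @_;
    my $left   = substr($doc, 0, $start);
    my $middle = substr($doc, $start, $len);
    my $right  = substr($doc, $start + $len);
    return ($left, $middle, $right);
}

# Invented sample text standing in for the real document.
my $doc   = "Other Expenses 0 2600 TOTAL FUND OPERATING 1.25% EXPENSES Fund as a shareholder";
my $value = '1.25%';
my $start = index($doc, $value);

my ($left, $middle, $right) = split_around($doc, $start, length $value);
my $left_pat  = text_to_pattern($left);
my $right_pat = text_to_pattern($right);

print "$left_pat\n$right_pat\n";
```

Note that the question counts characters from 1, while `substr` counts from 0, so "characters 1 to 1015" corresponds to `substr($doc, 0, 1015)`.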


Replies are listed 'Best First'.
Re: Grabbing the words by manav (Scribe) on Mar 22, 2005 at 11:01 UTC
Have you tried anything so far? Please show us some code so we can judge the problem better. Also, have you tried building your algorithm around $` and $'? After a successful regex match, $` holds the text to the left of the match and $' holds the text to the right of it. They are pretty heavy performance-wise, though: perl does not set them up unless your code actually uses them, but once you use them anywhere, Perl makes them available for every regex match in your code, and that is expensive. Manav
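One way to get the same left and right text without touching $` and $' is to use the match offsets in the built-in arrays `@-` and `@+` together with `substr`. The following is only a sketch: the sample text and the simplified pattern are invented for illustration.

```perl
use strict;
use warnings;

# Invented sample text standing in for the real document.
my $doc = "Other Expenses 0 2600 TOTAL FUND OPERATING 1.25% EXPENSES Fund as a shareholder";

my ($before, $value, $after);
if ($doc =~ /TOTAL\s*FUND\s*OPERATING\s*(\S.{0,21}?\S)\s*EXPENSES/) {
    $value  = $1;
    # $-[0] is the offset where the whole match starts,
    # $+[0] is the offset just past its end.
    $before = substr($doc, 0, $-[0]);    # same text as $`
    $after  = substr($doc, $+[0]);       # same text as $'
    print "before: [$before]\nvalue:  [$value]\nafter:  [$after]\n";
}
```

Because the offsets are read only for this one match, no other regex in the program is affected.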
Re^2: Grabbing the words by agynr (Acolyte) on Mar 22, 2005 at 12:18 UTC
I want to get the next word next to the pattern. To increase the size of the pattern I have to get the next word on both the left and the right. Suppose in the beginning we have `$pattern='(\s*TOTAL\s*FUND\s*OPERATING\s*)(\S.{0,21}?\S)(\s*\s*EXPENSES\s*F\s*)';` Because this pattern occurs many times in the document I am searching, I had to increase the size of the search pattern. After increasing its size, the pattern becomes `$pattern='(ther\s*Expenses\s*0\s*2600\s*\s*TOTAL\s*FUND\s*OPERATING\s*)(\S.{0,21}?\S)(\s*\s*EXPENSES\s*Fund\s*as\s*a\s*shareholder\s*in\s*underlying\s*fund\s*indirectly\s*bears\s*pro\s*rata\s*)';` I hope this gives you a better picture of the problem.
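The "next word on both sides" step could be sketched as follows, assuming the text before and after the current match is already available as two strings. The helper `grow_context`, the starting patterns, and the sample text are all hypothetical and only illustrate one possible reading of the question.

```perl
use strict;
use warnings;

# Move one word from each side of the match into the context.
# Returns the shortened left text, the word taken from its end,
# the word taken from the start of the right text, and the
# shortened right text. (Hypothetical helper.)
sub grow_context {
    my ($before, $after) = @_;
    my ($left_word, $right_word) = ('', '');
    # Take the last whitespace-delimited word of $before.
    if ($before =~ s/(\S+)\s*\z//) { $left_word = $1 }
    # Take the first whitespace-delimited word of $after.
    if ($after =~ s/\A\s*(\S+)//)  { $right_word = $1 }
    return ($before, $left_word, $right_word, $after);
}

# Invented sample text around a match.
my ($before, $after) = ("Other Expenses 0 2600 ", " Fund as a shareholder");

my $left_pat   = 'TOTAL\s*FUND\s*OPERATING\s*';
my $middle_pat = '(\S.{0,21}?\S)';
my $right_pat  = '\s*EXPENSES';

my ($left_word, $right_word);
($before, $left_word, $right_word, $after) = grow_context($before, $after);

# Extend each side by the escaped word, joined with \s*.
$left_pat  = quotemeta($left_word) . '\s*' . $left_pat   if length $left_word;
$right_pat = $right_pat . '\s*' . quotemeta($right_word) if length $right_word;

print '(', $left_pat, ')', $middle_pat, '(', $right_pat, ')', "\n";
```

Calling `grow_context` again on the returned `$before` and `$after` adds one more word per side, so the step can be repeated until the pattern matches only once in the document.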
Re^3: Grabbing the words by jhourcle (Prior) on Mar 22, 2005 at 12:39 UTC
I'm confused as to why you have to 'increase the size of the search' (I'm not sure whether you mean you're just changing the allowed range within the second set of capturing parentheses, or something else). Could you also give some sample input and the output you would like? I think that would help me understand the problem (assuming the data you're working with isn't confidential for some reason, which it might be if it's financial reports, judging by the headers). Update: okay, I should've looked at the two regexes more closely: he's adding to the first and third capturing sets. (I still don't understand why that logic is being used, though.)