porsche5k has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perlmonks,

I'm Patrick, this is my first post.

Here is my situation:

I have a text file containing a 10-K, and I'm trying to extract the "Significant Accounting Policies" (SAP) section from it. There is usually a number to the left of SAP.

I would like to capture that number in a variable and add 1 to it, so that I can identify the section that follows and stop collecting the SAP text there. For example, if the number in front of SAP is 3, I would extract 3, add 1, and get 4. The code would then recognize section 4 as the following section, and I could end my extraction at the beginning of section 4.

Here is an example:

(1) Summary of Significant Accounting Policies

Revenue Recognition

Revenue is recognized at the time goods are sold and shipped.

(2) Long-term Debt

****

The number in front of SAP is not always 1, and the following item is not always Long-term Debt.

#!/usr/bin/perl -w
#use strict;
# This program extracts data from an SEC filing, including chunks of text
use Benchmark;
#get the HTML-Format package from the package manager.
use HTML::Formatter;
#get the HTML-TREE from the package manager
use HTML::TreeBuilder;
use HTML::FormatText;
$startTime = new Benchmark;
#This program is written to obtain "Significant Accounting Policies" which are typically found in item 4/4 of the 10k
my $startstring='((^\s*?)Significant Accounting Policies\s\n)';
#Specify keywords/phrases you expect to find within the item (make sure the words phrases are not also in the start or end string)
my $keywords='(estimates|Financial Accounting Standards Board|Revenue Recognition|generally accepted accounting principles|accruals|inventories|straight-line|)'; #horizontal line means alternative match if not found
#Specify the end of the text you are looking for.
# Need to create a flexible end string that uses +1 the variable in front of "SAP"
my $endstring='((^\s*?)Item\s+(10)[\.\-]?[^\d]*?\n)';
my $direct="D:\\ExternalFiles\\Edgar\\tenks\\randomsort";

Re: Making a variable from a number in front of a string
by choroba (Cardinal) on Jul 15, 2016 at 15:54 UTC
    I used a flag variable to tell me whether I'm inside the correct section. Also, using `(...)` in a regex creates a capture group; the captured text is available in `$1`, and you can add 1 to it as to any other variable. Note that you need to backslash the parentheses to give them their literal meaning.
    #!/usr/bin/perl
    use warnings;
    use strict;

    my $printing;
    my $next_section = q();
    while (<DATA>) {
        if (/^\(([0-9]+)\) .*Significant Accounting Policies/) {
            $printing = 1;
            $next_section = 1 + $1;
        } elsif (/^\($next_section\)\s/) {
            undef $printing;
        }
        print if $printing;
    }

    __DATA__
    (1) Preface
    What is this all about.
    (2) Summary of Significant Accounting Policies
    Revenue Recognition
    Revenue is recognized at the time goods are sold and shipped.
    (3) Long-term Debt
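    The same capture-and-increment idea can also be applied to a whole filing held in one string, which is closer to the start/end-string style in the original question. The following is only a sketch under assumptions: `extract_sap` is a hypothetical helper name, the headings are assumed to look like "(N) ..." at the start of a line, and the sample text is made up for illustration.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper (not from the thread): returns the SAP section of
# $text, or nothing if no numbered SAP heading is found.
sub extract_sap {
    my ($text) = @_;

    # Capture the number in front of the heading, e.g. "2" in "(2) Summary of ...".
    return
        unless $text =~ /^[ \t]*\((\d+)\)[^\n]*Significant Accounting Policies/m;
    my $start = $-[0];     # offset where the heading line begins
    my $next  = $1 + 1;    # number of the section that follows

    # Build the end pattern from the incremented number.
    my $end_re = qr/^[ \t]*\($next\)\s/m;

    # Look for the next heading only after the SAP heading itself.
    pos($text) = $+[0];
    my $end = $text =~ /$end_re/g ? $-[0] : length $text;

    return substr $text, $start, $end - $start;
}

# Made-up sample text, assumed to resemble the filing layout.
my $sample = join '',
    "(1) Preface\n",
    "What is this all about.\n",
    "(2) Summary of Significant Accounting Policies\n",
    "Revenue Recognition\n",
    "(3) Long-term Debt\n",
    "(4) Other\n";

print extract_sap($sample);
```

    If the following heading is never found, this sketch returns everything from the SAP heading to the end of the text; whether that is the right fallback depends on the filings.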
      Thanks for the response. Here is a complete newbie question: what goes in "(<DATA>)"?

      Is that the file location?

        The <DATA> file handle makes it possible to read the __DATA__ section at the end of the program, i.e. in this case:
        __DATA__
        (1) Preface
        What is this all about.
        (2) Summary of Significant Accounting Policies
        Revenue Recognition
        Revenue is recognized at the time goods are sold and shipped.
        (3) Long-term Debt
        It simulates another file, and it is a way to quickly test programs without having to create a separate test input file.
        Maybe we didn't give you a crystal clear reply. First, the program should run as posted, using the data segment. Then take the data after __DATA__ and put it into "somefile". In the program, you need to add this at the beginning:

        open FILE, '<', "somefile" or die "unable to open file $!";
        Now put FILE everywhere that DATA appears in the program. You will now be reading "somefile" on the disk instead of the __DATA__ section.
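        Put together, the file-based version might look like the sketch below. It is an illustration under assumptions, not code from the thread: `sap_lines_from_file` is a hypothetical helper name, a lexical filehandle is used in place of the bareword FILE, and a temporary file created with File::Temp stands in for "somefile" so the example runs on its own.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use File::Temp qw(tempfile);

# Hypothetical helper: the flag-variable logic from the earlier reply,
# reading from a named file instead of the __DATA__ section.
sub sap_lines_from_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "unable to open file $path: $!";

    my @kept;
    my $printing = 0;
    my $next_section;    # undefined until the SAP heading has been seen
    while (my $line = <$fh>) {
        if ($line =~ /^\(([0-9]+)\) .*Significant Accounting Policies/) {
            $printing     = 1;
            $next_section = $1 + 1;
        }
        elsif (defined $next_section and $line =~ /^\($next_section\)\s/) {
            $printing = 0;
        }
        push @kept, $line if $printing;
    }
    close $fh;
    return @kept;
}

# Write the thread's sample data to a temporary file standing in for "somefile".
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} "(1) Preface\n",
    "What is this all about.\n",
    "(2) Summary of Significant Accounting Policies\n",
    "Revenue Recognition\n",
    "Revenue is recognized at the time goods are sold and shipped.\n",
    "(3) Long-term Debt\n";
close $tmp_fh;

my @kept = sap_lines_from_file($tmp_path);
print @kept;
```

        For a real run, the temporary file would be replaced by the path of the actual filing.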
Re: Making a variable from a number in front of a string
by Marshall (Canon) on Jul 15, 2016 at 16:09 UTC
    I was wondering if it is even necessary to calculate the next section number? The code below starts with the "Significant Accounting Policies" section (doesn't really care about the number) and captures up to but not including the next line that starts with "(x" where x is a number.

    See the Monk Tutorial "Flipin good, or a total flop?" for an explanation of the flip-flop operator and how I excluded the ending point, the "(2" line.

    #!/usr/bin/perl
    use warnings;
    use strict;

    while (<DATA>) {
        #use flip flop operator and exclude ending point
        print if (/^\s*\(\d+.+Significant Accounting Policies/.../^\s*\(\d+/) =~ /^\d+$/ ;
    }

    __DATA__
    (1) Summary of Significant Accounting Policies
    Revenue Recognition
    Revenue is recognized at the time goods are sold and shipped.
    (2) Long-term Debt
    (3) something else
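    To see why matching the flip-flop's result against /^\d+$/ drops the ending line, it may help to print what the operator returns for each line. This is a small sketch using the same two patterns and the same sample lines; the `@lines` and `@values` arrays are names introduced only for the illustration.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Sample lines copied from the example data in this reply.
my @lines = (
    "(1) Summary of Significant Accounting Policies\n",
    "Revenue Recognition\n",
    "Revenue is recognized at the time goods are sold and shipped.\n",
    "(2) Long-term Debt\n",
    "(3) something else\n",
);

my @values;    # what the flip-flop returned for each line
for (@lines) {
    # Scalar assignment puts ... in scalar context, so it acts as a flip-flop.
    my $state = /^\s*\(\d+.+Significant Accounting Policies/ ... /^\s*\(\d+/;
    push @values, $state;
    printf "%-4s| %s", $state, $_;
}
```

    The operator returns a sequence number while the range is active (1, 2, 3), appends "E0" on the line that ends the range (4E0), and returns an empty string outside the range. Only the plain numbers match /^\d+$/, so the "(2" line and everything after it are not printed.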
    Update: Another way to code this:
    #!/usr/bin/perl
    use warnings;
    use strict;

    while (<DATA>) {
        if (/^\s*\(\d+\).+Significant Accounting Policies/) {
            print; #this is the SAP heading line
            print while (defined ($_ = <DATA>) and $_ !~ /^\s*\(\d+\)/);
        }
    }
    __DATA__