in reply to Re: Perl Possibilities
in thread Perl Possibilities

As the data is already in a fairly tabular format, and in HTML, I would use HTML::TableExtract to get at the table data. With the data in hand, it should be easy to extract the vote recommendations by checking whether FOR or AGAINST appears in the relevant column.
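A minimal sketch of that approach, assuming the filing contains a table whose header cells include "Proposal" and "Recommendation". The sample HTML, the header names, and the helper vote_recommendations are made up for illustration; real filings will differ.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical sample; real filings will use other headers and layouts.
my $html = <<'HTML';
<table>
  <tr><th>Proposal</th><th>Board Recommendation</th></tr>
  <tr><td>Election of directors</td><td>FOR each nominee</td></tr>
  <tr><td>Shareholder proposal on lobbying</td><td>AGAINST</td></tr>
</table>
HTML

# Returns a list of [ proposal, vote ] pairs for every row whose
# recommendation column contains FOR or AGAINST.
sub vote_recommendations {
    my ($html) = @_;

    # Header strings are matched as case-insensitive patterns, so
    # 'Recommendation' also matches 'Board Recommendation'.
    my $te = HTML::TableExtract->new(
        headers => [ 'Proposal', 'Recommendation' ],
    );
    $te->parse($html);
    $te->eof;

    my @found;
    for my $table ( $te->tables ) {
        for my $row ( $table->rows ) {
            my ( $proposal, $rec ) = map { defined $_ ? $_ : '' } @$row;
            my ($vote) = $rec =~ /\b(FOR|AGAINST)\b/;
            push @found, [ $proposal, $vote ] if defined $vote;
        }
    }
    return @found;
}

printf "%-40s %s\n", @$_ for vote_recommendations($html);
```

Rows where neither word appears are skipped silently here; in practice you would probably want to log them.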

Re^3: Perl Possibilities
by Gideau (Novice) on Mar 16, 2016 at 12:57 UTC
    You're right about that. The problem, however, is that as far as I know very few companies use a table like the one in the example, which clearly states the proposals and their recommendations.

    Furthermore, I've already downloaded quite a few filings for testing purposes, and they end up being in .txt file however still formatted in html (so you see all the html code in the .txt surrounding the actual text). Would you say it's smarter to keep the .txt or convert back to .html before I do the extraction scripts?

      It won't really matter, as the relevant data basically stays the same. You will then face the problem of actually associating the proposal title with the proposed vote.

      I would look at trying to write a program that can handle some/most of the filings and that will submit the rest of the filings for a human to decide.
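One way to sketch that split, using only core Perl. It assumes you have already isolated the recommendation text for a single proposal. The function name triage and the 'auto'/'manual' labels are invented for this example.

```perl
use strict;
use warnings;

# Decide whether a recommendation snippet can be handled automatically.
# Returns ('auto', $vote) when exactly one of FOR/AGAINST appears,
# and ('manual', undef) when neither or both appear.
sub triage {
    my ($text) = @_;
    my %seen;
    $seen{$1}++ while $text =~ /\b(FOR|AGAINST)\b/g;
    my @votes = sort keys %seen;
    return ( 'auto', $votes[0] ) if @votes == 1;
    return ( 'manual', undef );
}

# Hypothetical snippets, one per proposal.
my @snippets = (
    'The Board recommends a vote FOR this proposal.',
    'The Board recommends a vote AGAINST this proposal.',
    'Vote FOR items 1 and 2 and AGAINST item 3.',
    'The Board makes no recommendation.',
);

for my $snippet (@snippets) {
    my ( $status, $vote ) = triage($snippet);
    printf "%-6s %-8s %s\n", $status, defined $vote ? $vote : '-', $snippet;
}
```

Anything marked 'manual' goes into a list for a human to review, so the program never has to guess.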

      "...they end up being in .txt file however still formatted in html (so you see all the html code in the .txt surrounding the actual text...."

      This shouts "I haven't bothered to understand either html or the various meanings of 'text'." The last word in the quote above uses "text" in the sense of "textual content." The references to ".txt" refer to a file format; in this case, a document (something.html) that is composed of ASCII or UTF-8 characters.

      Since you say the html markup ("code") is still present (visible), you'll almost certainly have html formatted files by merely changing the file extension from .txt to .htm.

      But you've asked quite enough questions [1] that reflect an utter lack of personal effort. This is going to be your project; your thesis; and your future; not ours. So build a good foundation by taking the trouble to understand at least the basics of the relevant technology (and, as has already been suggested, understand how, when and why to seek help here).

      [1] To wit, Re: Perl Possibilities, Re^3: Perl Possibilities, Re: Perl Possibilities, where the link leaves the tedium of finding the material to which you refer (ANNUAL MEETING PROPOSALS) to the Monk seeking to help. The observations here also apply to the node to which this is addressed (Re^3: Perl Possibilities), and my point is that those who seek the benefit of Monks' effort should maximize their own beforehand.


      ++$anecdote ne $data


        Since you say the html markup ("code") is still present (visible), you'll almost certainly have html formatted files by merely changing the file extension from .txt to .htm.

        You clearly didn't click the OP's link - it actually is an HTML file surrounded by some other markup-like text, contained within a .txt file.
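If that is the layout, renaming the file is not enough, but the embedded HTML can be cut out before parsing. A rough sketch, assuming the HTML document is delimited by an ordinary <html> ... </html> pair. The wrapper tags in the sample and the helper name extract_html are hypothetical.

```perl
use strict;
use warnings;

# Return the first <html>...</html> span found in the raw text,
# or undef if there is none. Matching is case-insensitive and
# spans newlines.
sub extract_html {
    my ($raw) = @_;
    return $raw =~ m{(<html\b.*?</html\s*>)}is ? $1 : undef;
}

# Hypothetical wrapper; the real surrounding markup may look different.
my $raw = <<'TXT';
<DOCUMENT>
<TYPE>EXAMPLE
<TEXT>
<HTML><body><table><tr><td>FOR</td></tr></table></body></HTML>
</TEXT>
</DOCUMENT>
TXT

my $html = extract_html($raw);
print defined $html ? "$html\n" : "no HTML found\n";
```

The extracted string can then be handed to an HTML parser as usual. A file holding several embedded documents would need a loop with the /g modifier instead of a single match.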