Hopefully you can install modules, or the modules you need are already there. Note that you can bundle your code together in a number of ways (this was recently discussed here), so the modules aren't strictly needed on the server: you can ship them with whatever you have, and they will either live alongside your code or get installed, whichever way you go.

CSV: Text::CSV_XS. Separating things by commas sounds straightforward (join ',', @list), but there are edge cases (fields that themselves contain commas, quotes, or newlines) that will crop up and make you throw your hands up in frustration. Text::CSV_XS handles those cases for you, both for parsing and for writing.
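To make the edge cases concrete, here is a minimal sketch, assuming Text::CSV_XS is installed. The helper names csv_line and csv_fields and the sample data are mine, purely for illustration.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# binary => 1 lets fields contain embedded newlines and other odd bytes.
my $csv = Text::CSV_XS->new({ binary => 1 });

# Hypothetical helper: turn a list of fields into one CSV record.
sub csv_line {
    my @fields = @_;
    $csv->combine(@fields) or die "Cannot combine fields into a CSV record";
    return $csv->string;
}

# Hypothetical helper: split one CSV record back into its fields.
sub csv_fields {
    my ($line) = @_;
    $csv->parse($line) or die "Cannot parse CSV record";
    return $csv->fields;
}

# Made-up sample data: four fields holding a quote, a comma, and a newline.
my @row = ('widget', 'He said "hi"', 'a, b', "line1\nline2");

# The naive approach breaks: splitting the joined string gives 5 pieces, not 4.
my @naive = split /,/, join(',', @row);
print "naive split: ", scalar(@naive), " fields\n";

# The module quotes and escapes as needed, so the round trip is lossless.
my @back = csv_fields(csv_line(@row));
print "Text::CSV_XS: ", scalar(@back), " fields\n";
```

For whole files you would more likely use the module's getline and print methods on a filehandle, but the quoting rules are the same.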

Parsing HTML: if the files are well-formed XHTML, I prefer XML::Twig; if not, check CPAN for HTML parsers. Again, you probably could use a regex to search for your meta names, but unless the tags are always in exactly the same format (all on one line, always with name before content, neither of which is required by HTML), it will be painful. Let a module do the heavy lifting for you: you should be able to ask for the meta element with a name attribute of 'description', and then ask for the value of its 'content' attribute.
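As a rough sketch of the XML::Twig route, assuming the module is installed and the input really is well-formed (parse dies otherwise). The function name meta_content and the sample markup are made up for the example.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: return the content attribute of the first
# <meta> element whose name attribute matches $name, or undef.
sub meta_content {
    my ($xhtml, $name) = @_;
    my $found;
    my $twig = XML::Twig->new(
        twig_handlers => {
            # Called once for every <meta> element in the document.
            meta => sub {
                my ($t, $elt) = @_;
                my $n = $elt->att('name');
                return unless defined $n && lc $n eq lc $name;
                $found = $elt->att('content') unless defined $found;
            },
        },
    );
    $twig->parse($xhtml);    # dies if the markup is not well-formed
    return $found;
}

# Made-up sample: attribute order and line breaks do not matter here.
my $page = <<'XHTML';
<html>
  <head>
    <title>Widgets</title>
    <meta content="All about widgets"
          name="description"/>
    <meta name="keywords" content="widgets, gadgets"/>
  </head>
  <body/>
</html>
XHTML

my $description = meta_content($page, 'description');
print "description: $description\n" if defined $description;
```

Note that the sample puts content before name and splits the tag over two lines, which is exactly the kind of thing a hand-rolled regex tends to trip over.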

Looking for the HTML files: File::Find, which *is* included in the perl you're using, though I've seen others prefer File::Find::Rule, which is not included in the core Perl distribution (but may be installed on your server anyway).
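A small sketch of the File::Find approach, using only core modules. The function name html_files and the choice of extensions (.html, .htm, .xhtml) are my assumptions; adjust the pattern to whatever your files are actually called.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return every HTML-looking file under $dir, sorted.
sub html_files {
    my ($dir) = @_;
    my @files;
    find(
        sub {
            # Inside the callback, $_ is the bare file name and
            # $File::Find::name is the path including the directory.
            push @files, $File::Find::name
                if -f $_ && /\.x?html?\z/i;
        },
        $dir,
    );
    my @sorted = sort @files;
    return @sorted;
}

# Usage: pass the top-level directory on the command line (default ".").
my $top = @ARGV ? $ARGV[0] : '.';
print "$_\n" for html_files($top);
```

Each path that comes back can then be read and handed to whichever parser you picked above.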

Once you have all the documentation and modules, and you have your plan on how to "distribute" your code to the server, you can pull it all together. If you write it well, I'm guessing that your code will amount to 20-50 lines. That's it. With the contents of CPAN, I can't think of another scripting language that is better suited to what you're doing.


In reply to Re: Extracting Data from a File by Tanktalus
in thread Extracting Data from a File by globaldre
