dchandler has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I have a bunch of files that contain table data. Some of them are HTML, and some of them appear to be SGML. I am not familiar with SGML, but I see tags like "page" and "C", which I guess marks columns. Are there any easy ways to extract data from SGML (is there a module like HTML::Parser for SGML)? Also, how do I identify the markup language of a file? I simply read all the files with Perl and printed them out to text files, so I've hidden the issue: now I have many text files with tables in them. So my next question is: are there easy ways to extract table data from text, or are plain ol' regular expressions the way to go? I'm fairly adept at regular expressions, but I don't want to write them needlessly if there are already modules that do this.

Thanks,

Dana

Replies are listed 'Best First'.
Re: question about extracting data tables
by b10m (Vicar) on Dec 26, 2004 at 21:52 UTC

    I tend to like HTML::TableExtract for ... ermm ... Extracting data from HTML Tables ;)

    --
    b10m

    All code is usually tested, but rarely trusted.
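
For reference, here is a minimal sketch of the HTML::TableExtract approach mentioned above. The sample HTML and the helper name extract_tables are made up for illustration. With no headers, depth, or count constraints passed to new(), the module should pick up every table in the document.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical helper: returns an array ref holding, for each table
# found in $html, an array ref of that table's rows.
sub extract_tables {
    my ($html) = @_;
    my $te = HTML::TableExtract->new();   # no constraints: match all tables
    $te->parse($html);
    $te->eof;                             # signal end of document to the parser
    return [ map { [ $_->rows ] } $te->tables ];
}

# Made-up sample input.
my $html = <<'HTML';
<html><body>
<table>
  <tr><td>Name</td><td>Qty</td></tr>
  <tr><td>apples</td><td>3</td></tr>
  <tr><td>pears</td><td>5</td></tr>
</table>
</body></html>
HTML

my $tables = extract_tables($html);
for my $rows (@$tables) {
    for my $row (@$rows) {
        # Empty cells come back undefined, so map them to ''.
        print join("\t", map { defined $_ ? $_ : '' } @$row), "\n";
    }
}
```

If the tables of interest have known column headings, passing headers => [...] to new() narrows the match to those tables and columns, which is usually less brittle than hand-written regular expressions.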
Re: question about extracting data tables
by jZed (Prior) on Dec 26, 2004 at 22:03 UTC
    You can extract tabular data from HTML files, XML files, CSV files, Fixed Width Files, and many other formats using AnyData (a tied-hash interface) or DBD::AnyData (a DBI/SQL interface) modules. For HTML tables, both use the excellent HTML::TableExtract module that b10m mentioned.
      Dumb question... is XML the same as SGML? I think these files are SGML; will AnyData work on them?
        is XML the same as SGML?

        XML is a subset of SGML. So, generally speaking, an SGML tool will probably work with XML data, but an XML tool might not work with SGML data. Your specific case might not be a problem, though, so it's worth a shot.

        Yes, as revdiablo said, XML is a subset of SGML, and HTML is an application of SGML. Some SGML can be parsed with an XML parser; it all depends on the markup.
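
One way to give it "a shot" is to check whether a file happens to be well-formed XML before choosing a tool. The sketch below uses XML::Parser, whose parse() dies on the first well-formedness error. The helper name is_well_formed_xml and the sample strings (built from the "page" and "C" tags in the question) are assumptions for illustration, not taken from the actual files.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: returns 1 if $text parses as well-formed XML, else 0.
sub is_well_formed_xml {
    my ($text) = @_;
    my $parser = XML::Parser->new();
    # parse() dies on a well-formedness error, so trap it with eval.
    return eval { $parser->parse($text); 1 } ? 1 : 0;
}

# Made-up samples. The second omits end tags, which some SGML DTDs
# allow but XML never does.
my $closed   = '<page><C>1</C><C>2</C></page>';
my $unclosed = '<page><C>1<C>2</page>';

for my $sample ($closed, $unclosed) {
    printf "%-32s %s\n", $sample,
        is_well_formed_xml($sample)
            ? 'XML tools should cope'
            : 'not well-formed XML';
}
```

A file that passes this check can go straight to XML-based tools; one that fails may still be valid SGML and would need an SGML-aware parser or some preprocessing.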