TonyNY has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I need to parse an html file that has all of the data jumbled together. How do you recommend that I go about this? Regex or html parser?

I need to extract the FillDB info from the following two lines:

Information</div><br><div class="formline"><div class="formlabel">FillDB File Size Limit:</div><div class="forminput">0.0% ( 0 / 3145728 Bytes )</div></div><div class="formline"><div class="formlabel">FillDB File Count Limit:</div><div class="forminput">0.0% ( 0 / 10000 Files )</div></div><br><hr><div

thanks

Replies are listed 'Best First'.
Re: Parse html file
by marto (Cardinal) on Jul 30, 2018 at 21:46 UTC

    Even with your incomplete HTML, Mojo::DOM to the rescue:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Mojo::DOM;
    use feature 'say';

    my $html = 'Information</div><br><div class="formline"><div class="formlabel">FillDB File Size Limit:</div><div class="forminput">0.0% ( 0 / 3145728 Bytes )</div></div><div class="formline"><div class="formlabel">FillDB File Count Limit:</div><div class="forminput">0.0% ( 0 / 10000 Files)</div></div><br><hr><div';

    my $dom = Mojo::DOM->new( $html );

    for my $element ( $dom->find('.forminput')->each ){
        say $element->text;
    }

    Output:

    0.0% ( 0 / 3145728 Bytes )
    0.0% ( 0 / 10000 Files)

    If you have the file locally, the docs show how to read it in. If you are parsing a live site, you could fetch and parse it in one step using Mojo::UserAgent (for example).
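For the local-file case, here is a minimal sketch. It assumes the page has been saved to disk, and the helper names `formline_pairs` and `formline_pairs_from_file` are made up for illustration. It pairs each label with its value by walking the `formline` divs, and uses Mojo::File to read the file:

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;
use Mojo::File qw(path);

# Hypothetical helper: return [label, value] pairs, one per "formline" div.
sub formline_pairs {
    my ($html) = @_;
    my @pairs;
    for my $line ( Mojo::DOM->new($html)->find('div.formline')->each ) {
        # Skip any formline that lacks either half of the pair.
        my $label = $line->at('.formlabel') or next;
        my $value = $line->at('.forminput') or next;
        push @pairs, [ $label->text, $value->text ];
    }
    return @pairs;
}

# Hypothetical helper: same thing, reading the HTML from a saved file.
# slurp returns raw bytes; decode first if the page has non-ASCII text.
sub formline_pairs_from_file {
    my ($file) = @_;
    return formline_pairs( path($file)->slurp );
}

# Demo on the fragment from the question. For a real page, call
# formline_pairs_from_file with the path to the saved file instead.
my $sample = '<div class="formline"><div class="formlabel">FillDB File Size Limit:</div>'
           . '<div class="forminput">0.0% ( 0 / 3145728 Bytes )</div></div>'
           . '<div class="formline"><div class="formlabel">FillDB File Count Limit:</div>'
           . '<div class="forminput">0.0% ( 0 / 10000 Files )</div></div>';

say join ' ', @$_ for formline_pairs($sample);
```

Selecting on `div.formline` first keeps each label tied to its own value, which matters if the full page has other `forminput` elements elsewhere.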

    Update: Depending on the full page you may need to alter the selector somewhat. A full example would help.

Re: Parse html file
by haukex (Archbishop) on Jul 30, 2018 at 21:10 UTC
Re: Parse html file
by atcroft (Abbot) on Jul 30, 2018 at 21:16 UTC

    Honestly, the answer to that question depends on your constraints:

    • How soon do you need this data?
    • Is this a one-off, or will you need to utilize this more than this one time? (As someone's signature here said, "Makeshifts last longest.")
    • How complex is your actual data? (Is it just this, or was this an SSCCE?)
    • How comfortable do you feel working with a new-to-you module?

    A regex is probably the quick-and-dirty way, but it is also the most likely to break if the data changes. There are several Tutorials here on modules designed to work with HTML (and several others on modules designed for XML, which may also be of use).
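To make the quick-and-dirty option concrete, here is a rough sketch using only core Perl. The helper name `fill_db_limits` is made up, and the pattern assumes the markup looks exactly like the fragment in the question (same class names, same attribute order, label immediately followed by its value), which is precisely why it is fragile:

```perl
use strict;
use warnings;

# Sample input copied from the question.
my $html = '<div class="formline"><div class="formlabel">FillDB File Size Limit:</div>'
         . '<div class="forminput">0.0% ( 0 / 3145728 Bytes )</div></div>'
         . '<div class="formline"><div class="formlabel">FillDB File Count Limit:</div>'
         . '<div class="forminput">0.0% ( 0 / 10000 Files )</div></div>';

# Hypothetical helper: return [label, value] pairs for every label that
# starts with "FillDB". Breaks if attributes are reordered or added.
sub fill_db_limits {
    my ($text) = @_;
    my @pairs;
    while ( $text =~ m{
        <div\s+class="formlabel">\s*(FillDB[^<:]*):\s*</div>   # label text
        \s*<div\s+class="forminput">\s*([^<]*?)\s*</div>       # value text
    }gx ) {
        push @pairs, [ $1, $2 ];
    }
    return @pairs;
}

print "$_->[0]: $_->[1]\n" for fill_db_limits($html);
```

This is fine for a one-off, but any change to the page layout means revisiting the pattern.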

    Given the time, were it me I would consider looking at those various Tutorials and experimenting with the modules shown there, but as I said, it depends on your criteria.

    Hope that helps.

Re: Parse html file
by davido (Cardinal) on Jul 30, 2018 at 21:21 UTC

    So you only have an HTML fragment to parse from, and not the entire HTML document?

    It's easier if you have a well-formed doc.


    Dave

      I have an actual file; I was just providing a fragment of it.

        Ok, then almost certainly the easiest approach is a parsing library. Regexes might seem easiest until the input becomes complex in ways not anticipated by the regex. Once you've drilled down to the point in the HTML document you want, using a parsing library, you can then resort to a regex to pull the appropriate information from the target portion of the document obtained from the parser.
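As a sketch of that second step: once a parser has handed back the text of one `forminput` element, a small regex can split it into fields. The helper name `parse_limit` and the field names are made up for illustration, and the pattern assumes the value always has the "percent ( used / limit unit )" shape seen in the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: split a value such as
# "0.09% ( 2772 / 3145728 Bytes )" into its parts.
# Returns a hash reference, or nothing if the text does not match.
sub parse_limit {
    my ($text) = @_;
    my ( $percent, $used, $limit, $unit ) = $text =~ m{
        ([\d.]+) \s* %          # percentage used
        \s* \( \s* (\d+)        # current amount
        \s* / \s* (\d+)         # configured limit
        \s* (\w+) \s* \)        # unit, Bytes or Files
    }x or return;
    return { percent => $percent, used => $used, limit => $limit, unit => $unit };
}

# The string here stands in for whatever the parsing library returned.
my $info = parse_limit('0.09% ( 2772 / 3145728 Bytes )');
printf "%s of %s %s used (%s%%)\n", @{$info}{qw(used limit unit percent)}
    if $info;
```

Because the regex only ever sees a short, already-isolated string, it stays simple and is much less likely to be surprised by the rest of the page.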


        Dave

      Hi davido,

      This is what I need to extract from the html file:

      Relay Status Information
      FillDB File Size Limit: 0.09% ( 2772 / 3145728 Bytes )
      FillDB File Count Limit: 0.01% ( 1 / 10000 Files )

      Here is a sample of the html file:

      div><div class="settingsectionbody" style="display: none"><ul><li>_BESRelay_PostFile_ChunkSize: 0</li><li>_BESRelay_PostFile_ComputerFolderCount: 100</li><li>_BESRelay_PostFile_ThrottleKBPS: 0</li><li>_BESRelay_PostFile_TimeoutSeconds: 300</li><li>_BESRelay_UploadManager_BufferDirectoryMaxCount: 10000</li><li>_BESRelay_UploadManager_BufferDirectoryMaxSize: 1073741824</li><li>_BESRelay_UploadManager_CompressedFileMaxSize: 20971520</li><li>_BESRelay_UploadManager_ChunkSize: not applicable on root server</li><li>_BESRelay_UploadManager_ThrottleKBPS: not applicable on root server</li></ul></div></div><hr><div class="sectiontitle">Relay Status Information</div><br><div class="formline"><div class="formlabel">FillDB File Size Limit:</div><div class="forminput">0.0% ( 0 / 3145728 Bytes )</div></div><div class="formline"><div class="formlabel">FillDB File Count Limit:</div><div class="forminput">0.0% ( 0 / 10000 Files )</div></div><br><hr><div class="sectiontitle">Console User Information</div><br><a href="/data/login">

      My work environment is very strict so I am very limited in what modules can be installed.

      Thanks

        I see you have HTML::TreeBuilder installed. This is one way you can use that:

        Update: changed slightly to avoid errors

        use HTML::TreeBuilder;

        # $file holds the path to the saved HTML page
        my $tree = HTML::TreeBuilder->new;
        $tree->parse_file($file);
        $tree->eof;

        # each "formline" div wraps one label/value pair
        my @divs = $tree->find_by_attribute('class','formline');
        for my $div (@divs) {
            my $label_div = $div->look_down('class','formlabel') or next;
            my $label = $label_div->as_text;
            my $input_div = $div->look_down('class','forminput') or next;
            my $input = $input_div->as_text;
            print "$label $input\n";
        }