in reply to Parse html file

So you only have an HTML fragment to parse from, and not the entire HTML document?

It's easier if you have a well-formed doc.


Dave

Re^2: Parse html file
by TonyNY (Beadle) on Jul 30, 2018 at 21:24 UTC
    I have an actual file; I was just providing a fragment of it.

      Ok, then almost certainly the easiest approach is a parsing library. Regexes might seem easiest until the input becomes complex in ways the regex did not anticipate. Once you've used a parsing library to drill down to the part of the HTML document you want, you can then resort to a regex to pull the appropriate information from that portion of the document.
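      As a rough sketch of that second step, assume the parser has already handed back the text of one target element. The string below is taken from the sample values in this thread, and the variable names are made up for illustration. A small regex can then split it into its parts using core Perl only:

```perl
use strict;
use warnings;

# Hypothetical input: the text of one target element, as a parser's
# text-extraction call might return it (value taken from this thread).
my $text = '0.09% ( 2772 / 3145728 Bytes )';

# Capture the percentage, the current value, the limit, and the unit.
# The /x flag lets the pattern be spaced out for readability.
my ($percent, $used, $limit, $unit) =
    $text =~ m{ ([\d.]+) % \s* \( \s* (\d+) \s* / \s* (\d+) \s+ (\w+) \s* \) }x;

if (defined $percent) {
    print "percent=$percent used=$used limit=$limit unit=$unit\n";
}
else {
    print "no match\n";
}
```

      The regex only has to cope with one short, predictable string here, which is the situation where regexes tend to hold up well.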


      Dave

        I was able to use the HTML::TreeBuilder module to get some structure to the output.

        use HTML::TreeBuilder;

        # Parse all of the contents of $file.
        my $parser = HTML::TreeBuilder->new ();
        $parser->parse_file ($file);

        # Now display the contents of $parser.
        recurse ($parser, 0);
        exit;

        # This displays the contents of $node and any children it may
        # have. The variable $depth is the indentation used.
        sub recurse {
            my ($node, $depth) = @_;

            # Print indentation according to the level of recursion.
            print " " x $depth;

            # If $node is a reference, then it is an HTML::Element.
            if (ref $node) {
                # Print the tag associated with $node, for example "html" or
                # "li".
                print $node->tag (), "\n";

                # $node->content_list () returns a list of child nodes of
                # $node, which we store in @children.
                my @children = $node->content_list ();
                for my $child_node (@children) {
                    recurse ($child_node, $depth + 1);
                }
            }
            else {
                # If $node is not a reference, then it is just a piece of text
                # from the HTML file.
                print $node, "\n";
            }
        }

        How can I extract the data from the following tags?

        div
         div
          FillDB File Size Limit:
         div
          0.0% ( 0 / 3145728 Bytes )
        div
         div
          FillDB File Count Limit:
         div
          0.0% ( 0 / 10000 Files )
Re^2: Parse html file
by TonyNY (Beadle) on Sep 24, 2018 at 12:31 UTC

    Hi davido,

    This is what I need to extract from the html file:

    Relay Status Information
    FillDB File Size Limit: 0.09% ( 2772 / 3145728 Bytes )
    FillDB File Count Limit: 0.01% ( 1 / 10000 Files )

    Here is a sample of the html file:

    div><div class="settingsectionbody" style="display: none"><ul><li>_BESRelay_PostFile_ChunkSize: 0</li><li>_BESRelay_PostFile_ComputerFolderCount: 100</li><li>_BESRelay_PostFile_ThrottleKBPS: 0</li><li>_BESRelay_PostFile_TimeoutSeconds: 300</li><li>_BESRelay_UploadManager_BufferDirectoryMaxCount: 10000</li><li>_BESRelay_UploadManager_BufferDirectoryMaxSize: 1073741824</li><li>_BESRelay_UploadManager_CompressedFileMaxSize: 20971520</li><li>_BESRelay_UploadManager_ChunkSize: not applicable on root server</li><li>_BESRelay_UploadManager_ThrottleKBPS: not applicable on root server</li></ul></div></div><hr><div class="sectiontitle">Relay Status Information</div><br><div class="formline"><div class="formlabel">FillDB File Size Limit:</div><div class="forminput">0.0% ( 0 / 3145728 Bytes )</div></div><div class="formline"><div class="formlabel">FillDB File Count Limit:</div><div class="forminput">0.0% ( 0 / 10000 Files )</div></div><br><hr><div class="sectiontitle">Console User Information</div><br><a href="/data/login">

    My work environment is very strict so I am very limited in what modules can be installed.

    Thanks

      I see you have HTML::TreeBuilder installed. This is one way you can use that:

      Update: changed slightly to avoid errors

      use HTML::TreeBuilder;

      my $tree = HTML::TreeBuilder->new;
      $tree->parse_file($file);
      $tree->eof;

      # Each "formline" div holds one label/value pair.
      my @divs = $tree->find_by_attribute('class','formline');
      for my $div (@divs) {
          # Skip any formline that lacks a label or a value.
          my $label_div = $div->look_down('class','formlabel') or next;
          my $label = $label_div->as_text;
          my $input_div = $div->look_down('class','forminput') or next;
          my $input = $input_div->as_text;
          print "$label $input\n";
      }

        Thanks tangent, your solution pulled the data.

        Any idea what this error is referring to?

        Can't call method "as_text" on an undefined value
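
        A likely cause, assuming the loop was run without the "or next" guards: look_down returns undef in scalar context when no element matches, and calling as_text on that undef produces exactly this message. The sketch below reproduces the error with a plain undef standing in for a failed lookup, so it needs no modules. The variable names and the fallback string are illustrative only.

```perl
use strict;
use warnings;

# Stand-in for a failed lookup: look_down returns undef in scalar context
# when nothing matches, so $input_div is simulated as undef here.
my $input_div = undef;

# Unguarded call: dies with the "Can't call method" message.
my $ok    = eval { $input_div->as_text; 1 };
my $error = $ok ? '' : $@;
print "caught: $error";

# Guarded version: check that the element exists before calling the method.
my $input = defined $input_div ? $input_div->as_text : '(no forminput found)';
print "$input\n";
```

        In the real loop the same check is what "or next" does: it skips a formline div that has no matching child instead of dying on it.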