Re^2: Parse html file

Re^3: Parse html file by davido (Cardinal) on Jul 30, 2018 at 21:59 UTC
Ok, then almost certainly the easiest approach is a parsing library. Regexes might seem easiest until the input becomes complex in ways the regex did not anticipate. Once you have used a parsing library to drill down to the part of the HTML document you want, you can then use a regex to pull the relevant information out of that portion.

Dave
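A minimal sketch of that two-step approach, assuming HTML::TreeBuilder as the parsing library. The markup, the `class` names, and the numbers are made up for illustration and are not taken from the poster's file.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical fragment: the real page's markup is not shown in the thread.
my $html = <<'HTML';
<div id="status">
  <div class="label">Disk usage:</div>
  <div class="value">12.5% ( 125 / 1000 Bytes )</div>
</div>
HTML

# Step 1: let the parser locate the element of interest.
my $tree  = HTML::TreeBuilder->new_from_content($html);
my $value = $tree->look_down( _tag => 'div', class => 'value' );
my $text  = $value ? $value->as_trimmed_text : '';

# Step 2: run a small regex against the already-isolated text only.
my ( $percent, $used, $limit );
if ( $text =~ m{([\d.]+)%\s*\(\s*(\d+)\s*/\s*(\d+)} ) {
    ( $percent, $used, $limit ) = ( $1, $2, $3 );
    print "percent=$percent used=$used limit=$limit\n";
}

$tree->delete;    # release the parse tree when done
```

The regex only ever sees one short string, so it does not need to cope with nesting, attributes, or changes elsewhere in the page.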
Re^4: Parse html file by TonyNY (Beadle) on Aug 01, 2018 at 14:15 UTC
I was able to use the HTML::TreeBuilder module to get some structure to the output.

    # Parse all of the contents of $file.
    my $parser = HTML::TreeBuilder->new ();
    $parser->parse_file ($file);

    # Now display the contents of $parser.
    recurse ($parser, 0);
    exit;

    # This displays the contents of $node and any children it may
    # have. The variable $depth is the indentation used.
    sub recurse
    {
        my ($node, $depth) = @_;

        # Print indentation according to the level of recursion.
        print " " x $depth;

        # If $node is a reference, then it is an HTML::Element.
        if (ref $node) {
            # Print the tag associated with $node, for example "html" or
            # "li".
            print $node->tag (), "\n";

            # $node->content_list () returns a list of child nodes of
            # $node, which we store in @children.
            my @children = $node->content_list ();
            for my $child_node (@children) {
                recurse ($child_node, $depth + 1);
            }
        }
        else {
            # If $node is not a reference, then it is just a piece of text
            # from the HTML file.
            print $node, "\n";
        }
    }

How can I extract the data from the following tags?

    div
    div
    FillDB File Size Limit:
    div
    0.0% ( 0 / 3145728 Bytes )
    div
    div
    FillDB File Count Limit:
    div
    0.0% ( 0 / 10000 Files )
Re^5: Parse html file by marto (Cardinal) on Aug 01, 2018 at 14:28 UTC
This worked, even on the fragment. If you really wanted to capture 'FillDB File Size Limit:' as well, it would be trivial to add the required code.
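For illustration only, here is one way the label and its numbers could be captured together with HTML::TreeBuilder. This is not the code from the reply above. The markup is an assumption guessed from the tag dump earlier in the thread (an outer div holding a label div followed by a value div), and `%limits` is a hypothetical name.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Assumed markup, reconstructed from the tag dump; the real file may differ.
my $html = <<'HTML';
<div><div>FillDB File Size Limit:</div><div>0.0% ( 0 / 3145728 Bytes )</div></div>
<div><div>FillDB File Count Limit:</div><div>0.0% ( 0 / 10000 Files )</div></div>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my %limits;

# Find every div whose text ends in "Limit:"; the outer divs do not match
# because their combined text ends with the value, not the label.
my @labels = $tree->look_down(
    _tag => 'div',
    sub { $_[0]->as_trimmed_text =~ /Limit:\z/ },
);

for my $label (@labels) {
    # The value is assumed to be the next sibling element; skip bare text nodes.
    my ($value) = grep { ref } $label->right;
    next unless $value;

    my $name = $label->as_trimmed_text;
    $name =~ s/:\z//;

    # Pull percentage, used amount, limit, and unit out of the value text.
    if ( $value->as_trimmed_text =~ m{([\d.]+)%\s*\(\s*(\d+)\s*/\s*(\d+)\s*(\w+)} ) {
        $limits{$name} = { percent => $1, used => $2, limit => $3, unit => $4 };
    }
}

for my $name ( sort keys %limits ) {
    my $l = $limits{$name};
    print "$name: $l->{used} of $l->{limit} $l->{unit} ($l->{percent}%)\n";
}

$tree->delete;    # release the parse tree when done
```

If the real page nests the divs differently, only the step that finds the value element from the label should need to change.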
Re^6: Parse html file by TonyNY (Beadle) on Aug 01, 2018 at 14:41 UTC
Re^7: Parse html file by marto (Cardinal) on Aug 17, 2018 at 06:22 UTC
Re^7: Parse html file by Anonymous Monk on Aug 17, 2018 at 00:02 UTC
Some notes below your chosen depth have not been shown here