Re^4: How to parse not closed HTML tags that don't have any attributes? (updated)

Many thanks haukex!

That it's not reproducible is due to my own terrible incompetence. :-) I had tried to modify your example for the phone/fax section so that it would put these pairs into %hash:

{"Street name" => "Sample Street",
"House number" => "123",
"ZIP Code" => "45678",
"City name" => "Randomcity"}

Despite everything I tried, I only managed to get the string "Sample Street 12345678 Randomcity" into one of the fields, leaving the other one empty, like:

{"Sample Street 12345678 Randomcity" => ""}

I guess my main mistake was to assume that it's necessary to start out from the $dom all over again, for each and every HTML element. The crazy idea I had was to somehow grab "Sample Street 123" into one variable (starting from the "address" element), and "45678 Randomcity" into another, by somehow targeting, and starting from, the first <br/> element after the "address" element.

I'm still not sure why my <br/> always got stripped away, maybe because of my misunderstanding of how the "map" works:

use warnings;
use strict;
use Mojo::DOM;
use Mojo::Util qw/trim/;
use Data::Dump;

my $dom = Mojo::DOM->new(<<'HTML');
  <div class="address">
          <div class="icon"></div>
          <address>
            Sample Street 123<br/>45678 Randomcity          </address>
        </div>
HTML

my %hash_address = @{ $dom->find('address')->map(sub {
return ( trim($_->text), "This_is_the_address_content" ) }) };
dd \%hash_address;

__END__

{
  "Sample Street 12345678 Randomcity" => "This_is_the_address_content",
}

Your solution is very elegant indeed, many thanks! :-)

Re^5: How to parse not closed HTML tags that don't have any attributes? by haukex (Archbishop) on Mar 07, 2021 at 17:47 UTC
> I'm still not sure why my `<br/>` always got stripped away,

In the code you show above, `->find('address')` is finding the `<address>` element, and then inside the `->map(sub { ... })`, `$_` is referring to that element, of which `$_->text` is getting only the text content, hence the missing `<br/>`. In the code I showed two nodes above, first I'm getting the `<address>` element into `$addr`, which preserves the document's structure, replacing the `<br/>`, and only then using `->text` to get the text content.

> I guess my main mistake was to assume that it's necessary to start out from the $dom all over again, for each and every HTML element.

`->find` will use whatever node you call it on as the context, so it depends on what part of the document you want to search and where in the document the nodes you're looking for can occur.

> The crazy idea I had was to somehow grab "Sample Street 123" into one variable (starting from the "address" element), and "45678 Randomcity" into another, by somehow targeting, and starting from, the first `<br/>` element after the "address" element.

It's possible, sure - in the Document Object Model, the `<address>` element has three children: a text node `"Sample Street 123"`, the `<br/>` element, and another text node `"45678 Randomcity"` - you'll see this if you try looking at `$addr->child_nodes`.

But I think this goes back to what I was saying about example code being brittle if written based on too few examples, and writing lots of test cases: so far, you've only shown two snippets of data out of what you said are 10,000 *.html files. So for example, marto's code makes the assumption that the phone and fax will always be the 2nd and 4th `<p>`s, respectively, my code here makes the assumption that it's always the next node after the `<p class="title">` that will contain the data (and that there are no double keys in the hash, and one or two other assumptions), my code here assumes that any element of `class="address"` contains only one `<address>` element that we're interested in, my code here assumes that the `<p>`s in elements of `class="phone"` are always in key+value pairs, and so on.

My suggestions would be for you to first survey your input files, and see how much variation there is, so that you can boil it down to a representative set of test cases, and to code defensively, i.e. testing all of the assumptions I named above. Here's what that could look like:

use warnings;
use strict;
use Mojo::DOM;
use Mojo::Util qw/trim/;

# this sub should really be in its own package for modularity
sub get_data {
    my $html = shift;
    my %data;
    my $dom = Mojo::DOM->new($html);

    my $addr = $dom->find('.address address');
    # could add some conditionals here
    # in case there are separate fields for street / city / zip etc.
    die "Didn't find exactly one address" unless @$addr==1;
    $addr = $addr->first;
    $addr->find('br')->map('replace',"\n");
    $data{address} = { Address => trim( $addr->text ) };

    my $phone = $dom->find('.phone p');
    die "Didn't find an even number of elements in phone" if @$phone%2;
    while (@$phone) {
        my $key = trim( shift(@$phone)->text );
        die "Duplicate key '$key' in phone data"
            if exists $data{phone}{$key};
        $data{phone}{$key} = trim( shift(@$phone)->text );
    }

    return \%data;
}

use Test::More;

is_deeply get_data(<<'HTML'),
<div class="address">
  <div class="icon"></div>
  <address>
    Sample Street 123<br/>45678 Randomcity          </address>
</div>
<div class="phone">
  <div class="icon"></div>
  <p class="title">Telephone</p>
  <p>0123-4 56 78 90
  <p class="title">Telefax</p>
  <p>
</div>
HTML
    {
        address => { Address => "Sample Street 123\n45678 Randomcity" },
        phone => { Telephone => "0123-4 56 78 90", Telefax => "" },
    };

# TODO: many more test cases here

done_testing;
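For the `child_nodes` route mentioned above, a minimal sketch might look like the following. It assumes that the `<address>` element always contains exactly two non-empty text nodes separated by a single `<br/>`, which the one sample shown suggests but which has not been checked against the other files. The variable names `$street` and `$city` are made up for illustration.

```perl
use warnings;
use strict;
use Mojo::DOM;
use Mojo::Util qw/trim/;

my $dom = Mojo::DOM->new(<<'HTML');
<address>
  Sample Street 123<br/>45678 Randomcity   </address>
HTML

my $addr = $dom->at('address')
    or die "No <address> element found";

# Walk the direct children: keep the text nodes, skip the <br/>
# (whose type is 'tag'), and drop anything that trims to empty.
my @lines = grep { length }
            map  { trim( $_->content ) }
            grep { $_->type eq 'text' }
            $addr->child_nodes->each;

# Assumption: exactly two text nodes. Fail loudly otherwise.
die "Expected two text nodes, got ".@lines unless @lines == 2;

my ($street, $city) = @lines;
print "street: $street\n";
print "city:   $city\n";
```

The `die` on an unexpected node count is in the same spirit as the checks in `get_data`: if one of the 10,000 files has a third line or no `<br/>` at all, this stops instead of silently putting the wrong text in a column.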
Re^6: How to parse not closed HTML tags that don't have any attributes? by Rantanplan (Novice) on Mar 08, 2021 at 14:39 UTC
Many thanks haukex!

From what I've seen there isn't really a lot of variation in the input files. Since the project needed to be finished two days ago, I'm currently concentrating on getting something into a CSV as quickly as possible. Later on, if there's still time, I can go back to making things more reliable.

Your code is really great, and I've been (probably very noobishly) able to add other fields:

{
  Address => { city => "Randomcity", street_and_nr => "SampleStreet 123", zip => "45678" },
  Company => { companyname => "Randomcompany" },
  Phone   => { Telefax => "", Telephone => "0123-4 56 78 90" },
}

In your code, this is a hash, which gets returned from the subroutine as a pointer to a hash. If I understand correctly, inside the hash are three hashes ("Address", "Company" and "Phone"). Text::CSV however needs an array reference in order to work.

I've spent about six hours today trying to learn about arrays, hashes, nested hashes, references to hashes etc., and to figure out a way to get Text::CSV running, by "unwrapping" the reference to a hash of hashes, getting a reference for each of the three included hashes, turning each of these hashes into an array, combining the arrays into one array, getting a reference to this array, and then calling Text::CSV with this reference. :-)

Wouldn't it be much quicker to throw the data inside the subroutine not into a hash of hashes, but directly into a single, flat array instead, and return a reference to that array?
Re^7: How to parse not closed HTML tags that don't have any attributes? by haukex (Archbishop) on Mar 08, 2021 at 16:06 UTC
> In your code, this is a hash, which gets returned from the subroutine as a pointer to a hash. If I understand correctly, inside the hash are three hashes ("Address", "Company" and "Phone").

Yes, that's correct, though in Perl we call them "references" instead of "pointers" (one of the differences being they're automatically memory-managed and garbage collected, with the exception of circular references). The full technical description is that `sub get_data` returns a reference to the hash `%data`, a hash that is newly allocated for each call to the `sub`, and whose values are references to other anonymous hashes. This is also called a "hash of hashes" or HoH, though the data structures can get arbitrarily complex.

> learn about arrays, hashes, nested hashes, references to hashes etc.

Further reading: perldata, the Perl Data Structures Cookbook (perldsc) and perlreftut.

> Wouldn't it be much quicker to throw data inside the subroutine not into a hash of hashes, but directly into a single, not deep array instead, and return a reference to that array?

Sure, that would certainly be an option. Personally I just like retaining as much information from the original data as possible, this usually allows for much easier future enhancements. For example, keeping the structure means you could easily also dump the data to JSON.

> figure out a way to get Text::CSV running, by "unwrapping" the reference to a hash of hashes, getting a reference for each of the three included hashes, turning each of these hashes into an array, combining the arrays into one array, getting a reference to this array, and then calling Text::CSV with this reference.

One option of several to make the dereferencing a little easier might be Data::Diver.

use warnings;
use strict;
use Data::Diver qw/Dive/;
use Text::CSV;

my @data = (
    {
        Company => { companyname => "Randomcompany" },
        Address => { city => "Randomcity",
            street_and_nr => "SampleStreet 123", zip => "45678" },
        Phone => { Telephone => "0123-4 56 78 90" },
    },
    {
        Company => { companyname => "Other Company" },
        Address => { address => "Someplace 42\n12345 City" },
        Phone => { Telefax => "333", Telephone => "+1 234 567 8900" },
    }
);

my $csv = Text::CSV->new({binary=>1, auto_diag=>2, eol=>$/ });

$csv->print(select, ['Company','Address','Phone','Fax']);
for my $rec (@data) {
    my $addr = Dive($rec, 'Address', 'address')
        || Dive($rec, 'Address', 'street_and_nr')
            ."\n".Dive($rec, 'Address', 'zip')
            ." ".Dive($rec, 'Address', 'city');
    $addr =~ s/\n/, /g;
    my @cols = (
        scalar Dive($rec, 'Company', 'companyname'),
        $addr,
        scalar Dive($rec, 'Phone', 'Telephone'),
        scalar Dive($rec, 'Phone', 'Telefax'),
    );
    $csv->print(select, \@cols);
}

__END__

Company,Address,Phone,Fax
Randomcompany,"SampleStreet 123, 45678 Randomcity","0123-4 56 78 90",
"Other Company","Someplace 42, 12345 City","+1 234 567 8900",333

Note the reason I use `scalar` is because `Dive` is documented to return an empty list if it doesn't find anything, and the empty list interpolated into an array means that the following elements of the array would shift down accordingly. `scalar` forces a single return value, e.g. `undef`, so that this doesn't happen. It's not needed for `$addr` because that's already a scalar variable.

In `$csv->print(select, \@cols)`, `select` gets the current default output handle, usually `STDOUT`, but you could just as well pass a filehandle here to write to an output file (see "open" Best Practices).
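To see the effect of `scalar` in isolation, here is a small core-Perl sketch. `lookup` is a hypothetical stand-in written only for this illustration (it is not part of Data::Diver); it mimics the documented behaviour of returning an empty list when nothing is found.

```perl
use warnings;
use strict;

# Hypothetical helper: returns the value, or an empty list if the
# key is missing. In scalar context the empty list becomes undef.
sub lookup {
    my ($hash, $key) = @_;
    return exists $hash->{$key} ? $hash->{$key} : ();
}

my %phone = ( Telefax => "333" );    # note: no Telephone entry

# List context: the missing Telephone contributes nothing, so the
# fax number slides into the Telephone column.
my @shifted = (
    "Company",
    lookup(\%phone, 'Telephone'),
    lookup(\%phone, 'Telefax'),
);

# Scalar context: the missing Telephone becomes undef and the
# columns stay aligned.
my @aligned = (
    "Company",
    scalar lookup(\%phone, 'Telephone'),
    scalar lookup(\%phone, 'Telefax'),
);

print scalar(@shifted), " columns without scalar\n";   # prints 2
print scalar(@aligned), " columns with scalar\n";      # prints 3
```

With CSV output this matters because each row has to have the same number of fields in the same order as the header row.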
Re^8: How to parse not closed HTML tags that don't have any attributes? by Rantanplan (Novice) on Mar 09, 2021 at 13:39 UTC