getmizanur has asked for the wisdom of the Perl Monks concerning the following question:

Here is a script I wrote to fetch the contents of a particular news article using the HTML::TreeBuilder::XPath module and produce an XML file. However, I can't get it to return the HTML tags when using the findvalue method.

#!/usr/bin/perl -w
use HTML::LinkExtor;
use LWP::Simple;
use HTML::TreeBuilder::XPath;
use Term::ProgressBar;

my $url = "http://www.totalpolitics.com/blog/463546/senior-medics-call-on-uk-to-stay-but-brexiteers-have-more-to-cheer.thtml";
my $content = get $url;

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $title = $tree->findvalue(q{//div[@id="article"]/h1});
my $body = $tree->findvalue(q{//div[@class="article-body"]});
my $author = $tree->findnodes(q{//div[@class="article-body"]/p/strong});
$author = $author->[0]->getValue;
$body =~ s/$author//;

my $xml .= '<?xml version="1.0" encoding="UTF-8" ?>';
$xml .= '<nodes>';
$xml .= '<node>';
$xml .= '<url>';
$xml .= $url;
$xml .= '</url>';
$xml .= '<title>';
$xml .= $title;
$xml .= '</title>';
$xml .= '<description>';
$xml .= "<![CDATA[$body]]>";
$xml .= '</description>';
$xml .= '<author>';
$xml .= $author;
$xml .= '</author>';
$xml .= "</node>\n";
$xml .= "</nodes>";

print $xml;

What is happening? There are no "<p>" tags in the description field.

<?xml version="1.0" encoding="UTF-8"?>
<nodes>
<node>
<url>http://www.totalpolitics.com/blog/463546/senior-medics-call-on-uk-to-stay-but-brexiteers-have-more-to-cheer.thtml</url>
<title>Senior medics call on UK to stay - but Brexiteers have more to cheer</title>
<description><![CDATA[The referendum battle steps up today with the Remain camp offering up a consortium of medics – but other interventions suggest the Brexiteers may have more to cheer in the weeks to come.A group of 188 clinicians, academics and public health leaders have written to the Times claiming the NHS would be in jeopardy if the UK were to leave the EU, losing access to “finances, staffing and exchanges”. They added:“As health professionals and researchers we write to highlight the valuable benefits of continued EU membership to the NHS, medical innovation and UK public health. We have made enormous progress over decades in international health research, health services innovation and public health. Much of this is built around shared policies and capacity across the EU.”Britain Stronger In’s decision to use medics is reflective of recent polling by Ipsos Mori, which shows 89% trust doctors to tell the truth, making them the most trusted profession in the UK. Politicians languish at the bottom of the league table on 21%.However the intervention may pale into insignificance after Brexiteers won an important battle to reveal the true number of European migrants working in Britain. The number of migrants with active national insurance (NI) numbers will now be released just weeks before the referendum.According to existing data, about 800,000 EU migrants have moved to the UK in the past four years. However over the same period about 2m EU migrants have been issued with NI numbers.Campaigners have long been calling upon the government to release figures for the number of people with active NI numbers, which they say will provide a more accurate gauge of migration levels than existing official figures.Andrew Tyrie, chairman of the Treasury select committee, said: “This has been obtained as a result of a good deal of persistence … Late, but a good deal better than never. I recognise that HMRC may have encountered some difficulties. So I am glad that they have found a way of resolving them.”The decision comes as the Sun reports an extraordinary outburst from David Cameron as he returned from Washington this weekend.Asked whether the prime minister was so distracted by the referendum campaign that he had taken his eye off the ball, leading to a poorly-received budget and the crisis in the steel industry, he replied:“I think you all spend too much time looking at each other’s newspapers. The world hasn't stopped turning, the Government hasn't stopped operating. You all go around setting each others' hair on fire and getting very excited about this, but it's all a lot of processology.”But new polling out today suggests the Leave campaign has the more compelling set of arguments over Europe. The Fabian Society research shows how an initial four-point leave for Remain among likely voters turns into a two-point lead for Leave once people heard both sides of the story.Research found that the two most important issues to deciding how people vote in the referendum are immigration and controlling our laws – and on both issues the public are far more convinced that leaving the EU will help solve the problem.@Tom_SmithardTotal Politics has a free weekly Friday email bulletin. Follow this link to register.]]></description>
<author>Photo: Lynne Cameron / PA Wire / Press Association Images</author>
</node>
</nodes>

What would I like to see in the XML file? I want the "<p>" tags to be kept in the description field.
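
One possible direction, offered as a sketch rather than a definitive fix: findvalue returns only the text content of the matched nodes, so the markup is lost. findnodes returns HTML::Element objects, and each of those can be serialized with its tags via as_HTML. The inline $content below is a hypothetical stand-in for the fetched page (so the example runs without network access), and the p[not(strong)] filter is an assumption about how the author/credit paragraph could be skipped; the real page may need a different expression.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical stand-in for the page returned by LWP::Simple::get.
my $content = <<'HTML';
<html><body>
<div id="article"><h1>Sample title</h1></div>
<div class="article-body">
<p><strong>Photo: Example credit</strong></p>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);

# findvalue() yields the concatenated text only: no tags survive.
my $text_only = $tree->findvalue(q{//div[@class="article-body"]});

# findnodes() in list context yields HTML::Element objects.
# Assumption: paragraphs containing <strong> are the credit line, so skip them.
my @paragraphs = $tree->findnodes(q{//div[@class="article-body"]/p[not(strong)]});

# as_HTML($entities, $indent_char, \%optional_end_tags):
# the empty hashref asks for every closing tag (including </p>) to be emitted.
my $body = join "\n", map { $_->as_HTML(q{<>&}, undef, {}) } @paragraphs;

print "$body\n";

# Free the tree explicitly, as the HTML::TreeBuilder documentation recommends.
$tree->delete;
```

The resulting $body could then be placed inside the CDATA section in place of the findvalue result. If well-formed XML fragments are preferred over HTML serialization, HTML::Element also provides as_XML.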

Re: Getting HTML tag using Perl HTML::TreeBuilder::Xpath module
by Anonymous Monk on Apr 13, 2016 at 14:51 UTC
Re: Getting HTML tag using Perl HTML::TreeBuilder::Xpath module
by Anonymous Monk on Apr 13, 2016 at 13:41 UTC