
I'm banging my head against the wall on this one, and I don't understand why I'm getting these results.

I have a script I wrote that grabs an XML feed from a news site, extracts <link>, <pubDate> and <title> from the feed (via XML::Simple), follows the link referenced in the news feed to the original article, and then pulls the content out of the body of the article.

As part of the "final article" body extraction, I'm also trying to pull the author's name out of the HTML content itself, using a fairly simple regex.

While testing this, my regex stopped working, and I tried to debug it by writing the contents of $html to a local file, and examining that file.

What I have looks like this, for the relevant section:

  my $req   = HTTP::Request->new(GET => $link) or die $!;
  my $res   = $ua->request($req);
  my $html  = $res->content;

  # write_file() comes from File::Slurp
  # $item_id is the article ID extracted from <link>
  write_file($item_id,  {binmode => ':raw' }, $html);

  # Original source string looks like: 
  # <a href="http://news.example.com/?author=John_Smith">John Smith</a>
  my ($other, $author) = $html =~ /\?author=(.*?)">(.*)<\/a>/;

  # $author is blank, empty here, why?
  print "AUTHOR: $author\n";

  my $new_html = read_file($item_id);

  my ($n_other, $n_author) = $new_html =~ /\?author=(.*?)">(.*)<\/a>/;

  # Now $n_author contains the right name, "Mike Smith" for
  # example.
  print "AUTHOR: $n_author\n";

The problem I'm having is that when I read the remote content into $html via $res->content and try to extract $author from it, the match fails.

When I write $html to disk, then IMMEDIATELY read that same physical file back from disk into a new scalar ($new_html above), and then run the same exact regex across it, it works fine.

WHY?!

In reply to A regex on the same content fails and works, with conditions by hacker
