in reply to youtube parser/scrabber

The one complaining is JSON::MaybeXS, because it is asked to decode the following:

{args: {raw_player_response:window.ytplayer.bootstrapPlayerResponse} }; if(window.ytcsi)window.ytcsi.tick("cfg",null,"")}

That could be the wrong response, signifying an outdated scraper (likely). On the other hand, it does not look like valid JSON to me, but I am not a JSON expert. The first part is JSON-like and can be fixed by quoting all strings (no?). The rest looks like broken JavaScript.

It will give you a nice excuse to avoid the seaside and open up a terminal window to crack it ...

bw, bliako

EDIT: here is what lies in line 298:

    sub _get_args {
        my ($self, $content) = @_;

        my $data;
        for my $line (split "\n", $content) {
            next unless $line;
            if ($line =~ /the uploader has not made this video available in your country/i) {
                croak 'Video not available in your country';
            }
            # The following regex looks like it is asking for trouble
            # memo-to-self: can't parse javascript with regex...
            elsif ($line =~ /^.+ytplayer\.config\s*=\s*(\{.*})/) {
                print STDERR "BBBBB: |||$1|||\n";
                ($data, undef) = JSON->new->utf8(1)->decode_prefix($1); # <<< 298
                last;
            }
        }

        croak 'failed to extract JSON data' unless $data->{args};
        return $data->{args};
    }

Re^2: youtube parser/scrabber
by bliako (Abbot) on Aug 19, 2021 at 12:47 UTC

    I have added a ; at the end of said regex and now have this: ^.+ytplayer\.config\s*=\s*(\{.*?};)
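
    To make the difference between the two captures concrete, here is a minimal sketch. The sample line is my own reconstruction, pieced together from the fragments quoted in this thread, so treat it as an assumption and not as the exact page content:

```perl
use strict;
use warnings;

# Hypothetical sample line, reconstructed from the fragments quoted in this thread.
my $line = 'if(createPlayer){ if(window.ytplayer.bootstrapPlayerResponse){ '
         . 'window.ytplayer.config={args:{raw_player_response:window.ytplayer.bootstrapPlayerResponse}}; '
         . 'if(window.ytcsi)window.ytcsi.tick("cfg",null,"")}}';

# Original regex: the greedy .* runs to the LAST '}' on the line,
# so the capture drags the trailing JavaScript along.
my ($greedy) = $line =~ /^.+ytplayer\.config\s*=\s*(\{.*})/;

# Modified regex: the non-greedy .*? stops at the FIRST '};'.
my ($lazy) = $line =~ /^.+ytplayer\.config\s*=\s*(\{.*?};)/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

    The non-greedy version only works as long as no `};` appears earlier inside the object itself, which is one more reason not to trust a regex with this job.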

    For this particular use case the above regex extracts just the JSON-like part, although JSON's decode_prefix() would ignore any trailing non-JSON content (e.g. the JavaScript I mentioned) anyway. Now, regarding the problem of unquoted keys and values: there is an allow_barekey() option to the JSON parser which allows keys to be left unquoted.
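
    A short sketch of both features, assuming the pure-Perl JSON::PP backend (which ships with Perl and supports both decode_prefix() and allow_barekey()). The input strings are made up for illustration:

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical input: valid JSON followed by trailing JavaScript.
my $text = '{"args":{"a":1}}; if(window.ytcsi)window.ytcsi.tick("cfg",null,"")';

# decode_prefix() decodes the first JSON value it finds and also
# returns how many characters it consumed; the rest is left alone.
my ($data, $consumed) = JSON::PP->new->decode_prefix($text);
print "a = $data->{args}{a}, consumed = $consumed\n";

# allow_barekey() accepts unquoted keys; the values must still be valid JSON.
my $bare = JSON::PP->new->allow_barekey(1)->decode('{args:{a:1}}');
print "bare a = $bare->{args}{a}\n";
```

    Whether other backends behind JSON::MaybeXS accept allow_barekey() is something to check against their documentation before relying on it.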

    You still need to deal with the remaining problem of unquoted values. Unquoted values may be indicative of a much bigger problem: that the values in the "JSON" (which is actually a JavaScript object literal) are function calls, other objects, variables etc.! For example, this is the line that _get_args() looks for:

    if(createPlayer){ if(window.ytplayer.bootstrapPlayerResponse){ window.ytplayer.config={args:{raw_player_response:window.ytplayer.bootstrapPlayerResponse}}; ...

    There is a reason why it is unquoted I think ...
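
    As a quick check of that point, here is a sketch (again assuming JSON::PP) which feeds the object from the line above to a decoder that tolerates bare keys. The value is a JavaScript expression, not a JSON value, so I expect the decode to fail regardless:

```perl
use strict;
use warnings;
use JSON::PP;

# The object as it appears in the page: bare keys AND a bare JavaScript expression as value.
my $js = '{args:{raw_player_response:window.ytplayer.bootstrapPlayerResponse}}';

# allow_barekey() only relaxes the rule for keys; decode() croaks on the value,
# so wrap it in eval and keep the error message.
my $data  = eval { JSON::PP->new->allow_barekey(1)->decode($js) };
my $error = $@;

if (defined $data) {
    print "decoded\n";
}
else {
    print "decode failed: $error";
}
```

    Quoting the value by hand would make it parse, but you would only get the string "window.ytplayer.bootstrapPlayerResponse" back, not the player response it refers to.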

    So yes, the scraper looks outdated (even though it was updated very recently) and you are better off using something else.

    bw, bliako