PerlMonks  

youtube parser/scrabber

by igoryonya (Pilgrim)
on Aug 18, 2021 at 09:52 UTC ( [id://11135919]=perlquestion )

igoryonya has asked for the wisdom of the Perl Monks concerning the following question:

I tried to use WWW::YouTube::Download, but it seems to be broken. I wanted to use it to gather some information from a YouTube URL, at the very least:
  1. title
  2. list of all available video and audio quality URLs
but I can't even do that. I get an error:

'"' expected, at character offset 1 (before "args:{raw_player_res...") at /usr/local/share/perl/5.30.0/WWW/YouTube/Download.pm line 298.

I was going to provide a code sample, but then I found out that this module comes with a utility, youtube-download. Running:

youtube-download 'https://www.youtube.com/watch?v=MzZ8IaYkf7M'

produces exactly the same error message as above.

Are there other modules for gathering useful data from a YouTube URL or its page content that are not broken or outdated, and that don't depend on youtube-dl or any other external utility?

Replies are listed 'Best First'.
Re: youtube parser/scrabber
by Corion (Patriarch) on Aug 18, 2021 at 11:35 UTC

    I would look at what requests youtube-dl sends and replicate them. youtube-dl itself has an option to output all the data it scrapes as JSON, but if you don't want it as an external dependency, that's out.
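For readers who can live with the external dependency, here is a minimal sketch of the JSON route. It assumes youtube-dl's `--dump-json` option and the field names `title`, `formats`, `format_id` and `url` in its output, which may change between versions. The helper names `fetch_info_json` and `summarize_info` are made up for illustration, and the sample JSON is invented.

```perl
use strict;
use warnings;
use JSON::PP;    # core module since Perl 5.14

# Run `youtube-dl --dump-json URL` and return its raw JSON output.
# List-form open avoids passing the URL through a shell.
sub fetch_info_json {
    my ($url) = @_;
    open(my $fh, '-|', 'youtube-dl', '--dump-json', $url)
        or die "cannot run youtube-dl: $!";
    local $/;    # slurp mode
    my $json = <$fh>;
    close($fh) or die "youtube-dl failed (exit status $?)";
    return $json;
}

# Reduce the decoded structure to a title and a list of format entries.
# The field names used here are assumptions about youtube-dl's output.
sub summarize_info {
    my ($json_text) = @_;
    my $info    = JSON::PP->new->utf8->decode($json_text);
    my @formats = map { +{ id => $_->{format_id}, url => $_->{url} } }
                  @{ $info->{formats} || [] };
    return { title => $info->{title}, formats => \@formats };
}

# Invented sample input, so the sketch runs without youtube-dl installed.
my $sample  = '{"title":"Example","formats":[{"format_id":"18","url":"https://example.invalid/v18"}]}';
my $summary = summarize_info($sample);
print "$summary->{title}\n";
print "$_->{id}: $_->{url}\n" for @{ $summary->{formats} };
```

With youtube-dl installed, `summarize_info(fetch_info_json($url))` would be the real call.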

    jwz maintains youtubedown, which you can look at to find what/how it scrapes the information. At one time I looked at converting that to a module to make it accessible from other programs, but that wasn't as easy as I thought either.

      It's a big one: 145KB without the comments!
      And it supports more than just YouTube.
      A lot of work was put into it!
Re: youtube parser/scrabber
by bliako (Monsignor) on Aug 18, 2021 at 13:57 UTC

    The one complaining is JSON::MaybeXS, because it is asked to decode the following:

    {args: {raw_player_response:window.ytplayer.bootstrapPlayerResponse} }; if(window.ytcsi)window.ytcsi.tick("cfg",null,"")}

    That could be the wrong response, signifying an outdated scraper (likely). On the other hand, it looks like invalid JSON to me, but I am not a JSON expert. The first part is nearly JSON and can be fixed by quoting all strings (no?). The rest looks like broken JavaScript.
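A small sketch of the point above, using the core JSON::PP module as a stand-in for whichever backend JSON::MaybeXS picks (the exact error text may differ between backends). The helper name `try_decode` is made up. It shows that the scraped object is rejected, and that quoting everything by hand makes it parse but leaves only a variable name as the value.

```perl
use strict;
use warnings;
use JSON::PP;    # core module, used here in place of JSON::MaybeXS

# Return the decoded data, or undef if the text is not valid JSON.
sub try_decode {
    my ($text) = @_;
    my $data = eval { JSON::PP->new->decode($text) };
    return $data;
}

# The object part of the captured text, as it appears in the page.
my $as_scraped = '{args: {raw_player_response:window.ytplayer.bootstrapPlayerResponse} }';

# The same text with every key and value quoted by hand.
my $hand_quoted = '{"args": {"raw_player_response":"window.ytplayer.bootstrapPlayerResponse"} }';

print defined try_decode($as_scraped) ? "scraped: decoded\n" : "scraped: rejected\n";

my $data = try_decode($hand_quoted);
# This decodes, but the value is only the *name* of a JavaScript variable,
# not the player data itself.
print "quoted: $data->{args}{raw_player_response}\n";
```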

    It will give you a nice excuse to avoid the seaside and open up a terminal window to crack it ...

    bw, bliako

    #EDIT: here is what lies in line 298

        sub _get_args {
            my ($self, $content) = @_;

            my $data;
            for my $line (split "\n", $content) {
                next unless $line;
                if ($line =~ /the uploader has not made this video available in your country/i) {
                    croak 'Video not available in your country';
                }
                # The following regex looks like it is asking for trouble
                # memo-to-self: can't parse javascript with regex...
                elsif ($line =~ /^.+ytplayer\.config\s*=\s*(\{.*})/) {
                    print STDERR "BBBBB: |||$1|||\n";
                    ($data, undef) = JSON->new->utf8(1)->decode_prefix($1); # <<< 298
                    last;
                }
            }

            croak 'failed to extract JSON data' unless $data->{args};

            return $data->{args};
        }

      I have added a ; at the end of said regex and now have this: ^.+ytplayer\.config\s*=\s*(\{.*?};)
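To see what the change does, here is a minimal sketch that applies both patterns to one line. The sample line is an assumption, pieced together from the fragments quoted in this thread, and the helper names `capture_greedy` and `capture_lazy` are made up.

```perl
use strict;
use warnings;

# Capture the text assigned to ytplayer.config using the original pattern,
# which runs to the last "}" on the line.
sub capture_greedy {
    my ($line) = @_;
    return $line =~ /^.+ytplayer\.config\s*=\s*(\{.*})/ ? $1 : undef;
}

# Same, with the modified pattern that stops at the first "};".
sub capture_lazy {
    my ($line) = @_;
    return $line =~ /^.+ytplayer\.config\s*=\s*(\{.*?};)/ ? $1 : undef;
}

# Assumed sample line, assembled from the snippets shown in this thread.
my $line = 'if(createPlayer){ window.ytplayer.config = {args: {raw_player_response:window.ytplayer.bootstrapPlayerResponse} }; if(window.ytcsi)window.ytcsi.tick("cfg",null,"")}';

print "greedy: ", capture_greedy($line), "\n";
print "lazy:   ", capture_lazy($line),   "\n";
```

The lazy pattern is still fragile: it would stop too early if any string value inside the object happened to contain `};`.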

      For this particular use case the above regex extracts the JSON, although JSON's decode_prefix() ignores any trailing non-JSON content anyway (e.g. the JavaScript I mentioned). Now, regarding the problem of unquoted keys and values: there is an allow_barekey() option to the JSON parser (documented as a JSON::PP feature) which allows keys to be unquoted.
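A minimal sketch of those two features together, using the core JSON::PP module directly. The input is made up (bare keys, properly quoted values, then trailing JavaScript), and the helper name `decode_leading_object` is hypothetical.

```perl
use strict;
use warnings;
use JSON::PP;    # core module; allow_barekey is a JSON::PP option

# Decode the leading JSON-like object of $text, tolerating unquoted keys.
# Returns the data and the number of characters the parser consumed.
sub decode_leading_object {
    my ($text) = @_;
    my ($data, $consumed) = JSON::PP->new->allow_barekey->decode_prefix($text);
    return ($data, $consumed);
}

# Made-up input: bare keys, valid JSON values, trailing JavaScript.
my $text = '{args: {title: "demo", length_seconds: 42}}; if(window.ytcsi)window.ytcsi.tick("cfg",null,"")}';

my ($data, $consumed) = decode_leading_object($text);
print "title: $data->{args}{title}\n";
print "rest:  ", substr($text, $consumed), "\n";
```

This only works because every value in the made-up input is valid JSON; bare keys are the only deviation it tolerates.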

      You still need to deal with the remaining problem of unquoted values. Unquoted values may indicate a much bigger problem: that the values in the "JSON" (which is actually a JavaScript object literal) are function calls, other hash values, variables, etc.! For example, this is the line that _get_args() looks for:

      if(createPlayer){ if(window.ytplayer.bootstrapPlayerResponse){ window.ytplayer.config={args:{raw_player_response:window.ytplayer.bootstrapPlayerResponse}}; ...

      There is a reason why it is unquoted I think ...
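A short sketch of why bare keys alone do not rescue this line, again with JSON::PP. The helper name `decode_with_barekeys` is made up. The parser accepts the unquoted keys but gives up at the value, which is a JavaScript expression rather than data.

```perl
use strict;
use warnings;
use JSON::PP;    # core module; allow_barekey is a JSON::PP option

# Try to decode $text with bare keys allowed.
# Returns (data, undef) on success or (undef, error message) on failure.
sub decode_with_barekeys {
    my ($text) = @_;
    my ($data) = eval { JSON::PP->new->allow_barekey->decode_prefix($text) };
    return defined $data ? ($data, undef) : (undef, $@);
}

# The object as it appears in the line quoted above.
my $scraped = '{args:{raw_player_response:window.ytplayer.bootstrapPlayerResponse}};';

my ($data, $error) = decode_with_barekeys($scraped);
print defined $data ? "decoded\n" : "failed: $error";
```

The real data lives in whatever `window.ytplayer.bootstrapPlayerResponse` refers to elsewhere in the page, so no amount of quoting this line will recover it.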

      So yes, the scraper looks outdated (though very recently updated), and you are better off using something else.

      bw, bliako

Re: youtube parser/scrabber
by Anonymous Monk on Aug 18, 2021 at 13:09 UTC
