youtube parser/scrabber

igoryonya has asked for the wisdom of the Perl Monks concerning the following question:

I tried to use WWW::YouTube::Download, but it seems to be broken. I wanted to use it to gather some information from a YouTube URL, at the very least:

title
list of all available video and audio quality URLs

but I can't do even that. I get an error:

'"' expected, at character offset 1 (before "args:{raw_player_res...") at /usr/local/share/perl/5.30.0/WWW/YouTube/Download.pm line 298.

I was going to provide a sample of my code, but then I found out that this module comes with a utility, youtube-download. Running:

youtube-download 'https://www.youtube.com/watch?v=MzZ8IaYkf7M'

produces exactly the same error message as above.

Are there other modules that can gather useful data from a YouTube URL or page, that are not broken or outdated, and that don't depend on youtube-dl or any other external utility?


Replies are listed 'Best First'.
Re: youtube parser/scrabber by Corion (Patriarch) on Aug 18, 2021 at 11:35 UTC
I would look at what requests `youtube-dl` sends and replicate those. `youtube-dl` itself has an option to output all the data it scrapes as JSON, but if you don't want it as an external dependency, that's out. jwz maintains youtubedown, which you can read to find out what information it scrapes and how. At one time I looked at converting that into a module to make it accessible from other programs, but that wasn't as easy as I thought either.
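If the external dependency were acceptable after all, the JSON route mentioned above could be sketched roughly as follows. This is only an illustration: `--dump-json` is the youtube-dl option for JSON output, and `title`, `formats`, `format_id`, `ext` and `url` are fields of the record it prints, but the helper names `fetch_info` and `summarize_info` and the sample record are made up here, and JSON::PP is used only because it ships with Perl.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical helper: reduce a decoded youtube-dl record to the title
# plus one small hash per available format.
sub summarize_info {
    my ($info) = @_;
    my @formats = map {
        +{ id => $_->{format_id}, ext => $_->{ext}, url => $_->{url} }
    } @{ $info->{formats} || [] };
    return { title => $info->{title}, formats => \@formats };
}

# Hypothetical helper: run youtube-dl for one URL and decode its JSON output.
# Needs youtube-dl on the PATH; it is not called below.
sub fetch_info {
    my ($url) = @_;
    open(my $fh, '-|', 'youtube-dl', '--dump-json', $url)
        or die "cannot run youtube-dl: $!";
    my $text = do { local $/; <$fh> };
    close($fh) or die "youtube-dl failed (exit status $?)";
    return JSON::PP->new->utf8(1)->decode($text);
}

# Made-up sample record, so the sketch runs without network access.
my $sample = JSON::PP->new->utf8(1)->decode(
    '{"title":"Sample title","formats":'
  . '[{"format_id":"18","ext":"mp4","url":"https://example.invalid/v"}]}'
);

my $summary = summarize_info($sample);
print "$summary->{title}\n";
print "$_->{id} ($_->{ext}): $_->{url}\n" for @{ $summary->{formats} };
```

With a real video, `summarize_info(fetch_info($url))` would replace the sample record.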
Re^2: youtube parser/scrabber by igoryonya (Pilgrim) on Aug 18, 2021 at 13:08 UTC
Big one - 145KB without the comments! And it supports more than just YouTube. A lot of work was put into it!
Re: youtube parser/scrabber by bliako (Monsignor) on Aug 18, 2021 at 13:57 UTC
The one complaining is JSON::MaybeXS, because it is asked to decode the following:

    {args: {raw_player_response:window.ytplayer.bootstrapPlayerResponse} }; if(window.ytcsi)window.ytcsi.tick("cfg",null,"")}

That could be the wrong response, which would point to an outdated scraper (likely). On the other hand, it looks like invalid JSON to me, though I am not a JSON expert. The first part is nearly JSON and could be fixed by quoting all the strings (no?). The rest looks like broken JavaScript.

It will give you a nice excuse to avoid the seaside and open up a terminal window to crack it ...

bw, bliako

#EDIT: here is what lies at line 298 (the print to STDERR is my own debugging line):

    sub _get_args {
        my ($self, $content) = @_;

        my $data;
        for my $line (split "\n", $content) {
            next unless $line;
            if ($line =~ /the uploader has not made this video available in your country/i) {
                croak 'Video not available in your country';
            }
            # The following regex looks like it is asking for trouble
            # memo-to-self: can't parse javascript with regex...
            elsif ($line =~ /^.+ytplayer\.config\s=\s(\{.*})/) {
                print STDERR "BBBBB: \|\|\|$1\|\|\|\n";
                ($data, undef) = JSON->new->utf8(1)->decode_prefix($1); # <<< 298
                last;
            }
        }

        croak 'failed to extract JSON data' unless $data->{args};

        return $data->{args};
    }
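The behaviour described in this reply can be reproduced in isolation with a small sketch. It uses JSON::PP, the pure-Perl parser that ships with Perl, as a stand-in; the module's actual backend may differ, so the exact wording of the error may differ too. The first input is a made-up piece of valid JSON followed by JavaScript, the second is the text quoted above.

```perl
use strict;
use warnings;
use JSON::PP;

my $json = JSON::PP->new->utf8(1);

# Made-up example: valid JSON followed by JavaScript. decode_prefix() stops
# after the JSON value and reports how many characters it consumed.
my $ok_text = '{"args":{"title":"demo"}}; if(window.ytcsi)window.ytcsi.tick("cfg",null,"")}';
my ($ok_data, $consumed) = $json->decode_prefix($ok_text);

# The text the scraper captured: unquoted keys and an unquoted value.
my $bad_text = '{args: {raw_player_response:window.ytplayer.bootstrapPlayerResponse} }; '
             . 'if(window.ytcsi)window.ytcsi.tick("cfg",null,"")}';
my $bad_data = eval { ($json->decode_prefix($bad_text))[0] };
my $error    = $@;

print "parsed prefix: ", substr($ok_text, 0, $consumed), "\n";
print "title: $ok_data->{args}{title}\n";
print "decode of captured text failed: $error" if $error;
```

So the trailing JavaScript is not the problem for `decode_prefix()`; the unquoted key at the very start of the captured text is where the parser gives up.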
Re^2: youtube parser/scrabber by bliako (Monsignor) on Aug 19, 2021 at 12:47 UTC
I have added a `;` at the end of said regex, made the match non-greedy, and now have this: `^.+ytplayer\.config\s=\s(\{.*?};)`

For this particular use case the above regex extracts just the "JSON" part. That said, JSON's `decode_prefix()` ignores any trailing non-JSON content anyway (e.g. the JavaScript I mentioned).

Now, regarding the problem of unquoted keys and values: the JSON parser has an `allow_barekey()` option which allows keys to be unquoted. That still leaves the problem of unquoted values. Unquoted values may be indicative of a much bigger problem: the values in the "JSON" (which is actually a JavaScript object literal) can be function calls, other objects, variables etc.! For example, this is the line that `_get_args()` looks for:

    if(createPlayer){ if(window.ytplayer.bootstrapPlayerResponse){ window.ytplayer.config={args:{raw_player_response:window.ytplayer.bootstrapPlayerResponse}}; ...

There is a reason why it is unquoted, I think ...

So yes, the scraper looks outdated (though it was updated very recently) and you are better off using something else.

bw, bliako
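A minimal sketch of both points, again assuming JSON::PP as the parser (it documents `allow_barekey`; other JSON backends may not offer it). The input line is assembled from the text quoted earlier in the thread with a `window.` prefix added so that the regex has something to match before `ytplayer`, and the `"placeholder"` value is made up for contrast.

```perl
use strict;
use warnings;
use JSON::PP;

# Assumed input line, built from the text quoted earlier in the thread.
my $line = 'window.ytplayer.config = {args: {raw_player_response:window.ytplayer.bootstrapPlayerResponse} }; '
         . 'if(window.ytcsi)window.ytcsi.tick("cfg",null,"")}';

# The amended, non-greedy regex: capture up to the first "};".
my ($captured) = $line =~ /^.+ytplayer\.config\s=\s(\{.*?};)/;

my $lenient = JSON::PP->new->utf8(1)->allow_barekey(1);

# Bare keys with a quoted (made-up) value: accepted once allow_barekey is on.
my ($quoted) = $lenient->decode_prefix('{args: {raw_player_response: "placeholder"} };');

# Bare keys AND a bare value, as in the real page: still rejected.
my $unquoted = eval { ($lenient->decode_prefix($captured))[0] };
my $error    = $@;

print "captured: $captured\n";
print "quoted value decodes to: $quoted->{args}{raw_player_response}\n";
print "unquoted value fails: $error" if $error;
```

The bare value is a reference to a JavaScript variable, so no parser option can turn it into data; the player response would have to be located elsewhere in the page.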
Re: youtube parser/scrabber by Anonymous Monk on Aug 18, 2021 at 13:09 UTC
It's not broken ;) just outdated ;P Scrapers always have a short shelf life. The work always stalls, and it's a rare project that gathers an active community. See Re: starting flash video from script (get-flash-videos + RTMPDump)

