Here we will go through the main technologies for building a YouTube scraper and work up to an example Python script for crawling YouTube channel data.

What information can be parsed on YouTube

Any data we can see on YouTube in an ordinary browser can be parsed:

The main goals of the YouTube scraper should be strictly defined. For example, we can parse:

Widely used YouTube web scraping technologies

We choose the method for scraping YouTube data based on our goals, the expected benefits and the technical requirements. Writing Python scripts for YouTube data extraction is a widespread and flexible solution.

For Python developers, Google offers an API wrapper, google-api-python-client, that allows interacting with many Google services, including YouTube (use google-api-python-client version 1.8.0 to avoid dependency problems).
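As a rough illustration of the API route, the sketch below pages through a channel's videos with the search.list method of the YouTube Data API v3. The API key, the channel ID and the helper names (make_client, extract_video_ids, list_channel_video_ids) are placeholders introduced here for illustration, and every call made this way counts against the API quota.

```python
def make_client(api_key):
    """Build a YouTube Data API v3 client (needs google-api-python-client)."""
    # Imported lazily so the pure helpers below work without the package.
    from googleapiclient.discovery import build
    return build('youtube', 'v3', developerKey=api_key)


def extract_video_ids(response):
    """Pull the video IDs out of one search.list response page."""
    return [
        item['id']['videoId']
        for item in response.get('items', [])
        if item.get('id', {}).get('kind') == 'youtube#video'
    ]


def list_channel_video_ids(youtube, channel_id, max_pages=5):
    """Collect video IDs page by page, following nextPageToken."""
    ids = []
    page_token = None
    for _ in range(max_pages):  # hard cap so a quota is not burned by accident
        response = youtube.search().list(
            part='id',
            channelId=channel_id,
            type='video',
            maxResults=50,
            pageToken=page_token,
        ).execute()
        ids.extend(extract_video_ids(response))
        page_token = response.get('nextPageToken')
        if not page_token:
            break  # no further pages
    return ids
```

A call would look like list_channel_video_ids(make_client('YOUR_API_KEY'), 'CHANNEL_ID'), where both arguments are values you have to supply yourself.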

If our task does not require high-volume data parsing, we can write, for example, a YouTube comment scraper with Selenium, a browser automation tool that has Python bindings.

Frequent difficulties in scraping elements on YouTube

In short, the difficulties of scraping YouTube are the following:

Elegant Python example to scrape YouTube channel videos

One possible solution is to call the YouTube AJAX service directly and get a response with JSON data. Let’s imagine that we want to scrape YouTube videos and need to get the list of videos from a channel, knowing only its URL.


Our first step is to obtain a valid token. After setting all the necessary header parameters, we send a GET request to our channel URL (https://www.youtube.com/c/TED/videos). The token is stored inside a script on the page, so we have to split the page contents to extract it.

import requests

# full link to the channel's videos page (the TED channel is used as an example)
channel_url = 'https://www.youtube.com/c/TED/videos'

# request the channel page to get the 80-character token
headers = {
    'user-agent': 'some valid UA',
    'x-youtube-client-name': '1',
    'x-youtube-client-version': '2.20200429.03.00',
}
token_page = requests.get(channel_url, headers=headers)
# the token sits between this JSON key and the next '","' in the page source
token = token_page.text.split('"nextContinuationData":{"continuation":"')[1].split('","')[0]
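The split-based lookup raises an IndexError when the marker is missing from the page. A slightly more defensive variant is sketched below; the function name extract_token and the constant TOKEN_MARKER are introduced here for illustration only, and the sketch assumes the token itself never contains a double quote.

```python
TOKEN_MARKER = '"nextContinuationData":{"continuation":"'


def extract_token(page_text):
    """Return the continuation token from the page source, or None."""
    _, found, tail = page_text.partition(TOKEN_MARKER)
    if not found:
        # Marker absent: the page layout changed or there is no next page.
        return None
    # The token runs up to the closing quote of the JSON string value.
    return tail.split('"', 1)[0]
```

With this helper, the last line of the snippet above could become token = extract_token(token_page.text), followed by a check for None.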

Printing out token_page.text, we get a long JavaScript string:

[Image: token page]

In the middle of this string we find the token:

[Image: find the token]

Now we build the following dictionary with our token:

# build params dict for the future requests
params = {
    'ctoken': token,
    'continuation': token
}

We will fetch the data in a loop. The stop condition is the absence of a token for the next request:

# parse data while we have a token
stop = False
ids = []
while not stop:
    pass  # placeholder: the loop body is built up step by step below

Within our loop we make a POST request and parse the JSON response (the structure is a little complex, so take care not to mix up the keys):

# post request to get JSON-data
ajax_url = 'https://www.youtube.com/browse_ajax'
r = requests.post(ajax_url, params=params, headers=headers)
r_json = r.json()[1]['response']
channel_data = r_json['continuationContents']['gridContinuation']

Now the variable channel_data contains the part of the JSON with the list of videos. Before we deal with it, we need to check whether this is the last chunk of data: the indicator is the absence of a token in this part of the JSON:

# try to get the token for the next request
try:
    token = channel_data['continuations'][0]['nextContinuationData']['continuation']
    # refresh the params so that the next POST asks for the next chunk
    params = {'ctoken': token, 'continuation': token}
except (KeyError, IndexError):
    # no token in this chunk -> it was the last one, so stop requesting
    stop = True

Finally, we go through the channel_data items and collect the video IDs in the result list:

# iterate through the channel_data
for item in channel_data['items']:
    video_id = item['gridVideoRenderer']['videoId']
    ids.append(video_id)

The variable item contains the following JSON, which describes the video preview and its main info:

[Image: video preview and main info]

At the top we see the video ID, “videoId”: “8atXMqZ_w0M”, which can be used as part of the full link: https://www.youtube.com/watch?v=8atXMqZ_w0M.
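Turning a collected ID into a full link takes one line. The helper name watch_url below is an illustrative choice, not something from the scraper above:

```python
from urllib.parse import urlencode


def watch_url(video_id):
    """Build the full watch link for a video ID."""
    # urlencode escapes any character that is not safe in a query string.
    return 'https://www.youtube.com/watch?' + urlencode({'v': video_id})
```

Applied to the ID from the example, watch_url('8atXMqZ_w0M') gives the link shown above.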

With this solution we avoid the limits of the YouTube API and still work with JSON data, which is fast and easy to handle.
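The separate steps above can be assembled into one routine. The sketch below is one possible arrangement rather than the definitive script: the names collect_video_ids, make_fetcher and fetch_page, and the max_pages safety cap, are assumptions added here. The HTTP call is passed in as a function so that the pagination logic can be tried out on canned data without touching the network.

```python
def collect_video_ids(first_token, fetch_page, max_pages=100):
    """Follow continuation tokens and gather video IDs.

    fetch_page(token) is a hypothetical callable that returns the
    'gridContinuation' dict for one token (one browse_ajax POST).
    """
    ids = []
    token = first_token
    for _ in range(max_pages):  # cap the loop in case tokens never run out
        if token is None:
            break
        channel_data = fetch_page(token)
        for item in channel_data.get('items', []):
            renderer = item.get('gridVideoRenderer')
            if renderer:  # skip items that are not video previews
                ids.append(renderer['videoId'])
        try:
            token = channel_data['continuations'][0]['nextContinuationData']['continuation']
        except (KeyError, IndexError):
            token = None  # no token -> this was the last chunk
    return ids


def make_fetcher(headers):
    """Build fetch_page on top of requests, mirroring the steps above."""
    import requests  # imported lazily; only needed for live requests

    def fetch_page(token):
        r = requests.post(
            'https://www.youtube.com/browse_ajax',
            params={'ctoken': token, 'continuation': token},
            headers=headers,
        )
        r.raise_for_status()
        return r.json()[1]['response']['continuationContents']['gridContinuation']

    return fetch_page
```

With the token and headers from the earlier steps, the call would be collect_video_ids(token, make_fetcher(headers)).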

Conclusions

Like any other social network, YouTube on the one hand protects itself from frequent requests, mainly through API limits, and on the other hand gives developers good documentation and ready-to-use libraries. Even with a little knowledge of Python we can find ways to scrape YouTube data: through the API, from a browser instance with Selenium, or from the AJAX service directly.
