When it comes to Twitter data extraction, the first choice is between using a ready-made Twitter crawler and creating your own. A web scraper and a crawler differ in the size of the dataset and the depth of the data they collect.

When crawling, we iteratively find and fetch web links, starting from a list of seed URLs or search requests, to get, for example, a list of links to Twitter profiles or tweets. When scraping, we extract information from each profile or tweet on that list. The two terms are closely connected: a crawler has to scrape every page it fetches for the links that lead to the next ones, so crawling is not possible without scraping.
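The distinction can be illustrated with a small offline sketch. The `PAGES` dictionary, its URLs, and the helper names are hypothetical stand-ins for real HTTP responses; the only point is that crawling collects links, while scraping extracts fields from each fetched item.

```python
from collections import deque

# Hypothetical stand-in for the web: URL -> (links found on the page, tweet text or None).
PAGES = {
    "seed": (["tweet/1", "tweet/2"], None),
    "tweet/1": (["tweet/2", "tweet/3"], "first tweet"),
    "tweet/2": ([], "second tweet"),
    "tweet/3": (["seed"], "third tweet"),
}


def crawl(seed_urls):
    """Crawling: iteratively discover URLs, starting from the seeds."""
    seen, queue, order = set(seed_urls), deque(seed_urls), []
    while queue:
        url = queue.popleft()
        order.append(url)
        links, _ = PAGES.get(url, ([], None))
        for link in links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return order


def scrape(url):
    """Scraping: extract the data we care about from one page."""
    _, text = PAGES[url]
    return {"url": url, "text": text} if text is not None else None


# Crawl first, then scrape every discovered page and keep the ones with a tweet.
tweets = [item for item in map(scrape, crawl(["seed"])) if item]
```

In a real crawler both functions would work on downloaded HTML, but the division of labour stays the same.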

Although ready-made solutions for crawling Twitter data already exist, such as Octoparse or Twitter Scraper from Scraping Expert, in this article we will look at creating our own web spider, take a quick overview of third-party Python scripts, and talk about how to reduce the risk of being blocked by Twitter.

Python-based Twitter crawler

Packages like Tweepy or Twython give you the tools that make creating a Twitter web crawler easier, but it also helps to look at some practical solutions, especially when the author gives a brief overview of the code in a separate article.

Solution 1. Author: Amit Upreti (Towards Data Science): a nice solution with Python and Scrapy that crawls tweets by hashtag without authentication and without calling the official Twitter API. XPath is used to find tweets on the mobile version of the search results page and to iterate through the pages by following the link on the “Load older Tweets” button. You will need a special browser plugin to see this button. For Google Chrome, install, for example, User-Agent Switcher for Chrome and select “Android KitKat” in the plugin's drop-down menu; otherwise you will see the desktop version of Twitter even if you open the mobile link. With this plugin you can explore the HTML structure in DevTools and work with the same selectors as the author.
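The idea of finding tweets with XPath and following a "load older" link can be sketched as follows. This is not the author's code: the markup and the class names `tweet-text` and `more` are hypothetical, and the standard library's `xml.etree.ElementTree` is swapped in for Scrapy selectors only to keep the sketch dependency-free. It works here because the sample is well-formed XML, which real HTML pages rarely are.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real search page differs and changes over time.
PAGE = """
<html><body>
  <div class="tweet-text">Spiders are not insects #spider</div>
  <div class="tweet-text">Found one in the bath #spider</div>
  <div class="more"><a href="/search?q=%23spider&amp;next=abc">Load older Tweets</a></div>
</body></html>
"""


def parse_page(markup):
    """Return (tweet texts, URL of the next page or None)."""
    root = ET.fromstring(markup)
    # Select every tweet container on the page.
    tweets = [div.text.strip() for div in root.findall(".//div[@class='tweet-text']")]
    # Select the pagination link; it is missing on the last page.
    next_link = root.find(".//div[@class='more']/a")
    next_url = next_link.get("href") if next_link is not None else None
    return tweets, next_url


tweets, next_url = parse_page(PAGE)
```

A crawler would keep requesting `next_url` and parsing the result until the function returns `None` for it.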

Solution 2. Author: Dea Venditama (Chatbots Life): a short, 30-line example of a hashtag-based tweet crawler built with Tweepy. If you are lucky enough to get approved and receive authentication data from Twitter for the official API, this crawler will show you the basics of Tweepy; to complete it, you only have to write the code that saves tweets in the format you need.
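The saving step could look like the sketch below, assuming the tweets have already been reduced to plain dictionaries. The field names `id`, `created_at`, and `text` and the sample rows are illustrative assumptions, not the output of any particular library.

```python
import csv
import io

# Hypothetical tweets, already reduced to plain dictionaries.
tweets = [
    {"id": 1, "created_at": "2020-10-01 12:00:00", "text": "Hello, #spider"},
    {"id": 2, "created_at": "2020-10-02 08:30:00", "text": "Two lines\nin one tweet"},
]


def write_tweets_csv(rows, stream):
    """Write tweet dictionaries to any text stream as CSV with a header row."""
    writer = csv.DictWriter(stream, fieldnames=["id", "created_at", "text"])
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the example self-contained; for a real file use
# open("tweets.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
write_tweets_csv(tweets, buffer)
csv_text = buffer.getvalue()
```

The `csv` module quotes commas and line breaks inside the tweet text, so the file can be read back without losing data.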

Solution 3. Author: Dea Venditama (Medium): an article showing how complex it can be to crawl all the tweets of a specific account when the account has thousands of them. The first attempt used the twitter_scraper package, but it seemed impossible to get more than 800 tweets. The second attempt used custom code with BeautifulSoup, with the same result and the same limit. To check whether the limit of 800 can be overcome at all, the author retrieved 3200 tweets from the same account with the online tool Vicinitas. Although the author leaves the problem of crawling all tweets from an account unsolved, some of the code may be interesting to those who work on the same problem.

If you run into the limits of the official Twitter API or of other tools while writing a Python-based Twitter crawler, a third-party API can help. The Twitter Scraping and Crawling tool from Proxy Crawl promises to scrape millions of search results securely: you just send requests in Python and focus on the received data.

Twitter crawler User agent, how to change (Python example)

The User-agent is a part of the request headers that carries information about the client (browser or bot) that sends the request. For example, Google Chrome on Windows 10 has the following User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36.

For some tasks a User-agent is not necessary at all; for others it is enough to inspect the request headers with DevTools and manually copy the User-agent value into your script (open the Network tab and find the specific resource with the data you need). Alternatively, take a valid, up-to-date User-agent from one of the lists published online.

However, many IT managers protect their servers by blocking the standard Python request headers, which are sent by default if no User-agent is defined, and by monitoring the number of requests from each unique combination of IP address and User-agent.
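To see what those default headers look like, the request can be built without being sent. This is a small sketch that needs no network access; the variable names are illustrative.

```python
import requests

# Build the request without sending it, so no network access is needed.
session = requests.Session()
prepared = session.prepare_request(
    requests.Request("GET", "https://mobile.twitter.com/hashtag/spider")
)

# The value has the form "python-requests/<version>", which is easy to filter on.
default_user_agent = prepared.headers["User-Agent"]
```

Because the value announces the library by name, a server can tell a script from a browser by this header alone.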

In some cases Twitter will allow you to send a single request without any custom headers:

import requests


r = requests.get('https://mobile.twitter.com/hashtag/spider')

But a crawler that sends thousands of requests this way will be blocked in no time.

Besides using proxies, changing or randomly selecting User-agents is one of the most effective ways to keep the server from blocking your crawler and returning the “HTTP 503 Service Unavailable” error.

The most obvious way is to prepare a list of the most common User-agents (make it longer in a real project!) and randomly pass one of them into headers.

from random import choice
import requests


user_agents = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0',
]

Before each request, we pick a new User-agent like this, which lowers the probability of our crawler being blocked.

headers = {
    'User-agent': choice(user_agents),
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
}

r = requests.get('https://mobile.twitter.com/hashtag/spider', headers=headers)
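If the server still answers with 503, one option is to retry with a fresh User-agent after a pause. The sketch below is an assumption about how that could be wired up, not part of any library: `fetch_with_rotation`, its parameters, and the short `USER_AGENTS` pool are hypothetical, and the actual HTTP call is passed in as `fetch` so the logic can be tried without network access.

```python
import time
from random import choice

# Hypothetical pool; in a real project reuse a longer, up-to-date list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0",
]


def fetch_with_rotation(url, fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url, headers) with a fresh User-agent until it stops returning 503.

    `fetch` must return an object with a `status_code` attribute, for example
    lambda url, headers: requests.get(url, headers=headers).
    """
    response = None
    for attempt in range(max_attempts):
        headers = {"User-Agent": choice(USER_AGENTS)}
        response = fetch(url, headers)
        if response.status_code != 503:
            break
        if attempt < max_attempts - 1:
            # Double the pause after every failed attempt: 1 s, 2 s, 4 s, ...
            sleep(base_delay * 2 ** attempt)
    return response
```

The growing pause matters as much as the new header: retrying immediately only adds to the request rate that triggered the block in the first place.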

The list of User-agents should be updated from time to time as new browser versions are released. Including hundreds of old and irrelevant User-agents in this list is a bad idea, because proxies change less often than request headers: a high number of requests from the same IP address with dozens of different User-agents also looks suspicious to the server.
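One way to avoid that pattern is to tie each proxy to a single User-agent, so that every IP address always presents itself as the same browser. The following is a minimal sketch under that assumption; the proxy addresses and the helper names `user_agent_for` and `request_settings` are hypothetical.

```python
import hashlib

# Hypothetical proxies and a short list of User-agents.
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080", "http://10.0.0.3:8080"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0",
]


def user_agent_for(proxy):
    """Always map the same proxy to the same User-agent."""
    digest = hashlib.sha256(proxy.encode("utf-8")).digest()
    return USER_AGENTS[digest[0] % len(USER_AGENTS)]


def request_settings(proxy):
    """Build keyword arguments suitable for requests.get(url, **settings)."""
    return {
        "headers": {"User-Agent": user_agent_for(proxy)},
        "proxies": {"http": proxy, "https": proxy},
    }
```

With this mapping, a call such as `requests.get(url, **request_settings(choice(PROXIES)))` rotates the proxy and the User-agent together instead of independently.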

To get rid of the headache of changing and updating the list, you can use the fake-useragent Python package. It gives you an up-to-date User-agent string with the latest browser version, so your list will look like this:

from fake_useragent import UserAgent


ua = UserAgent() # more details at https://pypi.org/project/fake-useragent/
user_agents = [ua.ie, ua.opera, ua.chrome, ua.safari]

We can make the full code even shorter if it doesn’t matter which browser we imitate, by just passing ua.random in the headers:

import requests
from fake_useragent import UserAgent


ua = UserAgent()
headers = {
    'User-agent': ua.random,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
}

r = requests.get('https://mobile.twitter.com/hashtag/spider', headers=headers)

As you can see, randomizing User-agent is really easy with Python!
