Twitter API and Python Scraping Tutorial

By Alastair Beeson

What Is Twitter Data and How Can I Get It?

Twitter data falls into the category of network data on our scientific data page. Twitter data extends beyond the tweets themselves. You can also get information such as the location of the tweet's author, the tweet's reply count, the tweet's language, the time the tweet was created, and the user associated with the tweet.

The easiest way to get tweet data is through Twitter's own free API:

You need to register to get your consumer keys and access tokens which will be needed to use the API and complete the tutorial below.

The API is fairly straightforward to work with: each tweet object exposes explicit attributes from which you can extract the data you want. You can also easily write scripts to collect tweets that contain specific words, such as all tweets about the "Lakers" or "VR".
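As a rough illustration of the keyword idea, the sketch below assembles one search string from several terms. The helper name `build_query` is hypothetical (it is not part of Tweepy or the Twitter API), and it assumes the standard search syntax in which `OR` joins alternatives and quotation marks keep a multi-word phrase together.

```python
def build_query(keywords):
    """Join keywords into one search string (hypothetical helper)."""
    terms = []
    for word in keywords:
        word = word.strip()
        if not word:
            continue  # skip empty entries
        # Quote multi-word phrases so they are matched as a unit
        terms.append(f'"{word}"' if " " in word else word)
    return " OR ".join(terms)


print(build_query(["Lakers", "VR"]))
print(build_query(["Lakers", "virtual reality"]))
```

The resulting string would be passed as the `q` argument of Tweepy's search method, which is named `search_tweets` in Tweepy 4.x and `search` in earlier releases.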

However, there are several limitations of the Twitter API that are important to know. First, with the standard free API license, you can only fetch tweets from the last 7 days. This is a serious constraint when you want to analyze tweet trends over longer periods, for example how tweet volume or sentiment changed before and after a certain event. Likewise, you can only request 18,000 tweets per 15-minute window. Although this seems like a generous number, it can be frustrating if you are trying to process a really large volume of tweets or make multiple requests to collect tweets on different topics.
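To get a feel for what the rate limit means in practice, here is a small back-of-the-envelope sketch. It uses only the 18,000 tweets per 15-minute window figure quoted above; the function names are hypothetical, and the estimate assumes every window is filled completely, which real queries rarely manage.

```python
import math

TWEETS_PER_WINDOW = 18_000  # limit quoted in the paragraph above
WINDOW_MINUTES = 15


def windows_needed(n_tweets):
    """Number of rate-limit windows required to fetch n_tweets."""
    return math.ceil(n_tweets / TWEETS_PER_WINDOW)


def minimum_wait_minutes(n_tweets):
    """Lower bound on time spent waiting for the limit to reset.

    The first window can be used immediately; each further window
    costs one full reset period.
    """
    return max(windows_needed(n_tweets) - 1, 0) * WINDOW_MINUTES


for n in (10_000, 100_000, 1_000_000):
    print(n, "tweets:", windows_needed(n), "windows,",
          minimum_wait_minutes(n), "minutes of waiting at minimum")
```

Under these assumptions, one million tweets would need 56 windows and at least 825 minutes of waiting.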

What Software Can I Use Twitter Data With?

Twitter data has lots of different use cases and variables. You can analyze tweet times, locations, users, and text. Which software you can use therefore depends on the types of variables you are analyzing. Numerical data like tweet volume or time will be compatible with most graphing software. For the actual tweet content, you would likely have to apply NLP methods to turn qualitative data into quantitative data. The tutorial below will demonstrate how to turn your opinionated tweets into numerical scores that you can analyze and classify.
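To make the idea of turning text into a number concrete, here is a deliberately tiny sketch of a word-list scorer. This is not the method used later in the tutorial (which relies on TextBlob); the word lists and the function name `toy_polarity` are made up purely for illustration.

```python
import re

# Tiny, made-up word lists; a real lexicon would be far larger
POSITIVE = {"good", "great", "love", "win"}
NEGATIVE = {"bad", "terrible", "hate", "loss"}


def toy_polarity(text):
    """Return a score in [-1, 1]: the mean of +1/-1 over opinion words found."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = [1 for w in words if w in POSITIVE]
    hits += [-1 for w in words if w in NEGATIVE]
    # Text with no opinion words is treated as neutral
    return sum(hits) / len(hits) if hits else 0.0


print(toy_polarity("Great win for the Lakers"))
print(toy_polarity("I hate this terrible loss but love the team"))
print(toy_polarity("The game is at noon"))
```

Once each tweet has a score like this, it can be plotted, averaged, or grouped like any other numerical column.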

Twitter API Query, Cleaning and Analysis Tutorial

In this tutorial we will use Python and its library Tweepy to access the Twitter API and scrape tweets for exploratory and sentiment analysis. By using Python, we have access to all sorts of libraries beyond Tweepy, such as Pandas, which will make it easy to create, engineer and export a dataset from our API queries.

This tutorial will allow you to query the CNN Twitter account for its tweets and load the data as a Pandas dataframe. It uses Tweepy and Pandas as well as a few auxiliary libraries.

The best way to use this demo is through a Jupyter Notebook. If you don't have it installed on your computer, use to run this demo in your browser. I recommend putting each Step in its own individual cell.

Step 0: Install your libraries for this tutorial.

First you will likely need to install some of your libraries. Even JupyterLab only has the basic ones like Pandas and NumPy preinstalled. You will need Python installed for this step; recent Python installations come with pip, the Package Installer for Python.

Inputting the following in your command line should install most of the libraries needed.

python -m pip install --user numpy scipy matplotlib ipython jupyter pandas tweepy

The re module is part of Python's standard library, so it does not need to be installed. These lines will install TextBlob and its corpora:

pip install -U textblob

python -m textblob.download_corpora
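If you want to confirm that the installation worked before going further, a quick check along these lines can help. It is only a sketch: the helper name `missing_packages` is hypothetical, and the list simply mirrors the libraries imported in this tutorial.

```python
import importlib.util

# Import names of the third-party libraries this tutorial relies on
REQUIRED = ["numpy", "scipy", "pandas", "tweepy", "textblob"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("Missing:", ", ".join(missing) if missing else "none")
```

Anything listed as missing can be installed with pip in the same way as above.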

Also import all of your libraries in your first notebook cell with the following:

import tweepy # to use Twitter’s API

import re # regex for cleaning the tweets

import numpy as np # python math library

import pandas as pd # to create and manipulate dataframe objects

from textblob import TextBlob # for performing sentiment analysis on your tweets

import scipy as sp # another python math and science library

import base64 # needed for exporting your dataframe as csv

from IPython.display import HTML # needed for exporting your dataframe as csv

Step 1: Start by defining your keys and access tokens. You will need to sign up for the API to get customized ones for your specific account:

consumer_key = "your consumer key"

consumer_secret = "your secret consumer key"

access_token = "your access token"

access_token_secret = "your secret access token"
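If you plan to share your notebook, you may prefer not to paste the keys into it directly. One common alternative, sketched below, is to read them from environment variables. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_credentials` are assumptions made for this example, not names required by Twitter or Tweepy.

```python
import os

# Hypothetical environment variable names; use whatever naming you prefer
ENV_NAMES = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(env=None):
    """Read the four credentials from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    missing = [var for var in ENV_NAMES.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[var] for name, var in ENV_NAMES.items()}


# Demonstration with placeholder values instead of real keys
demo_env = {var: "placeholder" for var in ENV_NAMES.values()}
creds = load_credentials(demo_env)
print(sorted(creds))
```

Calling `load_credentials()` with no argument reads the real environment, and the returned values can then be assigned to the four variables used in the next step.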

Step 2: Define your twitter object. This includes your authentication object, your access tokens and your API object which you can later query.

def twitter():
    # Creating the authentication object
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    # Setting your access token and secret
    auth.set_access_token(access_token, access_token_secret)
    # Creating the API object while passing in auth information
    api = tweepy.API(auth, wait_on_rate_limit = True)
    return api

Step 3: Next you want to instantiate your twitter object, and then create a search variable that queries a specific user timeline which in this case is CNN. Then you can use the following to iterate through your search and print out their 10 most recent tweets.

# Creating tw object

tw = twitter()

# Extracting CNN tweets

search = tw.user_timeline(screen_name="cnnbrk", count = 200, lang ="en")

# Printing last 10 tweets

print("10 recent tweets:\n")

for tweets in search[:10]:
    print(tweets.text + '\n')

Step 4: Next you want to create a Pandas dataframe to store your twitter data. This specific implementation will create a Pandas dataframe with two columns: Tweets and Time.

df = pd.DataFrame([tweets.text for tweets in search], columns=['Tweets'])

df['Time'] = [tweets.created_at for tweets in search]
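The dataframe above keeps only the text and the creation time, but the same pattern extends to other attributes of the tweet object. The sketch below uses stand-in objects instead of real API results so that it runs without credentials. The attribute names (`lang`, `retweet_count`, `user.screen_name`) follow the fields of the standard tweet object, while `status_to_row` and the sample values are made up for illustration.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def status_to_row(status):
    """Pick a few attributes off a tweet object and return them as a dict."""
    return {
        "Tweets": status.text,
        "Time": status.created_at,
        "Language": status.lang,
        "Retweets": status.retweet_count,
        "User": status.user.screen_name,
    }


# Stand-in for the list returned by the API; the values are invented
fake_search = [
    SimpleNamespace(
        text="Example headline",
        created_at=datetime(2021, 3, 1, 12, 0, tzinfo=timezone.utc),
        lang="en",
        retweet_count=42,
        user=SimpleNamespace(screen_name="cnnbrk"),
    )
]

rows = [status_to_row(status) for status in fake_search]
print(rows[0]["User"], rows[0]["Retweets"])
```

Passing a list of dictionaries like `rows` to `pd.DataFrame` produces one column per key, so swapping `fake_search` for your real `search` results would give a wider dataframe.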

Step 5: Next if you want to perform any sentiment analysis techniques or corpus evaluation on your twitter data, you will want to remove any symbols, hyperlinks, mentions, retweets and more that would clash with analysis methods. You then apply your cleaning method to your Pandas dataframe.

# Cleaning the tweets

# Creating a function called clean. removing hyperlink, #, RT, @mentions

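def clean(x):
    # Remove the retweet marker at the start of the tweet
    x = re.sub(r'^RT[\s]+', '', x)
    # Remove hyperlinks (only the link itself, not the text after it)
    x = re.sub(r'https?://\S+', '', x)
    # Remove the hash sign but keep the hashtag word
    x = re.sub(r'#', '', x)
    # Remove @mentions (handles may contain letters, digits and underscores)
    x = re.sub(r'@[A-Za-z0-9_]+', '', x)
    # Remove colons and double quotes
    x = re.sub(r':', '', x)
    x = re.sub(r'"', '', x)
    return x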

df['Tweets'] = df['Tweets'].apply(clean)
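It is worth trying a cleaning function on a single made-up tweet before applying it to the whole dataframe. The sketch below is a compact variant of the cleaning step, not a replacement for it: the name `clean_tweet` and the sample tweet are invented, and it additionally collapses leftover whitespace, which the function above does not do.

```python
import re


def clean_tweet(text):
    """Strip retweet markers, links, mentions and a few symbols from a tweet."""
    text = re.sub(r'^RT\s+', '', text)          # leading retweet marker
    text = re.sub(r'https?://\S+', '', text)    # hyperlinks
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)  # @mentions
    text = re.sub(r'[#:"]', '', text)           # hash signs, colons, quotes
    text = re.sub(r'\s+', ' ', text)            # collapse leftover whitespace
    return text.strip()


sample = 'RT @cnnbrk: "Breaking" news on #VR https://t.co/abc123 today'
print(clean_tweet(sample))  # Breaking news on VR today
```

Checking one example by eye like this makes it easier to spot a pattern that removes too much or too little.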

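Step 6: Now that your data is cleaned, let's define a few methods to calculate sentiment scores for your tweets. We are using the TextBlob sentiment analysis library. Polarity ranges from -1 (negative) to 1 (positive), and subjectivity ranges from 0 (objective) to 1 (subjective).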

def detect_sentiment(text):
    return TextBlob(text).sentiment.polarity

def detect_subjectivity(text):
    return TextBlob(text).sentiment.subjectivity

Step 7: Now you will apply your sentiment analysis methods to create new columns in your dataframe for sentiment polarity and subjectivity as well as convert the time column to something more usable.

df.loc[:, 'sentiment'] = df.loc[:, 'Tweets'].apply(detect_sentiment)

df.loc[:, 'subjectivity'] = df.loc[:, 'Tweets'].apply(detect_subjectivity)

df['Time'] = pd.to_datetime(df['Time'])
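Once each tweet has a polarity score, a simple next step is to sort the scores into classes. The sketch below is one way to do that. The function name `label_polarity` and the 0.05 cutoff are arbitrary choices for this example rather than anything prescribed by TextBlob; it only assumes that polarity lies between -1 and 1.

```python
def label_polarity(score, threshold=0.05):
    """Map a polarity score in [-1, 1] to a coarse sentiment label."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


for score in (0.6, 0.0, -0.3):
    print(score, "->", label_polarity(score))
```

In the notebook this could be applied in the same way as the other functions, for example `df['label'] = df['sentiment'].apply(label_polarity)`.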

Step 8: This step allows you to export your dataframe as a CSV file, which can then be loaded into visualization applications, opened in Excel, converted to JSON and more.

import base64

import pandas as pd

from IPython.display import HTML

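def create_download_link(df, title="Download CSV file", filename="data.csv"):
    # Serialize the dataframe to CSV text
    csv = df.to_csv()
    # Base64-encode the CSV so it can be embedded in a data URI
    b64 = base64.b64encode(csv.encode())
    payload = b64.decode()
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload, title=title, filename=filename)
    return HTML(html)

# Display the download link as the output of the notebook cell
create_download_link(df)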
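The download link works by embedding the whole CSV file, base64-encoded, inside the link's `href` as a data URI. The standard-library sketch below shows that round trip on a single made-up row, without Pandas or a notebook. The helper name `csv_data_uri` and the sample row are assumptions for illustration.

```python
import base64
import csv
import io


def csv_data_uri(rows, fieldnames):
    """Encode a list of dicts as CSV and wrap it in a base64 data URI."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    payload = base64.b64encode(buffer.getvalue().encode("utf-8")).decode("ascii")
    return "data:text/csv;base64," + payload


rows = [{"Tweets": "Example headline", "sentiment": 0.5}]
uri = csv_data_uri(rows, ["Tweets", "sentiment"])

# Decoding the payload recovers the original CSV text
decoded = base64.b64decode(uri.split(",", 1)[1]).decode("utf-8")
print(decoded)
```

Because the file travels inside the link itself, this approach suits small datasets. For larger ones, writing the file directly with `df.to_csv("data.csv")` is usually simpler.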


Congratulations! You have successfully registered for the Twitter API, queried the CNN Twitter feed, created a dataset of tweets, performed simple sentiment analysis and exported your dataset as a CSV file. There is a lot more you can do with the Twitter API: more information to query and many more ways to evaluate and analyze your tweets.