Processing Your Data in Python with Jupyter Notebook

By Alastair Beeson

Python for Data Science

Python has quickly become the most popular programming language for Data Science and Analytics. It has numerous open-source libraries that handle everything from mathematics, quantitative economics, and graphing to predictive modeling and natural language processing.

Using Jupyter Notebook To Process Data

One of the best ways to do simple data science projects or process data in Python is with Jupyter Notebook. Jupyter Notebook is a web-based interactive computing environment used to create notebook documents. A notebook works like a REPL: it is an ordered list of input and output cells, and these cells can contain code, plots, text, media, and more.

There are a lot of use cases and benefits to notebooks. You can use them for teaching others, sharing code, exploratory data analysis, quick prototypes, data cleaning, and more. You can easily import any libraries you like to process data, create graphs, and query APIs. Once installed, most library imports are as easy as "import numpy as np". Thanks to the document format, it's easy to annotate and present your code cell by cell in a digestible way, and another person can easily run your code and see the results. Notebooks can also be exported as HTML or PDF files.

Thus a Jupyter Notebook is a great way to create, engineer, and export datasets for use in VR applications. You can also easily load datasets found online and modify them as you see fit: dropping unimportant columns, creating new columns, performing operations on data values, handling missing values, or joining datasets.
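The modifications listed above can be sketched in a few lines of Pandas. This is a minimal example using a small made-up table rather than a downloaded dataset; the column names and values are hypothetical.

```python
import pandas as pd

# A small hypothetical dataset standing in for one loaded from online
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Kyoto"],
    "population": [709000, 9752000, 1464000],
    "notes": ["a", "b", "c"],         # an unimportant column
    "area_km2": [454.0, None, 827.8]  # contains a missing value
})

df = df.drop(columns=["notes"])                    # drop an unimportant column
df["density"] = df["population"] / df["area_km2"]  # create a new column
df["area_km2"] = df["area_km2"].fillna(df["area_km2"].mean())  # handle missing values

# Join against a second (hypothetical) dataset on a shared key
countries = pd.DataFrame({"city": ["Oslo", "Lima", "Kyoto"],
                          "country": ["Norway", "Peru", "Japan"]})
df = df.merge(countries, on="city")
print(df)
```

In a notebook, each of these steps would typically live in its own cell so you can inspect the intermediate result before moving on.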


To get started with Jupyter Notebook:

Download Jupyter Notebook: https://jupyter.org/install

Use JupyterLab in Browser: https://jupyter.org/try-jupyter/lab/

Libraries for Data Processing, Visualization, and ML

Pandas is an essential library for Python Data Science. It comes with specific data structures and functions for dealing with numerical tables and time series data. Its best feature is the DataFrame data structure, which is analogous in many ways to an Excel table. DataFrames are generally created by loading CSV files and can also be exported as CSV files. Features of Pandas include the ability to easily add and drop columns, merge and join datasets, work with time series, handle missing data, and more. If you are trying to process data for VR applications, Pandas should be your go-to library.
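The CSV round trip described above looks something like this. The file contents and columns are made up; in a notebook you would usually call pd.read_csv("some_file.csv") on a real file, but here the CSV text is held in memory so the example is self-contained.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk
csv_text = "item,price\napples,1.20\nbread,2.50\nmilk,0.99\n"
df = pd.read_csv(io.StringIO(csv_text))

# Add a column (a hypothetical 8% tax)
df["price_with_tax"] = df["price"] * 1.08

# Export the modified table back out as CSV (pass a path to write a file)
out = df.to_csv(index=False)
print(out)
```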

NumPy is a ubiquitous library for doing mathematics in Python. It mainly concerns performing efficient operations on arrays and matrices, and it comes with many high-level math functions. Pandas is built on top of NumPy. It is generally worth importing NumPy, as its methods come in handy when processing and engineering data.
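A short sketch of what those array operations look like in practice, with made-up numbers:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.arange(4)          # array([0, 1, 2, 3])

total = a.sum()           # a high-level math function
scaled = a * 10           # elementwise arithmetic, no Python loop needed
dot = a @ b               # vector dot product

m = np.array([[1, 2], [3, 4]])
print(total, dot, m.T)    # .T transposes the matrix
```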

Matplotlib is a visualization library that takes its inspiration from MATLAB, a popular piece of software for numerical computing. It aims to be a free and open-source alternative with comparable functionality and syntax, with the added benefit of using Python. It has been around the longest of the Python visualization libraries, and other libraries like Pandas and Seaborn are built on top of it. It is not the fastest or most exciting option for quick data analysis and visualization, but it is the foundation the others share.
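A minimal Matplotlib sketch: plot a line and save it to a file. The data and file name are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs outside a notebook
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o")
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.set_title("A simple Matplotlib line plot")
fig.savefig("squares.png")  # in a notebook, the figure renders inline instead
```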

Scikit-learn is an incredibly powerful library for applying machine learning algorithms to data. It can be used for classification, regression, and clustering, and it includes many modern techniques like random forests, gradient boosting, k-means, and more. It is the fastest and easiest machine learning library to use. In some cases, three lines of code are all you need to get a rudimentary model.
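Here is roughly what that "three lines" claim looks like, using scikit-learn's built-in iris dataset and a random forest classifier. Evaluating on the training data, as done here, is only a sanity check, not a real measure of model quality.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)                         # line 1: load data
model = RandomForestClassifier(random_state=0).fit(X, y)  # line 2: fit a model
accuracy = model.score(X, y)                              # line 3: evaluate
print(f"training accuracy: {accuracy:.2f}")
```

For a proper evaluation you would split the data first with train_test_split and score on the held-out portion.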

TensorFlow, PyTorch, and Keras are other machine learning libraries and frameworks. They are a bit harder to set up, and you would want more modeling experience to get good value out of them. They lean more towards deep learning, academic research, and complex modeling. Of the three, Keras tends to be the fastest to get started with, as it is essentially a high-level interface to TensorFlow and is the most plug-and-play. Thus, if you are looking to predict simpler things like food prices or ticket prices, Keras and scikit-learn are the way to go.