So you found a way to store your data in Amazon Redshift and keep it in sync. Now you want to start working with it using statistical techniques, maybe build a model of your customers' behavior, or try to predict your churn rate.
To do that, you will need to extract your data from Redshift and use a framework or language that is well suited for data analysis; the most popular so far are Python and R.
In this short tutorial, we will see how to extract data stored in Amazon Redshift and load it with Python or R, so you can then use the numerous analytics libraries and algorithms that exist for these two languages.
What is Amazon Redshift?
Amazon Redshift is a petabyte-scale, massively parallel data warehouse. It is fully managed, with virtually unlimited storage and computing power thanks to the cloud, and all that at an affordable rate starting at a mere $0.25/hour.
Redshift is compatible with PostgreSQL, so in general you can safely assume that you can access Redshift using the official PostgreSQL libraries for the language of your choice. That means that what you read in this guide, more or less, also applies to the case where you want to use data coming from a PostgreSQL database instance. For JDBC and ODBC drivers, check the AWS documentation.
Before we continue…
Setting Up Your Amazon Redshift/PostgreSQL ETL Data Pipelines
This post assumes you already have your data in Redshift or PostgreSQL. In case you don't, there is a way to do it quickly and reliably.
The best way to load your data to your data warehouse is by utilizing AWS ETL tools that can collect data from all customer touchpoints and load that into Amazon Redshift or PostgreSQL with minimal effort.
For example, with Blendo, you can connect Amazon Redshift or PostgreSQL as the destination warehouse, select your data sources, and have analytics-ready data loaded into the destination just like that.
Although Blendo is an ELT tool (ETL vs. ELT), it performs some transformation during loading to reduce the time between extracting data from multiple sources and gaining data-driven insights by a whopping 90%. Such AWS ETL tools can enable you to set up a data pipeline within minutes and easily manage it from then onwards.
When it comes to Python, the most popular libraries for data analytics are the following:
- NumPy, which is probably the most fundamental library in Python for scientific computing.
- pandas, a Python data analysis library that provides high-performance data structures for working with table-like data.
No matter what kind of analysis you wish to do, from linear models to using frameworks like TensorFlow, you will need one of the above libraries to represent your initial data. For this reason, in this post, we will see how to connect to Amazon Redshift and load our data into NumPy and pandas structures.
Connect to Redshift with Python
To pull data out of Redshift, or any other database, we first need to connect to our instance. To do that we need a library or driver for Python that can talk to Amazon Redshift. You can use any of the available libraries for Python; the one that PostgreSQL recommends is Psycopg. We will use it for this tutorial, but feel free to experiment with any others you might want.
import psycopg2

con = psycopg2.connect(dbname='dbname', host='host', port='port',
                       user='user', password='pwd')
The above code is all we need to connect to our Redshift (or PostgreSQL) instance with Python using the Psycopg library. By connecting, we get back a connection variable with which we can start executing queries to pull data out of our database.
Of course, if you intend to use this code as part of your pipeline, it is highly recommended to wrap it in a function that will also handle any errors. The parameters you need are the typical ones for databases:
- Database name
- Hostname
- Port
- User name
- Password
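As a sketch of such a wrapper (the helper names and the return-None-on-failure behavior are my own choices, not part of Psycopg):

```python
def redshift_conn_kwargs(dbname, user, password, host, port=5439):
    """Gather the typical connection parameters; 5439 is Redshift's default port."""
    return {"dbname": dbname, "user": user, "password": password,
            "host": host, "port": port}

def connect_redshift(**params):
    """Open a connection, returning None instead of raising on failure."""
    import psycopg2  # imported here so the helper above loads even without the driver
    try:
        return psycopg2.connect(**params)
    except psycopg2.OperationalError as exc:
        print("Could not connect:", exc)
        return None
```

You would then call `con = connect_redshift(**redshift_conn_kwargs('dbname', 'user', 'pwd', 'host'))` and check the result before executing queries.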
Execute queries with Python using Psycopg
When you get your DB connection, you need to start pulling data. To do that you will need to execute some SQL query to narrow down the amount of data you get, based on the type of analysis you want to perform. To do that with Psycopg, take the following steps:
- First, get a cursor from your DB connection:
cur = con.cursor()
- Execute a SELECT query to pull data, where table is the name of the table you want to get data from:
cur.execute("SELECT * FROM table;")
- After the successful execution of your query, you need to instruct Psycopg how to fetch your data from the database. There are plenty of options, depending on how you plan to process your data. In most cases in analytics, you will need the complete dataset for analysis, so the most appropriate command is the following:
data = cur.fetchall()
- Next, do not forget to close your cursor and connection.
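The steps above can be sketched as one small helper (the function name is mine; since Psycopg follows the Python DB-API, the same pattern works unchanged against any compliant driver):

```python
def fetch_all(con, query):
    """Run a query and return every row, always closing the cursor afterwards."""
    cur = con.cursor()
    try:
        cur.execute(query)
        return cur.fetchall()
    finally:
        cur.close()
```

With a connection in hand you would call `data = fetch_all(con, "SELECT * FROM table;")` and, when you are completely done, `con.close()`.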
The most important part here is the SQL query that you execute. SQL is a powerful and expressive language, and you can do a large part of your data pre-processing with it.
For example, you might join different tables, or even create new features by using the aggregation functions that are supported by Redshift and PostgreSQL.
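For instance, a hypothetical orders table (customer_id, amount, created_at is a made-up schema for illustration) can be collapsed into per-customer features with a single aggregation query, which you would then run via cur.execute(FEATURES_QUERY):

```python
# Hypothetical schema, for illustration only: orders(customer_id, amount, created_at)
FEATURES_QUERY = """
SELECT customer_id,
       COUNT(*)        AS n_orders,
       SUM(amount)     AS total_spent,
       MAX(created_at) AS last_order
FROM orders
GROUP BY customer_id
"""
```

Pushing this aggregation into the database means only one compact row per customer travels over the wire, instead of every raw order.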
Load data to NumPy
Turning your data into a NumPy array is pretty straightforward. All you need to do is initialize a new NumPy array, passing our query results as a parameter.
import numpy as np

data = np.array(cur.fetchall())
Here, cur is the cursor we created previously. That's all: your data from Redshift is now a NumPy array 🙂
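One caveat worth knowing: np.array infers a single dtype for the whole array, so purely numeric rows become a numeric array, while mixed rows get coerced to a common type (here, strings), losing the per-column types. A quick sketch with hard-coded rows standing in for cur.fetchall():

```python
import numpy as np

numeric_rows = [(1, 2.5), (2, 3.0)]   # stand-in for cur.fetchall()
data = np.array(numeric_rows)         # a 2x2 float array

mixed_rows = [(1, "ann"), (2, "bob")]
labels = np.array(mixed_rows)         # everything is coerced to strings
```

If your result set mixes numbers and strings per column, pandas (next section) is usually the better fit, since a DataFrame keeps a separate dtype per column.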
Load data to pandas
If, instead of NumPy, you plan to work with pandas, you can skip the previous steps altogether. You can use the read_sql method, which reads an SQL query or database table directly into a DataFrame. To do that you will also need SQLAlchemy.
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('postgresql://scott:tiger@redshift_host:5439/mydatabase')
data_frame = pd.read_sql_query('SELECT * FROM table;', engine)
With the above code, you will end up having a pandas data frame that contains the results of the SQL query that you have provided to it.
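To see the same call in action without a Redshift cluster at hand: read_sql_query also accepts a plain DB-API connection, so the stdlib sqlite3 module can stand in for the engine (the users table and its rows here are made up for illustration):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, name TEXT)")
con.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])

# Same call as against the Redshift engine, just a different connection object
data_frame = pd.read_sql_query("SELECT * FROM users ORDER BY id", con)
```

The resulting DataFrame has one typed column per selected column, ready for pandas operations.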
Connect to Redshift with R
If Python is not your cup of tea and you prefer R instead, you are still covered. Getting your data from Amazon Redshift or PostgreSQL is equally easy as in Python. As in Python, we first need to take care of how we will connect to our database and execute queries against it. To do that we will need the "RPostgreSQL" package:
install.packages("RPostgreSQL")
library(RPostgreSQL)
With the above code, we make the package available and visible to R. Now we can proceed to create connections and execute queries. The snippet below also includes the command to close the connection; don't run that before you execute your queries, though :).
Always remember to close the connection when you are done with pulling data out of the database.
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname="dbname", host="host", port=1234,
                 user="user", password="password")
dbDisconnect(con)
Getting your data from the database into an R data frame is just one command away:
df_postgres <- dbGetQuery(con, "SELECT * FROM table")
You may also want to read how to access your data from Google BigQuery with Python and R.
As you can see, getting your data out of Redshift or PostgreSQL for further analysis in Python and R is easy. The real power of a database that stores your data, compared with CSV files and the like, is that you have SQL as an additional tool.
Investing some time in learning how to work with SQL will pay off quickly. Having structured data in a database and using SQL to pre-process your data before you start building your statistical models will save you time and resources.
Although in this article, we focused mainly on Amazon Redshift and PostgreSQL, using any other database is equally easy. The main difference will be the selection of the appropriate library for Python and R.
The use of SQLAlchemy by pandas makes this even easier, as the only change required is the configuration string for the database engine.
Would love to hear your thoughts in the comments below.