Export Data from Amazon Redshift
Equally important to loading data into a data warehouse like Amazon Redshift, is the process of exporting or unloading data from it. There are a couple of different reasons for this.
First, whatever action we perform to the data stored in Amazon Redshift, new data is generated. This data to be useful and actionable should be exported and consumed by a different system. Data is exported in various forms, from dashboards to raw data that is then consumed by different applications.
Second, you might need to unload data to analyze it using statistical methodologies or to build predictive models. This kind of applications requires from the data analyst to go beyond the SQL capabilities of the data warehouse.
In this chapter, we see how data is unloaded from Amazon Redshift and how someone can directly export data from it using frameworks and libraries that are common among analysts and data scientists.
How to Export Data from Redshift
The COPY command is the most common and recommended way for loading data into Amazon Redshift. Similarly, Amazon Redshift has the UNLOAD command, which can be used to unload the result of a query to one or more files on Amazon S3.
The data is unloaded in CSV format, and there’s a number of parameters that control how this happens.
- Manifest. This parameter indicates to Amazon Redshift to generate a Manifest file in JSON format, listing all the files that will be produced by the UNLOAD command.
- Delimiter. Specifies the delimiter to use in the CSV file.
- Encrypted. Specifies that the generated on S3 files will be encrypted using the AMAZON S3 server side encryption.
- BZIP2 or GZIP. Indicates that the unloaded files will be compressed using one of the two compression methods.
- NULL. NULL indicates which character to be used to represent NULL values. It is important if you perform further analysis on the data.
After a successful invocation of the UNLOAD command, the data will be available on S3 in CSV which is a format friendly for analysis but to interact with the data someone has to access it on S3.
How to Read Data from Amazon S3
The UNLOAD command gets your data into Amazon S3 so that you can work with it after its extraction from Amazon Redshift. Now you need somehow to interact with S3 and access your files.
Amazon AWS SDKs
Boto3 is a generic AWS SDK with support for all the different APIs that Amazon has, including S3 which is the one we are interested. Download a file using Boto3 is a very straightforward process.
Import boto3 s3 = boto3.resource('s3') with open('filename', 'wb') as data: s3.download_fileobj('mybucket', 'mykey', data)
Of course, it is possible to read a file directly into memory and use it with all the popular Python libraries for statistical analysis and modeling. It is advised, though, that you cache your data locally by saving into files on your local file system. Otherwise, every run of your program will require downloading the data from S3 over the network, something that adds additional latency and cost to your operations.
As we see, all it takes to download a file fromS3 is 4 lines of code, and it requires to know the bucket where your files exist, the name of the file you want to download and the key to use as credentials.
R is another popular choice for Analysts and Data Scientists when it comes to a language for scientific and statistical computing. Unfortunately, there’s no official AWS SDK for R, but there are a few options out there to help you interact directly with your data stored on S3.
A common way for R to interact with the AWS APIs is by invoking commands directly to the AWS CLI using the R System() call. To do that, you need to install the CLI and then invoke the commands you want, e.g. get an object from S3.
Another option is to use the Cloudyr package for S3. With it, download and working with files on S3 is just a one line command inside your R code. Just make sure that you have configured your credentials correctly for accessing your Amazon S3 account.
Reading Data directly from Amazon Redshift
So far we have seen how we can unload data from Amazon Redshift and interact with it through Amazon S3. This method is preferable when working with large amounts of data and you have concluded to the shape of the data that you would like to work.
There are cases where interacting directly with Amazon Redshift might be more desirable. For example during the process of exploring your data and deciding on the features that you would like to create out of it.
In this case, waiting to go through S3 every time you change something to the queries you work with adds much delay to your analysis.
To access your data directly on Amazon Redshift, you can use the drivers for PostgreSQL that your language of choice has.
For Python, you can use Psycopg which is the library recommended by PostgreSQL. The same can also be used to access your Amazon Redshift cluster and execute queries directly from within your Python code. After you have established a connection with your Amazon Redshift, you can work with the data using either NumPy or Pandas.
In a similar way to Python you can also interact with your Redshift cluster from within R. All it takes is to include the “RPostgreSQL” package to your code, and then you can execute queries directly to your data warehouse and pull data out of it and use it within your code.
You can find more information here on how to access your data in Amazon Redshift with Python and R.
Getting started with PostgreSQL in R has some great material that might be helpful also with Amazon Redshift