How to ETL MySQL data to Amazon Redshift – The DIY Approach

Blendo Team

Amazon Web Services has many benefits. Whether you choose it for the pay-as-you-go pricing, the high performance and speed, or the versatile and flexible services it provides, we are here to present the data loading approaches that work best for us. Today we are going to examine how to ETL MySQL data to Amazon Redshift.

We're going to look at some concepts around loading data into Redshift and the best practices you can follow, based on our experience. This post is part of a series that will go into more detail on the methods and options you have. After this small intro, let's cut to the chase.

The different methods for MySQL to Amazon Redshift Data Loading are:

  • Copy Command method (Dump and Load)
  • AWS Data Pipeline
  • RDS sync

It all comes down to your specific data warehouse needs.

For an easy, fast, ad hoc data upload, the COPY command is ideal. But if you wish to update your data recurrently, RDS sync or AWS Data Pipeline are better candidates.

If you need to transform your data while loading it into Redshift, we recommend RDS sync; for real-time replication and analytics, RDS sync and AWS Data Pipeline are both great choices.

1. Copy Command method (Dump and Load)

Starting off, the quickest and easiest way to ETL your MySQL data into Amazon Redshift is the COPY command, which loads dumped flat files (CSV, JSON) into Redshift. You first need to dump your MySQL database to flat files, upload those files to Amazon S3, and then issue a COPY command.
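To make the first two steps concrete, here is a minimal sketch in Python of dumping a table and uploading it to S3. The orders table, shop database, my-etl-bucket bucket, and credentials are hypothetical placeholders, and it assumes the pymysql and boto3 packages are installed; your table names and file layout will differ.

    import csv

    import boto3
    import pymysql

    # 1. Dump a MySQL table (hypothetical "orders") into a local CSV flat file.
    conn = pymysql.connect(host="mysql.example.com", user="etl_user",
                           password="secret", database="shop")
    with conn.cursor() as cur, open("orders.csv", "w", newline="") as f:
        cur.execute("SELECT * FROM orders")
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur.fetchall())
    conn.close()

    # 2. Upload the flat file to S3 so Redshift's COPY command can reach it.
    boto3.client("s3").upload_file("orders.csv", "my-etl-bucket",
                                   "mysql-dumps/orders.csv")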

Once the files are in S3, there's a native command you can run from the Redshift query editor to load them: the COPY command.
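Here is a minimal sketch of what that COPY statement might look like, issued here through Python and psycopg2 (any SQL client connected to Redshift works just as well); the table name, bucket path, cluster endpoint, and IAM role ARN below are placeholders.

    import psycopg2

    # Connect to the Redshift cluster (placeholder endpoint and credentials).
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="admin", password="secret")

    copy_sql = """
        COPY orders
        FROM 's3://my-etl-bucket/mysql-dumps/orders.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
        CSV
        IGNOREHEADER 1;
    """

    with conn.cursor() as cur:
        cur.execute(copy_sql)  # pull the S3 object into the Redshift table
    conn.commit()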

This basically allows you to pull data from a specific S3 object into a specific table that you've created in Amazon Redshift.

After pulling data into Redshift from an S3 bucket, it is quite crucial to verify the data load by running some test queries. It is also worth keeping S3's consistency model in mind while copying your data into S3; you can find more information in AWS's documentation.
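For example, here are two quick sanity checks you could run right after the load, reusing a psycopg2 connection like the one in the previous sketch; stl_load_errors is the Redshift system table that records rows rejected by COPY.

    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="admin", password="secret")

    with conn.cursor() as cur:
        # Compare this count against SELECT COUNT(*) on the MySQL side.
        cur.execute("SELECT COUNT(*) FROM orders;")
        print("rows loaded:", cur.fetchone()[0])

        # Any rows rejected by COPY show up here with a reason.
        cur.execute("""
            SELECT filename, line_number, err_reason
            FROM stl_load_errors
            ORDER BY starttime DESC
            LIMIT 10;
        """)
        for row in cur.fetchall():
            print(row)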

Running the COPY command applies automatic compression encoding to each individual piece of data pulled in, and it works like this:

Analysis of the data —> Sampling of columns —> Selecting the best possible compression method —> Applying compression to the Redshift table after data loading.
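If you are curious which encodings that automatic process actually picked, one way to check (a sketch, again assuming the hypothetical orders table and a psycopg2 connection conn opened as in the earlier sketches) is to query Redshift's pg_table_def catalog view after the load.

    with conn.cursor() as cur:
        # List the compression encoding COPY assigned to each column.
        # Note: only tables on the current search_path appear in pg_table_def.
        cur.execute("""
            SELECT "column", type, encoding
            FROM pg_table_def
            WHERE tablename = 'orders';
        """)
        for column, col_type, encoding in cur.fetchall():
            print(column, col_type, encoding)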

If you have small tables and want an ad hoc replication of your MySQL database to Redshift, the dump and load approach is the right fit for you.

2. AWS Data Pipeline

Amazon offers a separate service called AWS Data Pipeline with two main functions: reliably processing and reliably moving data, at specified intervals, between different AWS compute and storage services as well as on-premises data sources (your own data center or inside Amazon itself).

It's completely agnostic about where we pull the data from and where we put it, and the same applies to how the data gets processed along the way.

Image: AWS Documentation

A simple data pipeline model is the following:

  1. Pull data from an EC2 instance running somewhere on Amazon's infrastructure.
  2. On a daily basis, move that data into an S3 bucket.
  3. Store it in the bucket.
  4. On a weekly basis, collect some of it.
  5. Move it into an EMR cluster for processing.

Once processing finishes, we can dump our output into another S3 bucket.
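To give a flavor of how such a pipeline is created programmatically, here is a heavily trimmed sketch using boto3's Data Pipeline client. The names, schedule, and S3 log path are illustrative placeholders, and a real definition would also include the data nodes and activities for the EC2, S3, and EMR steps above.

    import boto3

    dp = boto3.client("datapipeline")

    # Register an empty pipeline shell.
    pipeline_id = dp.create_pipeline(
        name="mysql-to-redshift-demo",
        uniqueId="mysql-to-redshift-demo-001")["pipelineId"]

    # A minimal definition: a default object plus a daily schedule. Real
    # pipelines add S3DataNode / EmrActivity / RedshiftCopyActivity objects.
    objects = [
        {"id": "Default", "name": "Default", "fields": [
            {"key": "scheduleType", "stringValue": "cron"},
            {"key": "schedule", "refValue": "DailySchedule"},
            {"key": "pipelineLogUri", "stringValue": "s3://my-etl-bucket/logs/"},
        ]},
        {"id": "DailySchedule", "name": "DailySchedule", "fields": [
            {"key": "type", "stringValue": "Schedule"},
            {"key": "period", "stringValue": "1 day"},
            {"key": "startDateTime", "stringValue": "2024-01-01T00:00:00"},
        ]},
    ]

    dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)
    dp.activate_pipeline(pipelineId=pipeline_id)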

Data pipeline flows can be as simple as the one above, or they can get incredibly complex.

Here’s an example of a data pipeline that could be extremely complicated:

Image: Guidebee

Remember that there's no limit to the number of steps you can have in a pipeline. You could have all of your data stored in an existing database, run some preprocessing or an analysis task before pulling it into an S3 bucket, dump the output of that analysis into another S3 bucket, and so on.

Data Pipeline supports this full flow of steps and is a very useful service if you are interested in analyzing your data on the fly or if you plan to use Redshift extensively.

For close-to-real-time replication, large datasets, and regular updates, AWS Data Pipeline is the way to go.

3. RDS sync

RDS sync isn't a service offered natively by Amazon, but you can build it yourself in your own data center. It takes quite a lot of effort and maintenance, but the results are worth it.

If built properly, it lets you transfer data in real time, without any data loss, while performing advanced transformations along the way. This is the main reason so many ETL tools exist that offer self-service ETL from MySQL to Amazon Redshift.

But let's explain a bit how it works. We have a MySQL database (running locally, in RDS, or in any other Amazon service) that is writing out binary logs (binlogs).

From there, we read those binlogs through a little helper agent that runs inside our own data center. The agent takes those packets of data, processes them, and moves the data flow on to the second tier.

That data is put into Redshift, and from Redshift we can query away and perform analysis or reporting. This continuous flow is the main advantage of this method compared to the dump and load method we discussed earlier.
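The helper agent in this setup is essentially a binlog consumer. As a rough sketch of that piece, here is what reading MySQL row events can look like with the open-source python-mysql-replication library; the connection details are placeholders, and staging the events to S3 and Redshift is left out.

    from pymysqlreplication import BinLogStreamReader
    from pymysqlreplication.row_event import (
        DeleteRowsEvent,
        UpdateRowsEvent,
        WriteRowsEvent,
    )

    # Stream row-level changes from the MySQL binlog (placeholder credentials).
    stream = BinLogStreamReader(
        connection_settings={"host": "mysql.example.com", "port": 3306,
                             "user": "repl_user", "passwd": "secret"},
        server_id=100,          # must be unique among replicas
        only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
        blocking=True,          # keep waiting for new events
        resume_stream=True,
    )

    for event in stream:
        for row in event.rows:
            # A real agent would transform the row here and stage it, e.g.
            # as micro-batches to S3 followed by a COPY into Redshift.
            print(event.table, type(event).__name__, row)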

We recommend RDS sync for large and/or messy data, when you want to perform transformations and cleaning.

 

What’s Next?

We decided to give each method a try, so stay tuned for upcoming blog posts where we discuss and implement each method thoroughly. You may also subscribe below to receive our blog post updates.


