Amazon Web Services (AWS) has many benefits. Whether you choose it for its pay-as-you-go pricing, high performance and speed, or the versatility and flexibility of its services, it has you covered. Consequently, when building their data analytics stack, many companies choose AWS and Amazon Redshift. It wouldn’t be wrong to say that Amazon Redshift has made secure data storage simpler, more efficient, and a whole lot cheaper.
At the same time, companies use databases like MySQL or PostgreSQL to build their products on top of. Even products like Magento, WordPress, and WooCommerce have MySQL in their tech stack.
Given these points, it is an everyday use case: companies need to move their data from MySQL to Redshift or another “analytics-friendly” data warehouse.
So, in this article, we are going to examine how you can ETL MySQL data to Amazon Redshift.
We’re going to look at some concepts for loading data into Redshift, along with best practices drawn from our experience.
How to Integrate MySQL with Amazon Redshift
The different methods for MySQL to Amazon Redshift Data Loading are:
- COPY command method (dump and load)
- AWS Data Pipeline
- RDS sync
- The ready-to-use way (working with an ETL/ELT vendor)
It all comes down to your specific data warehouse needs.
For an easy, fast, ad hoc data upload, the COPY command is ideal. But if you want to update your data recurrently, RDS sync or AWS Data Pipeline are better fits.
Let’s see a quick comparison matrix on the benefits and disadvantages of each option.
| | COPY Command | RDS Sync | AWS Data Pipeline | ELT vendor (Blendo) |
| --- | --- | --- | --- | --- |
| Transform data on the fly | No | Yes | Yes | Depends (ETL vs. ELT) |
Depending on your use case and resources, there are great choices to go to.
1. COPY command method (dump and load)
To start, a relatively easy way to ETL your MySQL data into Amazon Redshift is the COPY command, which loads flat-file dumps (CSV, JSON) into Redshift. You first need to dump your MySQL database to a flat file, upload the files to Amazon S3, and then issue a COPY command.
There’s a native COPY command that you can run from the Redshift query editor. It allows you to pull data from a specific S3 object into a specific table that you’ve created in Amazon Redshift.
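As an illustration, after dumping a table to CSV (for example with `mysqldump --tab`) and uploading it to S3, a COPY statement follows a fixed shape. Here is a minimal sketch that builds one; the table name, bucket path, and IAM role below are placeholders, not values from this article:

```python
def build_copy_statement(table: str, s3_path: str, iam_role: str,
                         fmt: str = "CSV") -> str:
    """Build a Redshift COPY statement for a flat-file dump staged in S3.

    All names passed in (table, bucket, IAM role) are placeholders.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS {fmt};"
    )

sql = build_copy_statement(
    "public.orders",
    "s3://my-etl-bucket/mysql-dump/orders.csv",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
print(sql)
```

You would then run the resulting statement from the Redshift query editor or any SQL client connected to your cluster.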
After pulling data into Redshift from an S3 bucket, it is crucial to verify the data load by running some test queries. You should also take S3’s consistency model into consideration while copying your data. You can find more information in AWS’s documentation.
Running the COPY command executes automatic compression encoding on each piece of data pulled in, and it works like this:
Analysis of the data —> sampling of columns —> selecting the best possible compression method —> applying compression to the Redshift table after data loading.
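To see which encodings that automatic compression step actually chose, you can inspect Redshift’s `pg_table_def` system view after the load. A small sketch that builds such a check (the schema and table names are placeholders):

```python
def encoding_check_query(schema: str, table: str) -> str:
    """Build a query against pg_table_def to list the compression
    encoding Redshift chose for each column of a loaded table."""
    return (
        'SELECT "column", type, encoding\n'
        "FROM pg_table_def\n"
        f"WHERE schemaname = '{schema}' AND tablename = '{table}';"
    )

print(encoding_check_query("public", "orders"))
```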
If you have small tables and want an ad hoc replication of your MySQL database to Redshift, the dump and load approach is the right fit for you.
2. AWS Data Pipeline
Amazon offers a separate service called Data Pipeline with two main functions: you can process and move data between different AWS compute and storage services, as well as on-premises data sources (your own data center), at specified intervals.
It’s completely agnostic about where we’re pulling the data from and where we’re putting it, and the same applies to data processing.
A simple data pipeline model is the following:
- Pull data from an EC2 instance somewhere on Amazon’s infrastructure.
- Daily, move that data into an S3 bucket.
- Store it in the bucket.
- Weekly, collect some of it.
- Move it into an EMR cluster for processing.
Upon finishing processing, we can dump our data into another S3 bucket.
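A pipeline like this is expressed as a JSON definition. The fragment below is a rough, abbreviated sketch of the shape such a definition takes for the daily S3 staging step; the ids, paths, and schedule are placeholder values, and the full set of required fields is in the AWS Data Pipeline documentation:

```json
{
  "objects": [
    {
      "id": "DailySchedule",
      "type": "Schedule",
      "period": "1 day",
      "startAt": "FIRST_ACTIVATION_DATE_TIME"
    },
    {
      "id": "SourceData",
      "type": "S3DataNode",
      "directoryPath": "s3://my-etl-bucket/raw/",
      "schedule": { "ref": "DailySchedule" }
    },
    {
      "id": "StagedData",
      "type": "S3DataNode",
      "directoryPath": "s3://my-etl-bucket/staged/",
      "schedule": { "ref": "DailySchedule" }
    },
    {
      "id": "CopyToStaging",
      "type": "CopyActivity",
      "input": { "ref": "SourceData" },
      "output": { "ref": "StagedData" },
      "schedule": { "ref": "DailySchedule" }
    }
  ]
}
```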
Data pipeline flows can be as simple as the one above, or they can get incredibly complex, with many more stages, branches, and dependencies between them.
Remember that there’s no limit to the number of steps you could have in the pipeline. You could have all of your data stored in an existing database, run a preprocessing or analysis task before pulling it into an S3 bucket, dump the output of that analysis into another S3 bucket, and so on.
Data Pipeline can support this full flow of steps and is a handy service if you are interested in performing analysis of your data on the fly, or if you plan to use Redshift extensively.
For close-to-real-time replication, large datasets, and regular updates, AWS Data Pipeline is the way to go, though it comes with high complexity and maintenance overhead.
3. RDS sync
RDS sync isn’t a service offered natively by Amazon; it’s something you build in your own data center. It takes quite a lot of effort and maintenance, but the results are worth it.
If built correctly, you can transfer data in real time, without any data loss, while performing advanced transformations along the way. That is also why many ETL tools offer self-service MySQL to Amazon Redshift integration.
But let’s explain a bit how it works. We have a MySQL database (local, or hosted in RDS or another Amazon service) that writes out binary logs (binlogs).
A small helper agent running inside our own data center reads those binlogs, processes the packets of data, and forwards them to the second tier.
That data is loaded into Redshift, and from Redshift we can query away and perform analysis or reporting. That is the main advantage of this method compared to the dump-and-load approach we discussed earlier.
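To make the helper agent’s job concrete, here is a minimal, illustrative sketch of its transformation step: turning a parsed binlog row event into a statement that could be applied to Redshift. The event structure used here is a simplified assumption, not the exact format emitted by any particular binlog library:

```python
def event_to_sql(event: dict) -> str:
    """Translate one parsed binlog row event into a SQL statement.

    `event` uses an assumed, simplified shape:
      {"type": "insert" | "delete",
       "table": "<table name>",
       "row": {column: value, ...},
       "pk": "<primary key column>"}  # needed for deletes
    """
    table = event["table"]
    row = event["row"]
    if event["type"] == "insert":
        cols = ", ".join(row)
        vals = ", ".join(f"'{v}'" for v in row.values())
        return f"INSERT INTO {table} ({cols}) VALUES ({vals});"
    if event["type"] == "delete":
        pk = event["pk"]
        return f"DELETE FROM {table} WHERE {pk} = '{row[pk]}';"
    raise ValueError(f"unsupported event type: {event['type']}")

stmt = event_to_sql({
    "type": "insert",
    "table": "users",
    "row": {"id": "42", "email": "jane@example.com"},
})
print(stmt)
```

A production agent would also handle updates, batching, type conversion, and retries, which is where most of the maintenance effort of this approach goes.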
We recommend RDS sync for large and/or messy data, when you want to perform transformations and cleaning.
4. An ELT vendor
With this approach, you can combine the best of all the solutions we have described above. You get:
- Minimal stress to your database.
- Control over the data you want to replicate.
- No scripts or sophisticated MySQL ETL tools.
- The ability to ingest any other cloud data source (like Salesforce, NetSuite, and more).
- Minimal implementation time.
So here is where a solution like Blendo fits.
Blendo is a data infrastructure as a service solution. It is an ELT platform that enables organizations to pull together data from disparate silos into their data warehouse and reduce data movement. As a result, organizations can accelerate queries without having to manage the underlying infrastructure.
Blendo is an Amazon APN Technology Partner.
Blendo is a MySQL integration tool that can help you ingest data from MySQL, along with other data sources and all critical business data, into any database; in this example, Amazon Redshift.
With Blendo, you can have a MySQL Redshift data pipeline up and running smoothly without the need for advanced technical resources. You can integrate any data source within minutes, without having to write any code. Blendo is designed to offer self-serve analytics-ready data in modern data warehouses. Don’t hesitate to contact us and try Blendo for free.