Load Data Into Amazon Redshift
To carry out the most typical scenario for loading data into Amazon Redshift, we use another Amazon AWS resource as a staging entity for the process. E.g., you load data into your cluster through Amazon S3 or Amazon DynamoDB. To do that an essential prerequisite is to set up the right permission and roles to make these resources accessible to Amazon Redshift.
Access Rights and Credentials
To grant access to an Amazon Redshift instance to access and manipulate other resources, you need to authenticate it. There are two options available:
- Role Based Access. Your cluster temporarily assumes an AWS Identity and Access Management (IAM) role on your behalf.
- Key Based Access. You provide the Access Key ID and Secret Access Key for an IAM user that is authorized to access the AWS resources that contain the data.
Role Based Access
Role Based Access is the recommended by Amazon access option. It is more secure and offers fine-grained control of access to resources and sensitive data. To use a Role Based Access Control, you first have to create an IAM Role using the Amazon Redshift Service Role type and attach that role to a cluster. The minimum number of permissions the role must have is the following:
- CREATE LIBRARY
For a step by step guide on how to create an IAM Role and attach it to a cluster, please see this guide from Amazon.
After the creation of an IAM Role, an ARN for this particular role returns. The COPY command that loads data into the cluster uses this ARN as a parameter.
Key Based Access
To gain this type of access, you need to provide the Access Key ID together with the Secret Access Key for an already existing IAM User. That user must have the authorization to access a particular resource, from which data load into the Amazon Redshift cluster.
It is strongly suggested to use Role Based Access instead of a Key Base one, for security reasons. If for any reason you prefer to use the combination of a key and a secret, then make sure you do not use keys that associate with your root account. Instead, create a new IAM User following this guide.
The newly created IAM User should have at least the following permissions:
- CREATE LIBRARY
Amazon Redshift supports loading data from different sources, and the methods to use are mainly two:
- Use of the COPY command
- Using DML* (INSERT, UPDATE, and DELETE) Commands
(* DML = Data manipulation language – Wikipedia)
As Amazon Redshift is built on top of a PostgreSQL clone, you can connect to a cluster using the available JDBC and ODBC drivers and perform queries using DML commands. This technique should be avoided, even if it feels familiar to a data analyst.
It is not suggested to use this method of importing data, executing bulk INSERT commands over a JDBC or ODBC, does not perform well with large amounts of data.
On the contrary, using the COPY command is the suggested way to load large volumes of data on an Amazon Redshift cluster. In this way, we utilize the full potential of the MPP (Massive Parallel Processing) of Redshift, and data loads more efficiently and faster.
To ensure that the cluster is loading data to its full capacity you should:
- Split your data into multiple files, e.g. if you are using a source like Amazon S3 or multiple hosts
- Define the best Table Distribution Style for your data
Sources to Load your Data
The COPY command supports a wide number of different sources to load data.
The first and most common source is Amazon S3. There you can load data in CSV or JSON serialization.
Amazon EMR Cluster
Another potential source where COPY can load data is an Amazon EMR cluster. Data on an Amazon EMR Cluster get stored on an HDFS file system from which you can load it in parallel into an Amazon Redshift Instance.
It is also possible to load data (in parallel again) using the COPY command, from an arbitrary number of remote hosts. These hosts might be EC2 instances or any other computer connected to the Internet. The remote hosts should be able to allow access through SSH for COPY to be able to connect and pull the data.
For more information on how to load data from remote hosts, you can check here.
Finally, COPY supports DynamoDB as an input source. So, it is possible to replicate tables from DynamoDB to tables on Amazon Redshift.
COPY command parameters
The COPY command takes some parameters you should be aware. The following three are the required parameters:
- Data source, the place where we pull the data
- Table Name, the destination of where we store the data
- Authorization, credentials to access the data on the data source
There are also some optional parameters which might be helpful for optimization reasons and for making your ETL processes more resilient.
- Data Conversion Parameters are necessary for making your data loading process more resilient to errors. COPY attempts to implicitly convert the strings in the source data to the data type of the target column. Using these parameters, you can explicitly define how to perform the conversions.
- Column Mapping. The default behavior of the COPY command is to insert the data into the table, in the same order found in the input source. If you change this behavior, you should provide an optional column mapping.
- Data Format Parameters. There are some different options, ranging from CSV to AVRO files, that COPY understands. Using these parameters, you can provide to COPY a different type of data serialization.
- Data Load Parameters define the behavior of the loading process. You can change the way errors are handled and reported or the sample size for the selection of a compression method.
The above parameters make the COPY command a very flexible and versatile tool for pushing data into Amazon Redshift. Combined with knowledge about the nature and provenance of the data can lead to an optimal method for consistently loading data into an Amazon Redshift cluster.
Loading data from other sources
Many other sources are not supported by COPY directly. In this case, the recommended approach is to figure out a way to place the data on Amazon S3 and then use the COPY command.
Some notable examples are databases like PostgreSQL and MySQL that might be running on AWS RDS or any other provider. In this case, you need to figure out a way to dump the data from the database and then load it into S3.
For MySQL you can check the following resources:
You can use mysqldump to dump your data into CSV format. Here you can find out how to convert the output of mysqldump into a CSV file.
Here you can also find a small utility written in Python to do the same thing.
In a similar fashion, you can use the psql command to dump tables into CSV format, using the /copy command parameter.
The output of the above commands can then be loaded into S3 and then use the Amazon Redshift COPY command to load the data into the cluster.
Pulling data out of cloud applications is much more complicated. You need to develop a custom connector for each application to pull data out of it. Then, turn it into an appropriate CSV or JSON file and store it into S3.
Alternatively, you can use a service like Blendo to automatically sync the data from cloud applications like CRM and marketing platforms into Amazon Redshift.
Streaming data into Amazon Redshift
So far, all the different methods for loading data into Amazon Redshift are for updating your cluster in batches. That is a natural choice because traditionally, data warehouses were intended to be used to analyze large amounts of historical data.
Currently, there’s an increasing trend towards more real-time analytic applications where data is analyzed in almost real time. To address this new requirement Amazon Redshift, currently, supports the ingestion of data in real time using Amazon Kinesis.
The way to ingest streaming data into Redshift is by using the Amazon Kinesis Firehose which automatically batches and compresses streaming data before it loads it into a Redshift cluster. In this case, batched data are delivered automatically to S3 and then into Redshift with a COPY command that is provided by the user during configuration.
An interesting and useful feature, supported by Amazon Kinesis, is the online transformation of data. It is possible to define custom transformations as Amazon Lambda functions that transform your data before it loaded on a Redshift cluster.
Loading data into Amazon Redshift is probably one of the most fundamental operations that we perform frequently. It’s not uncommon to deploy complex pipelines which load data from many different sources into Redshift. Many things can go wrong in such a complex procedure, so here are a few things that might help you figure out what is going wrong.
The data to be loaded should be in UTF-8 encoding. No other encoding is supported, and if you try to load a different one, you might end up with errors. If for any reason you end up having invalid characters in the data you are going to load; then you might find helpful to add the ACCEPTINVCHARS parameter to the COPY command.
Make sure CHARS and VARCHARS conform to the max size of the corresponsive columns. The size of strings is measured in bytes and not characters.
If you are using any special characters in your CSV files, make sure to define an escape character as a parameter to your COPY command.
If your CSV has headers, use the IGNOREHEADER parameter for the COPY command.
All errors that occur during the data loading phase are stored into the STL_LOAD_ERRORS system table. Consult this table whenever you get an error as it contains useful error messages.