How to set up a Redshift cluster
In this section, we will see how to set up a new Amazon Redshift cluster. There are many options you can configure, but the main steps are as follows.
Your Amazon AWS dashboard
If you have an AWS account, you can log in here (you may also create a new one).
After you log in to AWS, navigate to Databases and click on Redshift.
Click on Launch Cluster.
Let’s set up our cluster.
- Cluster Identifier: Give it a name
- Database Name: This is optional. You can provide any name you want.
- Master User Name: Provide a username for the master user of the cluster
- Master User Password: the Password 🙂
If you are ready, click on Continue.
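If you prefer to script this step, the same four fields map onto parameters of the `create_cluster` call in the boto3 Redshift client. The snippet below is a minimal sketch, not the method used in this walkthrough: the helper name `cluster_details` and the sample values are hypothetical, and the password check only mirrors Redshift's documented rules (8 to 64 characters with at least one uppercase letter, one lowercase letter, and one digit).

```python
import re


def cluster_details(identifier, master_user, master_password, db_name=None):
    """Build the 'Cluster Details' part of a create_cluster request.

    Hypothetical helper for illustration; the dictionary keys are
    boto3 create_cluster parameter names.
    """
    # Redshift expects 8-64 characters with an uppercase letter,
    # a lowercase letter, and a digit in the master password.
    if not (8 <= len(master_password) <= 64
            and re.search(r"[A-Z]", master_password)
            and re.search(r"[a-z]", master_password)
            and re.search(r"\d", master_password)):
        raise ValueError("master password does not meet Redshift's rules")

    params = {
        "ClusterIdentifier": identifier,
        "MasterUsername": master_user,
        "MasterUserPassword": master_password,
    }
    # Database Name is optional; leave it out to accept the default.
    if db_name:
        params["DBName"] = db_name
    return params


# Sample values are placeholders, not recommendations.
details = cluster_details("my-first-cluster", "admin_user", "Example-Passw0rd")
print(sorted(details))
```

Because the Database Name is optional, the helper leaves `DBName` out unless you pass one, which matches leaving the field blank in the console.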
Selecting Node Type
Now we are in the Node Configuration part of our setup. At this point, we are going to choose between Dense Compute (DC) and Dense Storage (DS). As a reminder:
The Dense Compute cluster provides less storage, but with better performance and speed. The more data you are querying, the more compute you need to keep queries fast. Dense Compute is the type of instance to use if you need a high-performance data warehouse.
The Dense Storage cluster is designed for big data warehouses. So, if you have too much data to fit on Dense Compute nodes, Dense Storage is the type to use.
For our example, we will leave the Node Type as is.
Selecting Number of Nodes
You will need to choose the number of nodes that your cluster will use. That number depends on the size of your dataset and your desired query performance.
For example, ds2.xlarge Dense Storage nodes have 2 TB of HDD storage per node. For 12 TB of data, you need 6 ds2.xlarge nodes or 1 ds2.8xlarge node. By comparison, choosing dc1.8xlarge Dense Compute nodes gives you 2.56 TB of SSD storage per node.
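The storage arithmetic above can be sketched in a few lines. This is only a rough sizing aid under the assumption that storage is the sole constraint; it ignores compression, free-space headroom, and query performance. The function name `nodes_needed` is ours, and the per-node figures are the ones quoted in this section.

```python
import math


def nodes_needed(data_tb, storage_per_node_tb):
    """Smallest node count whose combined storage holds data_tb."""
    return math.ceil(data_tb / storage_per_node_tb)


# Per-node storage figures quoted in the text.
DS2_XLARGE_TB = 2.0
DC1_8XLARGE_TB = 2.56

print(nodes_needed(12, DS2_XLARGE_TB))   # 6
print(nodes_needed(12, DC1_8XLARGE_TB))  # 5
```

Rounding up matters here: 12 TB on dc1.8xlarge nodes works out to 4.69 nodes, so you would need 5.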
For our case, we chose: 2 x dc1.large Compute Nodes.
Click on Continue.
In this step, you configure some additional settings and the networking options.
A parameter group is a group of parameters that apply to all of the databases that you create in the cluster. You associate a parameter group with each Amazon Redshift cluster you create. Read more about Parameter Groups in AWS’s documentation here.
You may also enable database encryption for your Amazon Redshift cluster to protect your data. When you enable encryption for a cluster, the data blocks and system metadata are encrypted for the cluster and its snapshots. Read more about Amazon Redshift Database Encryption and the relevant best practices in AWS’s documentation here.
For our example, we leave the default settings*.
*It is advised to read and review Amazon AWS’s documentation for both above cases.
Now let’s configure our networking options.
Virtual Private Cloud (VPC)
Amazon Virtual Private Cloud (Amazon VPC) resembles a traditional virtual network. You may define that virtual network, the VPC, and launch the AWS services you need to run into it. For more details about VPC check AWS’s documentation here.
As a first step, if you already have a VPC, select it. If you do not have one, a default VPC is created.
For our example, we leave the default*.
*It is advised to read and review Amazon AWS’s documentation for the above cases.
Cluster subnet group
If you are going to provision a Redshift cluster in a VPC, then you need to create a cluster subnet group. You can have multiple subnets that will help you organize your AWS resources. For more details on Amazon Redshift Cluster Subnet Groups check AWS’s documentation here.
Configure the Publicly Accessible option
Configuring this is optional, but if you want to access your Redshift cluster from outside of AWS, then you need to add a public IP by setting Publicly Accessible to Yes. If you want your cluster to be accessible only from within your private VPC network, then choose No.
If you select Yes, then you have the option to select an Elastic IP address (EIP) to use for the external IP address.
An EIP is a static IP address that is associated with your AWS account. You can use an EIP to connect to your cluster from outside the VPC. Read more about EIP in AWS’s documentation here.
The choice of using a public IP or not is up to you.
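For scripted setups, the Publicly Accessible and Elastic IP choices correspond to the `PubliclyAccessible` and `ElasticIp` parameters of boto3's `create_cluster`. Here is a small sketch of how the two options depend on each other; the helper name `public_access` is hypothetical, and the address shown comes from a range reserved for documentation.

```python
import ipaddress


def public_access(publicly_accessible, elastic_ip=None):
    """Build the public-access part of a create_cluster request.

    Hypothetical helper; the keys are boto3 create_cluster parameters.
    """
    params = {"PubliclyAccessible": publicly_accessible}
    if elastic_ip is not None:
        if not publicly_accessible:
            raise ValueError(
                "an Elastic IP only applies when Publicly Accessible is Yes"
            )
        # Raises ValueError if the string is not a valid IP address.
        ipaddress.ip_address(elastic_ip)
        params["ElasticIp"] = elastic_ip
    return params


print(public_access(False))
print(public_access(True, "203.0.113.10"))  # placeholder address
```

Rejecting an Elastic IP when the cluster is private mirrors the console, where the EIP option only appears after you select Yes.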
Amazon Redshift Enhanced VPC Routing
If you select Yes, then Amazon Redshift forces all COPY and UNLOAD traffic between your cluster and your data repositories through your Amazon VPC. This matters because, without Enhanced VPC Routing, that traffic is routed through the Internet (including traffic to other services within the AWS network).
Enhanced VPC Routing needs extra care, and you probably need to review AWS’s documentation here.
Each region in AWS has multiple isolated locations known as Availability Zones. Read more about Regions and Availability Zones in AWS’s documentation here.
Select No Preference to have AWS select the Availability Zone in which your Redshift cluster will be created. Otherwise, select a specific Availability Zone.
VPC Security Groups
Last but not least, Security Groups. A Security Group is a set of rules that control access to your Redshift cluster, for example, a range of IP addresses from which a third-party tool is allowed to connect to your Redshift cluster. You can select this Security Group here, but you can also assign it later in your cluster configuration. For more details on Security Groups read AWS documentation here.
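To make the idea of a rule concrete, here is a sketch of an inbound rule that lets one IP range reach the cluster, in the shape expected by the EC2 client's `authorize_security_group_ingress` call in boto3. We assume the cluster listens on Redshift's default port, 5439; the helper names and the CIDR range are hypothetical, and `allow_access` is defined but not called because it needs boto3 and AWS credentials.

```python
import ipaddress

REDSHIFT_DEFAULT_PORT = 5439  # assumption: the cluster uses the default port


def redshift_ingress_rule(cidr, port=REDSHIFT_DEFAULT_PORT):
    """Build one IpPermissions entry allowing TCP access from cidr."""
    # strict=True rejects ranges with host bits set, e.g. 203.0.113.7/24.
    network = ipaddress.ip_network(cidr, strict=True)
    if network.version != 4:
        raise ValueError("this sketch only handles IPv4 ranges")
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": str(network)}],
    }


def allow_access(security_group_id, rule):
    """Apply the rule to a security group (needs boto3 and credentials)."""
    import boto3

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId=security_group_id, IpPermissions=[rule]
    )


rule = redshift_ingress_rule("203.0.113.0/24")  # placeholder range
print(rule)
```

Keeping the range as narrow as possible is the safer choice, since anything inside it can attempt to connect to the cluster.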
Click on Continue.
Launch your Cluster
On the next page, review your settings and click Launch Cluster.
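The review-then-launch step has a rough scripted equivalent, sketched below with boto3's `create_cluster`. All values are samples: the identifier and credentials are placeholders, the node settings copy the 2 x dc1.large choice from this walkthrough (that node type may not be offered in every account or region), and leaving out `AvailabilityZone` is our stand-in for No Preference. The helper names `review` and `launch_cluster` are hypothetical, and the launch call is commented out because it creates billable resources.

```python
def review(settings):
    """Return a printable copy of the settings with the password masked."""
    shown = dict(settings)
    if "MasterUserPassword" in shown:
        shown["MasterUserPassword"] = "********"
    return shown


def launch_cluster(settings):
    """Send the request to AWS (needs boto3 and valid AWS credentials)."""
    import boto3

    redshift = boto3.client("redshift")
    response = redshift.create_cluster(**settings)
    return response["Cluster"]["ClusterStatus"]


# Sample settings; every value here is a placeholder.
settings = {
    "ClusterIdentifier": "my-first-cluster",
    "MasterUsername": "admin_user",
    "MasterUserPassword": "Example-Passw0rd",
    "NodeType": "dc1.large",
    "ClusterType": "multi-node",
    "NumberOfNodes": 2,
    "PubliclyAccessible": False,
    "EnhancedVpcRouting": False,
}

for key, value in review(settings).items():
    print(f"{key}: {value}")

# launch_cluster(settings)  # uncomment to actually create the cluster
```

Masking the password in `review` works on a copy, so the original settings stay intact for the real request.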
Next, click Close and, on the next screen, wait for your cluster to become Available and Healthy.
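If you would rather not watch the console, waiting can be scripted by polling `describe_clusters` until `ClusterStatus` reads `available`. The sketch below takes the describe function as an argument so it can be tried without AWS access; the function name, polling interval, and attempt limit are our own assumptions, and it checks only the cluster status, not the separate health indicator shown in the console.

```python
import time


def wait_until_available(describe, cluster_id, poll_seconds=30,
                         max_attempts=60, sleep=time.sleep):
    """Poll until the cluster reports the 'available' status.

    describe is expected to behave like the boto3 Redshift client's
    describe_clusters method.
    """
    for _ in range(max_attempts):
        cluster = describe(ClusterIdentifier=cluster_id)["Clusters"][0]
        if cluster["ClusterStatus"] == "available":
            return cluster
        sleep(poll_seconds)
    raise TimeoutError(f"{cluster_id} was not available in time")


# Usage with AWS (needs boto3 and credentials):
#   import boto3
#   redshift = boto3.client("redshift")
#   wait_until_available(redshift.describe_clusters, "my-first-cluster")
```

boto3 also ships a built-in waiter for this, `redshift.get_waiter("cluster_available")`, which is likely the simpler option when you do not need custom polling logic.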
In the next section, we will see how to define networking and security settings for your Redshift instance.
This section contains parts of our knowledge base article “Setup Amazon Redshift”