McDonald’s or Burger King? Nike or Reebok? Marvel or DC?
Corporate history is full of business rivalries that we love reading about. In this post, we will compare two products from two great companies: Amazon's Redshift and Google's BigQuery. It is not as exciting as Batman vs. Spider-Man, but both products made quite a splash at launch, and there are plenty of posts out there trying to convince you that one is better than the other.
Here at Blendo, we admire and respect both products, as each one is an amazing engineering feat. For this reason, we will make as unbiased a comparison as possible. We will also not talk about performance; we will not use artificial datasets to measure speeds and latencies. Why?
First of all, no artificial dataset, no matter how good, will match the data that each user will actually work with. Also, there are already plenty of comparisons out there that use a dataset to measure performance. Finally, such performance tests are best performed by each customer, using a subset of the actual data that will eventually be loaded into the data warehouse; that way, the benchmark is far more meaningful.
Instead, we will focus on other aspects of data warehouse solutions and compare the two on fronts such as data modeling, data consistency guarantees, and the maintenance effort required.
A bit of history for Amazon Redshift and Google BigQuery
Amazon Redshift was released in 2012 as a beta version. It is based on ParAccel, a parallel relational database system derived from PostgreSQL. The story behind ParAccel and Redshift is quite interesting: in 2011, Amazon invested in ParAccel, which was developing the ParAccel Analytic Database, and in 2012, Amazon Redshift was announced, built on ParAccel technology. In 2013, ParAccel was acquired by Actian.
Amazon Redshift inherited the relational nature of PostgreSQL while being columnar and designed for OLAP and BI applications. Initially, Amazon claimed it could scale to petabytes, and with the introduction of Spectrum, it should scale to exabytes of data.
On the other hand, BigQuery was developed internally by Google and is the evolution of Dremel: BigQuery is a web service that exposes Dremel over a REST interface. It can be perceived more as a hybrid system, because it is columnar but also has excellent support for nested data. It is closer to NoSQL systems, although it now speaks an SQL dialect, and it too can scale to petabytes of data.
All the above lead us to the first notable differences between the two systems.
| Amazon Redshift | Google BigQuery |
| --- | --- |
| Fork of PostgreSQL | Built from scratch |
| Relational | Nested data structures are first-class citizens |
| SQL | NoSQL (but now it also speaks SQL) |
| Analysts are more familiar with it | Developers love it |
The last difference deserves special mention. Think of all the debates about SQL vs. NoSQL systems: developers find it easier to work with a system where nested data structures are natively supported, while analysts find it much easier to interact with a database system that speaks SQL well.
Similarly, analysts are more familiar with JDBC interfaces while developers find it more natural to speak over REST with a system.
Having covered the origin of each technology, let's now see how we can load data into each system and how they differ in this respect. Amazon Redshift and Google BigQuery both support bulk and streaming inserts.
Amazon Redshift supports loading data from S3, while Google BigQuery allows bulk inserts from data that lives on Google Cloud Storage. That is the most common way of loading data into both systems, and probably the most natural one, as both are intended for OLAP and BI use cases where real-time ingestion is usually not required.
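On the Redshift side, the bulk path is a single SQL command; here is a hedged sketch, where the table, bucket, and IAM role names are hypothetical placeholders:

```sql
-- Redshift: bulk-load CSV files from S3 with the COPY command.
-- Table, bucket path, and role ARN below are illustrative only.
COPY events
FROM 's3://my-bucket/events/2017-06-01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
FORMAT AS CSV
GZIP;   -- the source files may be gzip-compressed
```

On the BigQuery side, the equivalent is a load job pointed at a `gs://` URI, typically issued through the `bq load` command-line tool or the API.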
Nevertheless, both systems also support inserting data in a streaming fashion. Amazon Redshift does this through Kinesis, while Google BigQuery supports it more natively as part of the platform.
Finally, Google BigQuery also supports importing data directly from Google Analytics, but you need a premium account there, which is quite pricey.
When it comes to the data serializations supported by the two systems, there are no surprises. Both support the following:
| | Amazon Redshift | Google BigQuery |
| --- | --- | --- |
| Bulk data | S3 | Google Cloud Storage |
| Streaming data | Kinesis | Streaming inserts |
| Other sources | — | Google Analytics Premium |
| Data serializations | CSV, Avro, JSON | CSV, Avro, JSON |
Of course, both Amazon and Google have made sure that loading data from the rest of each company's infrastructure is easy. For example, you can quite easily load data from Amazon Aurora into Amazon Redshift and from Cloud SQL into Google BigQuery.
When it comes to data modeling, there are similarities but also some significant differences between the two systems. In both, data is organized on two levels. Both have the concept of tables; in Amazon Redshift, tables can be organized into schemas, as in PostgreSQL, while in Google BigQuery tables are grouped under datasets. Although the names differ, schemas and datasets serve a similar purpose.
The important differences between the two systems lie in the supported data types. More specifically, Redshift stays closer to the standard SQL data types, e.g., INT4 and INT8 are supported, but it does not support the full range of PostgreSQL data types.
Google BigQuery supports a smaller set of data types, which also deviates more from the standard SQL types, although mappings to them exist. Additionally, and here is one of the prominent differences, BigQuery has native support for ARRAY and STRUCT types. Because of that, nested data structures are first-class citizens, and you can query them directly in BigQuery.
It is advised not to flatten nested data when inserting it into BigQuery; instead, use the system's native support and query the data directly.
On the contrary, Amazon Redshift's native support for nested data in JSON format is not great, and it is advised to flatten such data before inserting it.
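To make the nested-data point concrete, here is a minimal sketch in BigQuery standard SQL, assuming a hypothetical `orders` table that carries a repeated STRUCT column `items`:

```sql
-- BigQuery standard SQL: query a nested, repeated field directly.
-- The `orders` table and its `items` column are hypothetical.
SELECT
  o.order_id,
  item.sku,
  item.quantity * item.unit_price AS line_total
FROM
  `my_project.my_dataset.orders` AS o,
  UNNEST(o.items) AS item   -- flattens the array only for this query
WHERE
  item.quantity > 1;
```

The data stays nested in storage; `UNNEST` flattens it on the fly at query time, which is exactly what you cannot do natively in Redshift.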
In general, Google BigQuery is an append-only data management system. It does now support updates and deletes (DML), but they remain limited and quite expensive to perform.
Amazon Redshift has always supported updates and deletes via SQL, but there is a catch here, called vacuuming, inherited from PostgreSQL. When a row is deleted, the space it occupied is not reclaimed; to reclaim it, we need to vacuum the table. The same applies to updates, because in Redshift an update is essentially a delete followed by a new insert.
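A minimal sketch of the maintenance this implies on Redshift (the `events` table and date filter are hypothetical):

```sql
-- Redshift: after heavy deletes or updates, reclaim space and restore sort order.
DELETE FROM events WHERE created_at < '2016-01-01';
VACUUM events;    -- reclaims the space left behind by deleted rows and re-sorts
ANALYZE events;   -- refreshes planner statistics after the change
```

Until the `VACUUM` runs, the deleted rows still occupy disk space and can degrade scan performance.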
Similarly, manipulating tables in BigQuery via standard SQL is limited and expensive. You can do it via the HTTP API that it offers, but you have to unload and reload the table.
Redshift supports the standard table operations of SQL, making it the easier of the two to work with in this respect.
Finally, both systems support views, but these cannot be materialized as in other systems like PostgreSQL.
| | Amazon Redshift | Google BigQuery |
| --- | --- | --- |
| Data organization | Tables, schemas | Tables, datasets |
| Data types | Subset of SQL data types | Custom data types with mappings |
| Nested data types | Not supported | Natively supported |
| Table manipulation | Supported with SQL | Limited and expensive, mainly through the HTTP API |
| Updates and deletes | Natively through SQL, with no additional cost | Limited and expensive |
The two systems offer different guarantees when it comes to data consistency. The main difference is that Amazon Redshift supports transactions while Google BigQuery does not. With Amazon Redshift, if anything goes wrong during a transaction, you can always roll back and be sure your data returns to its previous consistent state. Since Google BigQuery is more of an append-only storage engine, if something goes wrong during a loading process, you will have to start over from the beginning or build logic into your ETL pipeline to resume from the point of failure.
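A minimal sketch of what transactional loading buys you on Redshift (the table names are hypothetical):

```sql
-- Redshift: wrap a multi-step load in a transaction so that a failure
-- anywhere leaves the warehouse in its previous consistent state.
BEGIN;
DELETE FROM daily_sales WHERE sale_date = '2017-06-01';
INSERT INTO daily_sales
SELECT * FROM staging_sales WHERE sale_date = '2017-06-01';
COMMIT;   -- on any error before this point, issue ROLLBACK instead
```

Nothing equivalent exists in BigQuery: each load job either succeeds or fails on its own, and cross-statement rollback is up to your pipeline.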
Deduplicating data in BigQuery is also harder to achieve and costs additional money. Given the nature of the storage engine, the best approach is to create views for deduplication and query those instead of the raw tables. That way, deduplication happens only when needed.
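One common shape for such a view, sketched in BigQuery standard SQL (the table and the `id` / `ingested_at` columns are hypothetical), keeps only the most recent row per key:

```sql
-- BigQuery standard SQL: a deduplicating view over a raw, append-only table.
-- Assumes each row carries a business key `id` and a load timestamp `ingested_at`.
CREATE VIEW `my_project.my_dataset.events_dedup` AS
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY ingested_at DESC) AS rn
  FROM `my_project.my_dataset.events_raw`
)
WHERE rn = 1;   -- keep only the latest version of each id
```

Queries against `events_dedup` pay the deduplication cost only when they run, while the raw table stays untouched and append-only.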
Inserting streaming data imposes some additional restrictions on both systems. As we mentioned earlier, for Amazon Redshift the way to go for streaming data is through Kinesis, which offers at-least-once semantics: the system guarantees that your data will arrive, but you need to take care of duplicates on Amazon Redshift yourself. On the other hand, BigQuery offers best-effort deduplication of streaming data, using a time window together with an "insertId" that you have to provide.
With this mechanism, BigQuery will automatically deduplicate streaming data when possible, simplifying the handling of streaming data a bit compared to Amazon Redshift.
| | Amazon Redshift | Google BigQuery |
| --- | --- | --- |
| Deduplication (batch) | Easy to do | More difficult, and it costs |
| Deduplication (streaming) | No | Yes, with a time window + insertId |
| Delivery semantics (streaming) | At least once (Kinesis) | At least once, with best-effort deduplication |
Every system needs maintenance. That is even truer when we have to deal with large-scale, distributed systems like Google BigQuery and Amazon Redshift.
So, cluster management is an essential aspect of working with such systems and something we have to consider when deciding between the two platforms.
Here is where Google BigQuery shines. It is a fully managed system with support for high availability. When you start working with BigQuery, you do not have to mess around with hardware options, as the platform abstracts the hardware details entirely. You feed it data and execute queries; the system figures out, based on your data volume and queries, how to scale the underlying infrastructure to deliver answers on time.
On the other hand, Amazon Redshift requires more attention from the user when it comes to managing the infrastructure. The system does not abstract the hardware completely from the user, who has to choose between different instance types and configurations. Also, Amazon Redshift does not support multi-zone and multi-region deployments.
By default, the system takes snapshots and backs up your data to S3, so the chance of losing data is minimal; but if you have a large data warehouse, you will wait a long time for the data to load again onto a new instance. A multi-zone deployment is still possible, but it is something you as a user will have to take care of yourself.
Amazon Redshift addressed part of the above with the introduction of Spectrum, where scaling and resource allocation do not require any decisions from the user. An initial Amazon Redshift cluster is still needed, but Spectrum scales automatically based on the data being analyzed.
Another vital aspect of cluster management is how you connect to it to interact with the data. Here Amazon Redshift is a bit friendlier, especially for analysts. It offers standard connectivity through JDBC/ODBC, together with an API for anyone who would like to interact with it programmatically. Also, since Redshift is a PostgreSQL derivative, you can use all the standard community tools to connect to and interact with a cluster.
Google BigQuery limits interaction with the system mainly to a REST API. JDBC/ODBC drivers exist, but they are only capable of performing SELECT-style queries. For anything more involved in manipulating your data, you need to go through the HTTP API that Google exposes for BigQuery.
Here things are pretty standard for both systems: security and authentication are an essential dimension of both products, and both companies make sure they conform to the highest standards.
Amazon Redshift relies on Amazon IAM for managing access and identity for users. It is a complex but very feature-rich and robust system that gives an organization excellent flexibility for handling complex access and identity management scenarios.
Similarly, Google BigQuery relies on Google's Cloud Identity & Access Management. You can also use OAuth as a convenient way to access the cluster, especially when you want to authorize third-party products.
Quotas and Resource allocation
As mentioned earlier, BigQuery scales automatically and hides the underlying hardware from the user. It does, however, impose quotas on the resources available. The default is 2,000 slots per account. A slot is a unit of computational capacity consumed when a query executes, and BigQuery automatically calculates how many slots each query requires. Increasing the number of available slots for a given account is possible.
Also, BigQuery encourages an append-only data model by imposing strict per-day DML quotas.
Amazon Redshift takes a more traditional approach to sizing the cluster and the quotas you have, which are capped by the resources of the cluster you created. Available CPU, memory, and disk space are predefined when you size your cluster before it is created. On the other hand, there are no quotas on inserts and updates, as there are in BigQuery's case.
However, the main problem with Amazon Redshift is resizing, as you will have to relocate all your data to a new cluster. Depending on your data volume, this can take considerable time.
Finally, both systems have a limit of 50 concurrent queries, a restriction typical for OLAP systems: queries are expected to run for a long time, and running many of them concurrently is less important.
The two systems differ a lot in the capabilities they offer for optimizing your cluster. Google BigQuery, as a self-managed system, leaves little room for optimizing the underlying database; as a user, you rely on its query planner to figure out the most efficient resource allocation and to strike a good balance between cost and performance. All of this happens behind the scenes. There are a few things you can do, though, in the form of good practices.
- Optimize the queries on the statement level.
- Partition your data over time. If you work with time series data, you can reduce costs and increase speed this way.
- Exploit the excellent support for nested data structures, so you can avoid flattening and joining JSON data in your queries.
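For the time-partitioning tip, a minimal sketch in BigQuery standard SQL (the table and column names are hypothetical) against a day-partitioned table:

```sql
-- BigQuery: restrict a query to a date range on a day-partitioned table,
-- so only the relevant partitions are scanned (less data read, lower cost).
SELECT user_id, COUNT(*) AS sessions
FROM `my_project.my_dataset.pageviews`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2017-06-01') AND TIMESTAMP('2017-06-07')
GROUP BY user_id;
```

Without the `_PARTITIONTIME` filter, BigQuery would scan, and bill you for, the whole table.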
Amazon Redshift offers a wide range of knobs and pulls to optimize your cluster. The most important are the following:
- Distribution Keys. These allow you to define how the data will be distributed across the nodes in your cluster. Usually, you want your data to be evenly distributed across your nodes to exploit all their processing power fully.
- Sort Keys. Redshift is a columnar database, so the way your data is stored within a column matters. You should define sort keys based on your queries, because pre-sorted data heavily affects performance.
- Column compression. Another important parameter that affects performance is how the data of a specific column is compressed. Here you try to exploit the nature of the data you work with to reduce the time it takes to load the data into memory and work with it.
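The three knobs above all come together at table creation time. A hedged sketch, with a hypothetical table and illustrative encoding choices:

```sql
-- Redshift: distribution, sort, and compression settings are declared in the DDL.
-- Table and column names are hypothetical; encodings are illustrative choices.
CREATE TABLE page_events (
    user_id    BIGINT        ENCODE lzo,   -- per-column compression encoding
    url        VARCHAR(2048) ENCODE lzo,
    created_at TIMESTAMP     ENCODE delta  -- delta works well for monotonic values
)
DISTSTYLE KEY
DISTKEY (user_id)       -- co-locate rows with the same user_id on one node slice
SORTKEY (created_at);   -- pre-sort by time to speed up range scans
```

Choosing `user_id` as the distribution key is a good fit when joins and aggregations group by user; a skewed key would concentrate data on a few nodes and waste the cluster's parallelism.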