PostgreSQL is an amazing database! It’s one of the most well developed and mature relational database management systems, with its development, ongoing for more than 20 years now, from supporting small websites on single machines to data warehouses and large scale concurrent setups.
Because of that, companies frequently use Postgres as the database of choice for Products.
How to Integrate PostgreSQL with your Data Warehouse
Because of the broad usefulness of PostgreSQL, companies frequently use Postgres as the database of choice for Products. As a result, an essential and valuable part of your company data might live in a version of PostgreSQL.
This data has to be part of your data warehouse, be accessible by all the stakeholders, and enriched with data from other data sources.
Then you can have a complete view of all the insights you need for every function of your company.
“We have better insights and help us optimize where to spend our time and energy.”
-Agamemnon Papazoglou, Co-Founder, MyJobNow
Relying on PostgreSQL to run your product, means that you need to be extremely careful when you try to deploy a system that interacts with it. That is the case, even if the purpose is to pull data out from it and copy it to another database. In any case, you should never use your production database for analytics payloads.
So, the question is…
How can we safely move the data from a PostgreSQL instance to our data warehouse while ensuring that our production database will not be affected by this process?
There are several different ways to achieve this. Let’s see the most frequently deployed ones with advantages and disadvantages of each one.
In more detail, we will discuss:
- PostgreSQL Replication: Working with replicas
- Writing custom ETL/ELT Scripts
- Working with custom ETL/ELT Scripts using log replication
- Working with an ETL/ELT Vendor
PostgreSQL Replication: Working with Replicas
As scaling and high availability mechanism, PostgreSQL allows for setting up a cluster of databases. One database is the leading database (where all the current writes and reads are performed), and the rest of the cluster is following the leader and replicate its state. This mechanism is implemented mainly to offer a way for high availability, load balancing, and scaling of the database system but it can also be used as a method to replicate your production data to a data warehouse.
As we said in the beginning, you can use PostgreSQL for analytics workloads which means that you may use it as a data warehouse.
By using replication, we can have our production database used for our product and one of the replicas as the data warehouse.
The main benefit here is that we are can be sure that the production database is not affected by the ELT/ELT process at all. The leader database communicates the updates whenever it’s possible to the replica that acts as the data warehouse.
There are some issues, though:
- High maintenance. Someone has to set up a cluster and make sure that it works properly.
- Apart from fine-tuning the production database, now you may need to fine-tune a whole cluster. Is the replica lagging? How can we ensure that the lag is acceptable?
- High complexity. Setting up and running a cluster of PostgreSQL databases is not trivial.
- How the analytic queries on the replica affect the replication.
- Do we need all the data? If not, does it make sense to replicate everything anyway? In the end, we figured out how to replicate the production data, but we also need to ETL/ELT the rest of the data (e.g., Sales, Marketing, Support cloud application data) to the Replica database that acts as a data warehouse. How are we going to do that? How does this affect the replication process?
- The Replica needs to have the same version as the production database. Is this desirable for analytic purposes?
PostgreSQL integration with ETL/ELT Scripts
You might not want to use PostgreSQL in general, or the current version of the one you run on production in particular, for your data warehouse. That might happen for many reasons.
- Organizational reasons, for example, your analytics department wants to have ownership over the technologies used for analytics.
- Technological reasons, for example, the newest version of PostgreSQL has useful features for the analytics payloads that you have, or your scale requires to use something else like Amazon Redshift.
In any case, using a replica might not be the best solution for you.
A common way of moving data out of your production database is to use ETL/ELT scripts. These scripts are connecting directly to your database and pull out data and push it into your data warehouse.
There are some benefits here:
- You have much more control over what data you will copy from your production page to your data warehouse. For example, you might not want to move personal identifiable information, or, for cost-related reasons, you might want to pull only the recent data.
- You can transform the data. Do you need to incorporate some complex logic to transform your data into the shape that your analytics team needs? You have the flexibility of running complex logic over your data before you push it into your data warehouse.
- You have more control over when the data will move from the production database to your data warehouse.
But there are also some issues:
- You pretty much need a complete software engineering pipeline for PostgreSQL ETL. Who is writing and maintaining the scripts? How do we version control them? What happens when a bug appears, and how do we react?
- It’s not easy to identify how the scripts will affect your production database. The queries you have to perform are usually simple filtered selects but still, what happens when you perform these queries on large tables over busy time for your product?
- If you transform the data, how do you keep track of their lineage? How do you turn back in time if something went wrong, how do you update it when a transformation is not desired anymore or even wrong?
Working with ETL/ELT Scripts using log replication
You may address a number of the issues mentioned in the previous case by writing scripts. These scripts, instead of pulling data by performing queries, replicate data by listening to the replication log of the PostgreSQL database and using this information to push data into the data warehouse.
That looks like a hybrid solution between using replication and ETL/ELT scripts.
You set the database to replicate the data and instead of the script performing queries, acts as a replica and reads the data from the replication log.
Contrary to other databases, where the WAL (Write Ahead Log) is a stream of SQL queries, PostgreSQL implements the log in a binary form for performance reasons. It is possible though to use Logical Decoding to get the updates of the log from version 9.4.15 and above. In this way, it’s possible to replicate updates adding minimal stress to the database.
The benefits here are:
- You can replicate the data with guaranteed minimal stress on your database.
- You can transform the data if you want.
- Deletions are supported for replication.
- You maintain control over the data you want to replicate.
There also some disadvantages, though:
- You need specific versions of PostgreSQL for this method to work.
- If you are working with RDS and Google Storage, this is not supported.
- This feature can be activated only on the master database so you cannot connect to the replica.
- An increased volume of storage is required.
- You still need to do the initial sync using queries by connecting directly to the database. The Log only keeps updates and not the complete history.
- It’s a technically complex method to use and requires significant maintenance.
Working with an ETL/ELT Vendor
With this approach, you can combine the best of all the solutions we have described above. You get:
- Minimal stress to your database.
- Control over the data you want to replicate.
- No scripts or complex PostgreSQL ETL tools.
- The ability to ingest any other cloud data source (like Salesforce, NetSuite and more connectors)
- Minimal implementation time.
So here is where a solution like Blendo fits.
Blendo is a data infrastructure as a service solution. It is an ELT platform that enables organizations to pull together data from disparate silos into their data warehouse and reduce data movement. As a result, organizations can accelerate queries without having to manage the underlying infrastructure.
What are the differences between ETL vs. ELT
Get All the Postgres Integrations You Need with Blendo
Blendo is a PostgreSQL integration tool that can help you ingest data from PostgreSQL, along with other data sources and all critical business data into any database. Here are our supported integrations with PostgreSQL:
Amazon Redshift is one of the most popular data warehousing solutions, which is part of the Amazon Web Services (AWS) ecosystem. It is a petabyte-scale, fully managed data warehouse as a service solution that runs in the cloud.
You will definitely need some work on managing it, but it returns in value.
With Blendo, you can have a PostgreSQL Redshift data pipeline up and running smoothly without the need for super technical resources.
Blendo is an Amazon APN Technology partner. | Click here and download our free ebook: Amazon Redshift Guide for Data Analysts
Google BigQuery (part of Google Cloud) is Google’s serverless, highly scalable, low-cost enterprise data warehouse designed to make all your data analysts productive. Because there is no infrastructure to manage, you can focus on analyzing data. Find meaningful insights using familiar SQL, and additionally, you don’t need a database administrator.
You can set up a PostgreSQL BigQuery integration with ease and use Blendo as the reliable choice for moving your data and stop worrying about it.
If you want to build a reliable data warehouse quickly and you are on the edge of choosing, you may read our Amazon Redshift vs. Google BigQuery guide.
Blendo is a Google Cloud Technology Partner.
Microsoft SQL Server is one of the oldest and most mature database systems. SQL Server exists in different editions or flavors. You may use MS SQL server as onsite, in Microsoft’s Azure, Google Cloud or AWS.
Here too, you will need to manage your instances. Though it is widely accepted as a significant data warehouse for analytics purposes and companies large and small use it.
With Blendo you can have a PostgreSQL SQL Server data pipeline up and running quickly without the need for super technical resources.
You can now connect behind private networks with an SSH tunnel.
PostgreSQL is one of the most well known, popular, and well-supported databases. Amazon Redshift is based on an earlier version of PostgreSQL. PostgreSQL can scale to pretty large datasets and makes it an ideal database for analytics-related tasks.
Blendo is an Amazon APN Technology partner.
Are you using PostgreSQL as your product database? You can sync PostgreSQL to another PostgreSQL instance for analytical reasons.
Snowflake is a data warehouse built specifically for the cloud, it has a unique architecture which allows compute and storage to scale independently, instantly.
Moreover, the infrastructure, availability, scalability, and security are all automatically taken care, thus allowing you to focus on digging deeper into the data instead of managing it.
Blendo is a Snowflake Technology partner. | Snowflake ETL: Tools and Ways to Get Started
Panoply provides end-to-end data management-as-a-service. It can quickly set up a managed data warehouse and scale it without you having to deal with managing your analytical infrastructure. From Panoply, you can view, process, and query your data. Panoply also handles maintenance tasks like vacuuming and backup, optimizes your queries, and more.
Blendo is a Panoply partner.
Choose your data warehouse
Your digital analytics will need a data warehouse. Any data warehouse can fit your needs and provide in certain stages of your company.
Having a data infrastructure with Blendo can help you choose any of the above solutions undistracted and with the certainty that you can scale.
Business Intelligence With Blendo
In the end, whatever the data warehouse, companies want to have all data sources available for their analytics teams. Leadership will need to enable them with Postgres integration tools that will make their work faster and reliable so as the corporate decisions will not be outdated or obsolete.
Blendo will support you across the data journey so you can collect your data with ease, and run analyses in seconds without a server setup. Finally, you will need to visualize it with the dashboard and business intelligence tool you prefer. That will enable your company to uncover insights to understand the customer journey, drive better campaign results, and offer better customer experience.
Whichever tool you use, Blendo is vendor agnostic.
We are partnering with BI tools so you can have your integrations and rest of business data in Looker, Sisense, Chartio, Yellowfin. You can use tools like Microsoft Power BI, Google Data Studio, or Tableau.
With Blendo, you can integrate any data source, within minutes, without having to write any code. Blendo is designed to offer self-serve analytics-ready data in modern data warehouses. Don’t hesitate to contact us and try Blendo for free.