How to Load Data from Salesforce to BigQuery Step-by-Step

Blendo Team

How can I load data from Salesforce to Google BigQuery for further analysis? This post is an overview of how to access and extract your data from Salesforce through its API, and how to load it into BigQuery. Alternatively, you can check out Blendo to load your data sets from Salesforce to BigQuery in minutes.

Pull data from Salesforce

You can’t use a Data Warehouse without data, so the first and most important step is to extract the data sets you want from Salesforce.

Salesforce has many products and it’s also a pioneer in cloud computing and the API economy. This means that it offers a plethora of APIs to access the services and the underlying data sets. In this post we’ll focus only on Salesforce CRM, which again exposes a large number of APIs to the world.

More specifically, when choosing which API to use, we have the following options.

  • Chatter REST API
  • Bulk API
  • Metadata API
  • Streaming API
  • Apex REST API
  • Apex SOAP API
  • Tooling API

Pull data from the Salesforce REST API

From the above list, the complexity and feature richness of the Salesforce API is evident. The REST API and the SOAP API expose the same functionality but use different protocols. You can interact with the REST API using tools like curl or Postman, or with an HTTP client for your favorite language or framework.

The Salesforce REST API supports OAuth 2.0 authentication; more information can be found in Salesforce’s Understanding Authentication article. After you successfully authenticate with the REST API, you can start interacting with its resources and fetching data in order to load it into your data warehouse.
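As a rough illustration, here is a minimal Python sketch of the username-password OAuth 2.0 flow, one of several flows Salesforce supports. The credential values are placeholders you would replace with your own connected app’s settings:

```python
import json
import urllib.parse
import urllib.request

# Salesforce's OAuth token endpoint for the username-password flow.
TOKEN_URL = "https://login.salesforce.com/services/oauth2/token"

def build_token_request(client_id, client_secret, username, password, security_token):
    """Form payload for the OAuth 2.0 username-password flow."""
    return {
        "grant_type": "password",
        "client_id": client_id,
        "client_secret": client_secret,
        "username": username,
        # Salesforce expects the password concatenated with the security token.
        "password": password + security_token,
    }

def get_access_token(payload):
    """Exchange credentials for an access token (performs a network call)."""
    data = urllib.parse.urlencode(payload).encode()
    with urllib.request.urlopen(TOKEN_URL, data=data) as resp:
        return json.loads(resp.read())["access_token"]
```

The returned access token is what you then pass in the `Authorization: Bearer` header of subsequent REST calls.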

It’s easy to get a list of all the resources we have access to; for example, using curl we can execute the following:

curl https://yourInstance.salesforce.com/services/data/v26.0/ -H "Authorization: Bearer access_token"

A typical response from the server will be a list of available resources in JSON or XML, depending on what you have asked as part of your request.

{
  "sobjects" : "/services/data/v26.0/sobjects",
  "licensing" : "/services/data/v26.0/licensing",
  "connect" : "/services/data/v26.0/connect",
  "search" : "/services/data/v26.0/search",
  "query" : "/services/data/v26.0/query",
  "tooling" : "/services/data/v26.0/tooling",
  "chatter" : "/services/data/v26.0/chatter",
  "recent" : "/services/data/v26.0/recent"
}

The Salesforce REST API is very expressive; it also supports a language called Salesforce Object Query Language (SOQL) for executing arbitrarily complex queries. For example, the following curl command will return the Name fields of accounts:

curl https://yourInstance.salesforce.com/services/data/v26.0/query/?q=SELECT+Name+FROM+Account -H "Authorization: Bearer access_token"

and the result will look like the following:

{
  "done" : true,
  "totalSize" : 14,
  "records" : [
    {
      "attributes" : { ... }
    }
  ]
}

Again, the result can be in either JSON or XML serialization. We would recommend using JSON, as it will make the whole data connection process easier; the most popular data warehousing solutions natively support it.

With XML you might have to transform it first into JSON before loading the data into the repository. More information about SOQL can be found on the Salesforce Object Query Language specification page.
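As a sketch of what a SOQL query looks like from code, the helper below builds the REST query URL and executes it. The instance URL, API version, and token are placeholders for values obtained during authentication:

```python
import json
import urllib.parse
import urllib.request

def build_query_url(instance_url, soql, api_version="v26.0"):
    """URL for the REST query resource with a URL-encoded SOQL statement."""
    return "{}/services/data/{}/query/?q={}".format(
        instance_url, api_version, urllib.parse.quote(soql))

def run_query(instance_url, access_token, soql):
    """Execute a SOQL query and return the parsed JSON response (network call)."""
    req = urllib.request.Request(
        build_query_url(instance_url, soql),
        headers={"Authorization": "Bearer " + access_token})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

The response includes a `done` flag and, for large result sets, a `nextRecordsUrl` you would follow to page through the remaining records.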

If for any reason you would prefer to use SOAP, then you should create a SOAP client first: for example, you can use the Web Service Connector (WSC) client, or create your own from the WSDL using the information provided by this guide.

Despite the protocol changes, the architecture of the API remains the same, so again you will be able to access the same resources.

After you have your client ready and are able to connect to Salesforce, you need to perform the following steps:

  1. Decide which resources to extract from the API
  2. Map these resources to the schema of the data warehouse repository you will use
  3. Transform the data to match that schema, and
  4. Load the transformed data into the repository, based on the instructions below
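The steps above can be sketched as a tiny pipeline. The field mapping and loader below are hypothetical examples, not a real warehouse schema:

```python
# Hypothetical mapping from Salesforce field names to warehouse column names.
FIELD_MAP = {"Id": "account_id", "Name": "account_name"}

def transform_record(sf_record, field_map=FIELD_MAP):
    """Map one extracted Salesforce record onto the warehouse schema."""
    return {dest: sf_record.get(src) for src, dest in field_map.items()}

def run_pipeline(records, load_fn):
    """Transform each extracted record and hand the rows to a loader function."""
    rows = [transform_record(r) for r in records]
    load_fn(rows)  # e.g. write to Cloud Storage or POST to BigQuery
    return rows
```

In a real pipeline, `load_fn` would be the Cloud Storage upload or BigQuery insert described later in this post, with retries and error handling added.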

As you can see, accessing the API alone is not enough to ensure the operation of a data pipeline that will deliver your data safely and on time to a data warehousing solution for analysis.

Pull Data using the Salesforce Streaming API

Another interesting way of interacting with Salesforce is through the Streaming API. With it, you define queries, and every time something changes in the data matching those queries, you get a notification. So, for example, every time a new account is created, the API will push a notification about the event to your desired service.

This is an extremely powerful mechanism that can guarantee almost real-time updates of your Data Warehouse repository. In order to implement something like that, though, you need to take into consideration the limitations of both ends, while ensuring the delivery semantics that your use case requires for the data management infrastructure you will build.

For more information, you can read the documentation of the Streaming API.
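For a sense of the protocol involved: the Streaming API speaks Bayeux (CometD), where a client first performs a handshake and then subscribes to a PushTopic channel. The two helpers below sketch those messages; the PushTopic name is a hypothetical example:

```python
def handshake_message():
    """First message of a Bayeux (CometD) session."""
    return {
        "channel": "/meta/handshake",
        "version": "1.0",
        "supportedConnectionTypes": ["long-polling"],
    }

def subscribe_message(client_id, push_topic):
    """Subscribe the session to a PushTopic's notification channel."""
    return {
        "channel": "/meta/subscribe",
        "clientId": client_id,  # returned by the server in the handshake reply
        "subscription": "/topic/" + push_topic,
    }
```

In practice you would POST these payloads to your instance’s CometD endpoint and then long-poll for events; a CometD client library handles this loop for you.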

About Salesforce

Salesforce is the innovative company behind the world’s #1 CRM platform, which employees can access entirely over the Internet: there’s no infrastructure to buy, set up, or manage; you just log in and get to work. But Salesforce has become something much bigger than its CRM solution; it currently offers products for data analytics, marketing, data, and even cloud infrastructure for IoT. If you are using its products, it’s easy to see that your company is generating an abundance of data on Salesforce.

Prepare your Salesforce Data for Google BigQuery

Before you load your data set into BigQuery, you should make sure that it is presented in a supported format; for example, if the API you pull data from returns XML, you first have to transform it into a serialization that BigQuery understands.

Currently, two data formats are supported:

  • CSV, and
  • JSON
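
One practical detail: for JSON, BigQuery expects newline-delimited JSON (one object per line) rather than a single JSON array. A small sketch of converting Salesforce REST records, which also drops the `attributes` metadata key that Salesforce adds to each record:

```python
import json

def to_newline_delimited_json(records):
    """Serialize records as newline-delimited JSON, one object per line,
    dropping Salesforce's 'attributes' metadata key from each record."""
    lines = []
    for record in records:
        row = {k: v for k, v in record.items() if k != "attributes"}
        lines.append(json.dumps(row))
    return "\n".join(lines)
```

The resulting string can be written to a file and uploaded to Cloud Storage, or sent to BigQuery directly, as described below.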

You also need to make sure that the data types you are using are supported by BigQuery, which are the following:

  • STRING
  • INTEGER
  • FLOAT
  • BOOLEAN
  • RECORD
  • TIMESTAMP

For more information please check the Preparing Data for BigQuery page on the documentation.

Load Salesforce Data to Google BigQuery Data Warehouse

Before you start querying your data with Google BigQuery, you first need to load your Salesforce data into it. There are two approaches you can follow to achieve this.

The first is to load the data; the second is to set up your data as a federated data source. Federated data sources are created only on Google Cloud Storage, which is just one of the many possible sources from which you can load your data into BigQuery, so we are not going to elaborate on this. If you decide to use it, you can follow the instructions for loading your data into Google Cloud Storage. Do not forget to check the limitations of this approach.

About Google BigQuery

BigQuery is Google’s data warehouse. It’s part of the Google Cloud Platform, and it speaks SQL like Amazon Redshift does. Queries are executed against append-only tables using the processing power of Google’s infrastructure. It is fully managed and offered as a service over the cloud. You can interact with it through its web UI or a command-line tool, and a variety of client libraries exist so you can interact with it from your application.

If you want to load Salesforce data to Google Cloud BigQuery, you have to use one of the following supported data sources.

  1. Google Cloud Storage
  2. Send data directly to BigQuery with a POST request
  3. Google Cloud Datastore Backup
  4. Streaming insert
  5. App Engine log files
  6. Cloud Storage logs

From the above list of sources, 5 and 6 are not applicable in our case.

For Google Cloud Storage, you first have to load your data into it. There are a few options for how to do this; for example, you can use the console directly (do not forget to follow the best practices).

Another option is to post your data through the Cloud Storage JSON API. As we see, APIs play an important role in both the extraction and the loading of data into our data warehouse. In the simplest case it’s just a matter of one HTTP POST request using a tool like curl or Postman. It should look like the following example.

POST /upload/storage/v1/b/myBucket/o?uploadType=media&name=myObject HTTP/1.1
Host: www.googleapis.com
Content-Type: application/text
Content-Length: number_of_bytes_in_file
Authorization: Bearer your_auth_token

your Salesforce data

and if everything went ok, you should get something like the following as a response from the server:

HTTP/1.1 200
Content-Type: application/json

{
  "name": "myObject"
}

Working with curl or Postman is good for testing, but if you would like to automate the process of loading your data into Google BigQuery, you should write some code to send your data to Google Cloud Storage.
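A minimal Python sketch of such an automated upload, using the same media-upload endpoint as the request above; the bucket name, object name, and token are placeholders:

```python
import urllib.request

# Media-upload endpoint of the Cloud Storage JSON API.
GCS_UPLOAD = ("https://www.googleapis.com/upload/storage/v1"
              "/b/{bucket}/o?uploadType=media&name={name}")

def build_upload_request(bucket, object_name, data, token):
    """Build the POST request that uploads one object to Cloud Storage."""
    url = GCS_UPLOAD.format(bucket=bucket, name=object_name)
    return urllib.request.Request(
        url,
        data=data,  # the serialized Salesforce data, as bytes
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )

def upload(bucket, object_name, data, token):
    """Perform the upload (performs a network call)."""
    with urllib.request.urlopen(
            build_upload_request(bucket, object_name, data, token)) as resp:
        return resp.status
```

In production code you would add retry logic and resumable uploads for large files instead of a single media upload.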

In case you are developing on Google App Engine, you can use the client library that is available for the following languages:

  1. Python
  2. Java
  3. PHP
  4. Go

If you are using one of the above languages and you are not coding for Google App Engine, you can still use these libraries to access Cloud Storage from your environment. Interacting with a feature-rich product like Google Cloud Storage can become quite complicated, depending on your use case. For more details on the different options that exist, you can check the Google Cloud Storage documentation.

If you are looking for a less involved way of using Cloud Storage, you can consider a solution like Blendo.

After you have loaded your data into Google Cloud Storage, you have to create a Load Job for BigQuery to actually load the data into it. This job should point to the source data in Cloud Storage that has to be imported; this happens by providing source URIs that point to the appropriate objects.
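As a sketch, the body of such a load job might look like the following; the project, dataset, table, and bucket names are hypothetical:

```python
def build_load_job(project, dataset, table, source_uris):
    """Load-job configuration that imports newline-delimited JSON
    from Cloud Storage into a BigQuery table."""
    return {
        "configuration": {
            "load": {
                "sourceUris": source_uris,  # e.g. ["gs://myBucket/salesforce.json"]
                "sourceFormat": "NEWLINE_DELIMITED_JSON",
                "destinationTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": table,
                },
            }
        }
    }
```

This body would be POSTed to the BigQuery jobs endpoint; BigQuery then runs the import asynchronously, and you poll the job until it completes.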

The previous method used a POST request to the Google Cloud Storage API to store the data there before loading it into BigQuery.

Another way to go is to do a direct HTTP POST request to BigQuery with the data you would like to query. This approach is similar to how we loaded the data to Google Cloud Storage through the JSON API, but it uses the appropriate end-points of BigQuery to load the data there directly.
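For example, BigQuery’s streaming endpoint (`tabledata.insertAll`) accepts a request body like the one this helper builds; the row contents here are hypothetical Salesforce records:

```python
def build_insert_all(rows):
    """Request body for BigQuery's tabledata.insertAll endpoint,
    wrapping each row in the {"json": ...} envelope the API expects."""
    return {
        "kind": "bigquery#tableDataInsertAllRequest",
        "rows": [{"json": row} for row in rows],
    }
```

You would POST this body to the `insertAll` URL of the target table, authenticated with a Google Cloud OAuth token, and check the response for per-row insert errors.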

More information can be found in the Google BigQuery API Reference and on the page that describes how to load data into BigQuery using POST. You can interact with it using the HTTP client library of the language or framework of your choice.

The best way to load data from Salesforce to Google BigQuery

So far we have just scratched the surface of what can be done with Google BigQuery and how to load data into it. The way to proceed depends heavily on the data you want to load, the service it comes from, and the requirements of your use case.

Things can get even more complicated if you want to integrate data coming from different sources.

Are you striving to achieve results right now?

Instead of writing, hosting, and maintaining a flexible data infrastructure, an alternative is to use a product like Blendo that can handle these kinds of problems automatically for you.

Blendo is an ETL tool that integrates with multiple sources or services like databases, CRM, email campaigns, analytics and more. Quickly and safely move all your data from Salesforce into Google BigQuery and start generating insights from your data.

Help your sales and executive team take ownership of the insights that live in your Salesforce CRM.

Blendo is the easiest way to automate powerful data integrations.

Try Blendo free for 14 days. No credit card required.