Some time ago we discussed how you can access data stored in Amazon Redshift and PostgreSQL with Python and R. Let’s say you have found an easy way to store a pile of data in your BigQuery data warehouse and keep it in sync. Now you want to start messing with it using statistical techniques, maybe build a model of your customers’ behaviour, or try to predict your churn rate.
To do that, you will need to extract your data from BigQuery and use a framework or language that is well suited for data analysis; the most popular so far are Python and R. In this short tutorial we will see how to extract data stored in Google BigQuery and load it with Python or R, so that you can then use the numerous analytic libraries and algorithms that exist for these two languages.
Connect to BigQuery with Python
In order to pull data out of BigQuery, or any other database, we first need to connect to our instance. To do so, we need a cloud client library for the Google BigQuery API. Although there are quite a few options, we are going to work with the google-cloud-bigquery library, which is officially supported by Google.
However, feel free to explore other third-party options like the BigQuery-Python library by tylertreat, which is also widely used, well documented and actively maintained.
In order to install the Google Cloud client library, we assume that you have already set up a Python development environment. If this is not the case, you can refer to the Python Development Environment Setup Guide.
The next step is to install the library, which can be done by executing the following command:
pip install --upgrade google-cloud-bigquery
In order to connect to the database, you need to download locally the .json key file that contains the necessary credentials for your service account. If you don’t already have a service account, you will have to create one and then download the aforementioned JSON file to your local machine.
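Alternatively, instead of passing the key file path in code, you can point the GOOGLE_APPLICATION_CREDENTIALS environment variable at it and let the client library pick the credentials up automatically. A minimal sketch, assuming the key file sits at path/to/file.json:

export GOOGLE_APPLICATION_CREDENTIALS="path/to/file.json"

With that variable set, bigquery.Client() can be constructed without explicitly passing credentials.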
Now that everything is set, we can move on and write some Python code to initialize the connection.
from google.cloud import bigquery
from google.oauth2 import service_account

# Load the service account credentials from the JSON key file
credentials = service_account.Credentials.from_service_account_file(
    'path/to/file.json')

# Instantiate a client for the project we want to query
project_id = 'my-bq'
client = bigquery.Client(credentials=credentials, project=project_id)
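To quickly verify that the connection works, you can ask the client to enumerate the datasets in the project. This is just a sanity-check sketch using the library’s list_datasets method; it assumes your project already contains at least one dataset:

# Print the ID of every dataset in the project
for dataset in client.list_datasets():
    print(dataset.dataset_id)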
As you can see, the parameters you need to specify are the project_id and the location of the JSON key file. In BigQuery, a project is the top-level container and provides default access control across all of its datasets.
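With the client initialized, pulling data out is a matter of submitting a SQL query and fetching the results, which can be loaded straight into a pandas DataFrame for analysis. Below is a small sketch against one of Google’s public datasets, assuming pandas is also installed:

# Find the ten most common names in the public USA names dataset
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""
df = client.query(query).to_dataframe()
print(df)

The to_dataframe call blocks until the query job finishes, so df contains the full result set once it returns.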