Some time ago we discussed how you can access data stored in Amazon Redshift and PostgreSQL with Python and R. Let’s say you have found an easy way to store a pile of data in your BigQuery data warehouse and keep it in sync. Now you want to start analyzing it with statistical techniques, maybe build a model of your customers’ behavior, or try to predict your churn rate.
To do that, you will need to extract your data from BigQuery and use a language or framework that is well suited for data analysis; the most popular so far are Python and R. In this short tutorial we will see how to extract data stored in Google BigQuery and load it with Python or R, so that you can then use the numerous analytic libraries and algorithms that exist for these two languages.
Connect to BigQuery with Python
In order to pull data out of BigQuery, or any other database, we first need to connect to our instance. To do so, we need a cloud client library for the Google BigQuery API. Although there are quite a few options, we are going to work with the google-cloud-bigquery library, which is supported by Google.
However, feel free to explore other third-party options like the BigQuery-Python library by tylertreat, which is also widely used, well documented and actively maintained.
In order to install the Google Cloud client library, we assume that you have already set up a Python development environment. If this is not the case, you can refer to the Python Development Environment Setup Guide.
The next step is to install the library and this can be easily achieved by executing the following command:
pip install --upgrade google-cloud-bigquery
In order to connect to the database, you need to download locally the .json file that contains the necessary credentials for your service account. If you don’t already have a service account, you will have to create one and then download the previously mentioned JSON key file to your local machine.
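For reference, a service account key file roughly follows the shape below. This is an abridged sketch with placeholder values, not a real key; the project and account names are hypothetical:

```json
{
  "type": "service_account",
  "project_id": "my-bq",
  "private_key_id": "<key-id>",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "my-service-account@my-bq.iam.gserviceaccount.com",
  "client_id": "<client-id>",
  "token_uri": "https://oauth2.googleapis.com/token"
}
```

Keep this file out of version control, since anyone holding it can act as the service account.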
Now that everything is set we can move on and write some Python code in order to initialize the connection.
from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    'path/to/file.json')
project_id = 'my-bq'
client = bigquery.Client(credentials=credentials, project=project_id)
As you can see, the parameters you need to specify are the project_id and the location of the JSON key file. In BigQuery, a project is the top-level container and provides default access control across all of its datasets.
Executing Queries with Python
With the BigQuery client, we can execute raw queries on a dataset using the query method which actually inserts a query job into the BigQuery queue.
These queries are executed asynchronously: the query method submits the job without blocking, and the program only waits for the job to complete when you ask for its results.
The query method returns a QueryJob instance which, according to the documentation, contains the produced results among other things once the job has completed.
For more details you can always refer to the official documentation.
The Python code required follows:
query_job = client.query("""
    SELECT *
    FROM dataset.my_table
    LIMIT 1000
""")
results = query_job.result()  # Waits for the job to complete.
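Once result() returns, you can iterate over the returned rows, each of which behaves like a mapping from column name to value. Since we can’t query a live warehouse here, the sketch below fakes the fetched rows as plain dicts, with a hypothetical churned column, to show the kind of quick computation you might do next:

```python
# Hypothetical rows, shaped like what iterating BigQuery results yields
# (each row behaves like a mapping from column name to value).
rows = [
    {"user_id": 1, "churned": False},
    {"user_id": 2, "churned": True},
    {"user_id": 3, "churned": False},
    {"user_id": 4, "churned": True},
]

# A toy churn-rate computation over the fetched rows.
churn_rate = sum(row["churned"] for row in rows) / len(rows)
print(churn_rate)  # 0.5
```

In practice you would usually hand the rows to a library like pandas for anything beyond a one-off calculation.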
Note that the above query uses standard SQL syntax, as the client library defaults to it. If you wish, you can change the SQL dialect to legacy as follows:
job_config = bigquery.QueryJobConfig()
job_config.use_legacy_sql = True

query_job = client.query("""
    SELECT *
    FROM dataset.my_table
    LIMIT 1000
""", job_config=job_config)
results = query_job.result()  # Waits for the job to complete.
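The two dialects differ in more than a configuration flag. One visible difference is how fully qualified table names are written: standard SQL uses backticks and dots, while legacy SQL uses square brackets and a colon between project and dataset. A small sketch of the same query in both dialects, with a hypothetical project and table:

```python
# The same query written in BigQuery's two SQL dialects
# ("my-bq", "dataset" and "my_table" are placeholder names).
standard_sql = "SELECT * FROM `my-bq.dataset.my_table` LIMIT 1000"
legacy_sql = "SELECT * FROM [my-bq:dataset.my_table] LIMIT 1000"

print(standard_sql)
print(legacy_sql)
```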
If Python is not your cup of tea and you prefer R instead, you are still covered. Getting your data from Google BigQuery is equally easy as in Python – or even easier.
Connect to BigQuery with R
Again we are going to use an open-source library called bigrquery, which is created and maintained by Hadley Wickham, Chief Scientist at RStudio. In order to install it, we simply have to run the following command from within the R console:

install.packages("bigrquery")
Executing Queries with R
According to the library documentation, the first time we use it we’ll need to authorize the application to access Google Cloud services. As documented on the Authentication section of the bigrquery GitHub page, we’ll follow the prompts within R to open the authorization URL and then copy the authorization code back into R. You’ll only need to authorize the library once; requests performed after the first authorized one will refresh the access credentials.
So the overall procedure that we are going to follow can be summarized as:
- Import the bigrquery library.
- Specify the project ID from the Google Cloud Console, as we did with Python.
- Form your query string.
- Call query_exec with your project ID and query string.
# Import the library
library(bigrquery)

# Your project ID from the Google Cloud Console
project_id <- "your-project-id"

# Example query
sql_string <- "SELECT * FROM dataset.my_table LIMIT 1000"

# Execute the query and store the result
query_results <- query_exec(sql_string,
                            project = project_id,
                            useLegacySql = FALSE)
Again, if you wish, you can execute queries using legacy SQL by setting useLegacySql to TRUE in the query_exec function.
As you can see, getting your data out of BigQuery for further analysis in Python or R is really easy. The true power of a database over flat files such as CSVs is that you get SQL as an additional tool. Invest some time in learning how to work with SQL and you will not regret it: keeping your data structured in a database and using SQL to pre-process it before you start building your statistical models will save you both time and resources.
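To illustrate the point about pre-processing in SQL, here is a small self-contained sketch that uses Python’s built-in sqlite3 standing in for BigQuery: instead of pulling every raw event into Python and aggregating there, the GROUP BY runs inside the database and only the per-user totals come back. The table and values are made up for the example.

```python
import sqlite3

# Toy events table standing in for a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5), (2, 2.5), (3, 1.0)],
)

# Push the aggregation into SQL; only three summary rows reach Python.
totals = conn.execute(
    "SELECT user_id, SUM(amount) FROM events GROUP BY user_id ORDER BY user_id"
).fetchall()
print(totals)  # [(1, 15.0), (2, 10.0), (3, 1.0)]
```

The same idea scales up: a query that aggregates billions of rows in BigQuery may return only a few thousand rows to your model-building code.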
Although in this article we focused mainly on BigQuery, using any other database is equally easy. The main difference will be the selection of the appropriate library for Python and for R.
Would love to hear your thoughts in the comments below.