BigQuery
BigQuery is Google Cloud's fully managed, serverless, petabyte-scale, SQL data warehouse that lets you run analytics over vast amounts of data.
https://cloud.google.com/bigquery/docs/quickstarts
https://cloud.google.com/bigquery/docs/quickstarts/quickstart-command-line
https://cloud.google.com/bigquery/docs/sandbox
https://cloud.google.com/bigquery/docs/omni-aws-introduction
https://cloud.google.com/bigquery/docs/omni-azure-introduction
https://cloud.google.com/bigquery/docs/external-data-cloud-storage
https://codelabs.developers.google.com/codelabs/bigquery-cli
https://github.com/googleapis/python-bigquery
BigQuery has a distributed architecture running on thousands of nodes across Google's data centers. Your datasets are not stored on a single server; they are chunked and replicated across many machines.
The storage and compute layers are fully decoupled in BigQuery. This means that the query engine runs on different servers from the servers where the data is stored. This feature enables BigQuery to provide great scalability both in terms of data volume and query execution. This decoupled paradigm is only possible thanks to Google's Petabit network, which moves data very quickly from one server to another, leveraging Google's proprietary fiber cables across the globe.
Unlike traditional data warehouses, BigQuery stores data in columnar format in Colossus, Google's distributed file system (the successor to Google File System).
Fully decoupled from storage, the compute layer is responsible for receiving query statements from BigQuery users and executing them in the fastest way. The query engine is based on Dremel.
BigQuery is an append-only database: when rows are updated, new rows are appended to the table rather than being modified in place.
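A common consequence of this append-only layout is that a table can hold several versions of the same logical row. One well-known pattern reads only the latest version per key using a window function with `QUALIFY` (real GoogleSQL syntax); the table and column names below are hypothetical:

```python
# Hypothetical table `my_project.my_dataset.events` keyed by `id`, with an
# `updated_at` timestamp column. QUALIFY filters on the window function result.
LATEST_ROWS_SQL = """
SELECT *
FROM `my_project.my_dataset.events`
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) = 1
"""
```

The query keeps exactly one row per `id`: the one with the most recent `updated_at`.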
BigQuery supports Standard SQL.
Each node provides a number of processing units called BigQuery slots to execute the business logic of the query. A BigQuery slot can be considered a virtual CPU on a Dremel node. The calculation of the slots needed to perform a specific query is automatically managed depending on the complexity of the query and impacted data volumes.
https://medium.com/google-cloud/a-seniors-guide-to-kickstart-your-bigquery-journey-75566e131983
Each time BigQuery executes a query, it performs a full scan of the referenced columns. BigQuery doesn't use or support indexes. Because BigQuery performance and query costs are based on the amount of data scanned during a query, design your queries so that they reference only the columns that are relevant to the query.
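A toy model makes the billing implication concrete: in a columnar store, bytes scanned depend only on which columns a query references, not on `WHERE` filters or `LIMIT`. The column widths below are made up for illustration:

```python
# Toy model of BigQuery's columnar scan billing. Column widths are
# hypothetical; real tables report exact sizes via a dry run.
ROWS = 1_000_000
column_width_bytes = {"user_id": 8, "event_ts": 8, "payload": 500}

def bytes_scanned(columns):
    """Estimate bytes scanned for the referenced columns."""
    return sum(column_width_bytes[c] * ROWS for c in columns)

full = bytes_scanned(column_width_bytes)         # SELECT * reads every column
narrow = bytes_scanned(["user_id", "event_ts"])  # only the needed columns
```

Dropping the wide `payload` column from the select list cuts the scan from 516 MB to 16 MB in this sketch, which is why `SELECT *` is discouraged.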
https://cloud.google.com/files/BigQueryTechnicalWP.pdf
https://medium.com/google-cloud/bigquery-basics-internals-part-2-7769f59d01e4
https://cloud.google.com/bigquery/docs/querying-partitioned-tables
https://cloud.google.com/architecture/bigquery-data-warehouse#query_optimization
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax
The Query Plan explanation breaks down the stages that the query went through, the number of input/output rows handled at each stage, and the timing profile within each stage. Using the results from the explanation can help you understand and optimize your queries.
There's no infrastructure to manage in BigQuery. Developers focus on finding insights using standard SQL, with on-demand or flat-rate pricing options.
BigQuery is designed to ingest and store large amounts of data and make that data accessible for large-scale analytics. BigQuery stores each column in separate compressed files, a layout that works well with distributed file systems like Colossus.
https://cloud.google.com/blog/topics/developers-practitioners/bigquery-admin-reference-guide-storage
BigQuery uses dynamic query planning and repartitions data across shards via in-memory shuffle for optimal distributed performance.
Clustered tables complement partitioning and can further improve query performance.
Learn how to load CSV data in batch and analyze in BigQuery.
BigQuery is fully managed and lets you search through terabytes of data in seconds.
You can upload data files from local sources, Google Drive, or Cloud Storage buckets, take advantage of BigQuery Data Transfer Service (DTS), Cloud Data Fusion plug-ins, or leverage Google's industry-leading data integration partnerships.
https://cloud.google.com/blog/topics/data-warehousing/announcing-bigquery-migration-service
Google BigQuery provides benchmark data in the Wiki100B table. This table contains 100 billion rows and is about 7 TB in size.
https://cloud.google.com/blog/products/gcp/anatomy-of-a-bigquery-query
BigQuery is a fast, petabyte-scale analytics database. To achieve that level of performance, BigQuery executes queries completely in memory by leveraging Google's petabit-scale networking technologies, such as Andromeda and Jupiter.
Shuffle is required for the execution of large and complex joins, aggregations, and analytic operations.
In-memory BigQuery shuffle stores intermediate data produced from various stages of query processing in a set of nodes that are dedicated to hosting remote memory.
When BigQuery executes a query job, it converts the declarative SQL statement into a graph of execution, broken up into a series of query stages, which themselves are composed of more granular sets of execution steps.
When evaluating your input data, consider the required I/O. How many bytes does your query read? Are you properly limiting the amount of input data? Is your data in native BigQuery storage or an external data source? The amount of data read by a query and the source of the data impact query performance and cost.
BigQuery Spanner federation enables BigQuery to query data residing in Spanner in real-time, without copying or moving data.
https://cloud.google.com/bigquery/docs/cloud-spanner-federated-queries
When evaluating your communication between slots consider the amount of shuffling that is required by your query. How many bytes are passed between stages? How many bytes are passed to each slot? The amount of data that is shuffled directly impacts communication throughput and query performance.
https://boonepeter.github.io/posts/unnecessary_bigquery_optimization
BigQuery lets you use time travel to access data stored in BigQuery that has been changed or deleted. You can access the data from any point within the last seven days. You can use time travel to query data that was updated or deleted, restore a table that was deleted, or restore a table that expired.
https://cloud.google.com/bigquery/docs/time-travel
https://medium.com/codex/more-options-to-restore-your-data-in-google-bigquery-181a32f7fa76
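Time travel is expressed with the `FOR SYSTEM_TIME AS OF` clause (real GoogleSQL syntax); the table name below is hypothetical, and the timestamp must fall within the time-travel window (up to seven days):

```python
# Query the table as it existed one hour ago. The table name is a
# placeholder; FOR SYSTEM_TIME AS OF is BigQuery's time-travel clause.
TIME_TRAVEL_SQL = """
SELECT *
FROM `my_project.my_dataset.orders`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
"""
```

The same clause accepts an absolute `TIMESTAMP` literal, which is how you would pin a restore point before a bad write.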
When evaluating the computation that is required by a query, consider the amount of work that is required. How much CPU time is required? Are you using functions like JavaScript user-defined functions that require additional CPU resources?
When evaluating your output data, consider the number of bytes written by your query. How many bytes are written for your result set? Are you properly limiting the amount of data written? Are you repeatedly writing the same data? The amount of data written by a query impacts query performance (I/O). If you are writing results to a permanent (destination) table, the amount of data written also has a cost.
BI Engine is an in-memory analysis service that helps customers get low latency performance for their queries across all BI tools that connect to BigQuery.
https://cloud.google.com/bigquery/docs/bi-engine-intro
https://dzone.com/articles/cloud-data-warehouse-comparison-redshift-vs-bigque
https://medium.com/99dotco/a-migration-misstep-from-redshift-to-bigquery-13e9000c3f50
Avoid query anti-patterns that impact performance in BigQuery.
https://towardsdatascience.com/how-to-use-partitions-and-clusters-in-bigquery-using-sql-ccf84c89dd65
A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of bytes read by a query.
You can partition BigQuery tables by:
- Time-unit column: Tables are partitioned based on a TIMESTAMP, DATE, or DATETIME column in the table.
- Ingestion time: Tables are partitioned based on the timestamp when BigQuery ingests the data.
- Integer range: Tables are partitioned based on an integer column.
If a query filters on the value of the partitioning column, BigQuery can scan the partitions that match the filter and skip the remaining partitions. This process is called pruning.
https://cloud.google.com/bigquery/docs/managing-partitioned-tables
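A minimal sketch, carried as SQL strings: `PARTITION BY DATE(...)` is real GoogleSQL DDL for a time-unit partitioned table, while the project, dataset, and column names are hypothetical. Filtering on the partitioning column is what enables pruning:

```python
# DDL for a table partitioned on the DATE of a TIMESTAMP column.
CREATE_PARTITIONED = """
CREATE TABLE `my_project.my_dataset.events`
(
  user_id INT64,
  event_ts TIMESTAMP,
  payload STRING
)
PARTITION BY DATE(event_ts)
"""

# Because the WHERE clause constrains the partitioning column, BigQuery
# can scan one partition and skip the rest.
PRUNED_QUERY = """
SELECT user_id
FROM `my_project.my_dataset.events`
WHERE DATE(event_ts) = '2024-01-15'
"""
```

Wrapping the partition column in other functions (or comparing it to a non-constant expression) can defeat pruning, so keep the filter directly on the partitioning column.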
A dataset is contained within a specific project. Datasets are top-level containers that are used to organize and control access to your tables and views. A table or view must belong to a dataset, so you need to create at least one dataset before loading data into BigQuery.
BigQuery datasets are subject to the following limitations:
- You can set the geographic location at creation time only. After a dataset has been created, the location becomes immutable and can't be changed by using the Cloud Console, using the bq command-line tool, or calling the patch or update API methods. All tables that are referenced in a query must be stored in datasets in the same location.
- When you copy a table, the datasets that contain the source table and destination table must reside in the same location.
- Dataset names must be unique for each project.
https://cloud.google.com/bigquery/docs/datasets-intro
After you create a dataset, you can update the following dataset properties:
- Description
- Default expiration time for new tables
- Default partition expiration for new partitioned tables
- Access controls
- Labels
https://cloud.google.com/bigquery/docs/updating-datasets
https://medium.com/cstech/google-bigquery-data-update-optimization-9d788bfe811b
https://cloud.google.com/bigquery/docs/updating-datasets#table-expiration
A BigQuery table contains individual records organized in rows. Each record is composed of columns (also called fields).
Every table is defined by a schema that describes the column names, data types, and other information. You can specify the schema of a table when it is created, or you can create a table without a schema and declare the schema in the query job or load job that first populates it with data.
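A schema can be written down in the JSON format that the bq CLI accepts (for example via `bq load --schema=schema.json ...`). The field names below are hypothetical; `name`, `type`, and `mode` are the documented keys:

```python
import json

# A table schema in bq's JSON schema format. REPEATED mode declares an
# array column; REQUIRED declares NOT NULL.
schema = [
    {"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
    {"name": "event_ts", "type": "TIMESTAMP", "mode": "NULLABLE"},
    {"name": "tags", "type": "STRING", "mode": "REPEATED"},
]
schema_json = json.dumps(schema, indent=2)
```

Writing `schema_json` to a file gives you a schema definition you can version-control alongside your load jobs.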
BigQuery supports the following table types:
- Native tables: tables backed by native BigQuery storage.
- External tables: tables backed by storage external to BigQuery. For more information, see Querying External Data Sources.
- Views: Virtual tables defined by a SQL query. For more information, see Creating views.
https://cloud.google.com/bigquery/docs/tables-intro
https://cloud.google.com/bigquery/docs/schemas
When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query that contains a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data. You might not see a significant difference in query performance between a clustered and unclustered table if the table or partition is under 1 GB.
https://cloud.google.com/bigquery/docs/clustered-tables
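`CLUSTER BY` is real GoogleSQL DDL (the names below are hypothetical), and a toy model shows the block-elimination idea: data sorted on the clustering key is stored in blocks, each tracked by a [min, max] range, and a filter only scans blocks whose range can contain the value:

```python
# DDL sketch: partition by day, cluster within each partition.
CREATE_CLUSTERED = """
CREATE TABLE `my_project.my_dataset.events`
PARTITION BY DATE(event_ts)
CLUSTER BY customer_id, country
AS SELECT * FROM `my_project.my_dataset.staging_events`
"""

# Toy block elimination: each block stores its (min, max) key range.
blocks = [("a", "f"), ("g", "m"), ("n", "s"), ("t", "z")]

def blocks_to_scan(value):
    """Return only the blocks whose key range can contain `value`."""
    return [b for b in blocks if b[0] <= value <= b[1]]
```

A point filter on `"h"` touches one of four blocks in this sketch; that skipped I/O is where clustering's savings come from, which is also why the savings shrink for tables under ~1 GB.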
Both partitioning and clustering can improve performance and reduce query costs. Use clustering when you don't need strict cost guarantees before running the query. Use partitioning when you want to know query costs before a query runs. You may prefer clustering over partitioning when partitioning would result in a small amount of data per partition (approximately less than 1 GB).
https://cloud.google.com/bigquery/docs/partitioned-tables#partitioning_versus_clustering
A view is a virtual table defined by a SQL query. When you create a view, you query it in the same way you query a table. When a user queries the view, the query results contain data only from the tables and fields specified in the query that defines the view.
https://cloud.google.com/bigquery/docs/views-intro
In BigQuery, materialized views are precomputed views that periodically cache the results of a query for increased performance and efficiency. BigQuery leverages precomputed results from materialized views and whenever possible reads only delta changes from the base table to compute up-to-date results. Materialized views can be queried directly or can be used by the BigQuery optimizer to process queries to the base tables.
Queries that use materialized views are generally faster and consume fewer resources than queries that retrieve the same data only from the base table. Materialized views can significantly improve the performance of workloads that have the characteristic of common and repeated queries.
https://cloud.google.com/bigquery/docs/materialized-views-intro
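`CREATE MATERIALIZED VIEW` is real GoogleSQL DDL; the names below are hypothetical, and note that materialized views support a restricted query shape (such as aggregations over a single base table):

```python
# A materialized view that pre-aggregates daily event counts. BigQuery
# maintains it incrementally from deltas on the base table.
CREATE_MV = """
CREATE MATERIALIZED VIEW `my_project.my_dataset.daily_totals` AS
SELECT DATE(event_ts) AS day, COUNT(*) AS events
FROM `my_project.my_dataset.events`
GROUP BY day
"""
```

Queries against the base table that match this aggregation can be rewritten by the optimizer to read the materialized view instead, without the query author changing anything.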
https://cloud.google.com/blog/products/data-analytics/extending-bigquery-functions
https://cloud.google.com/bigquery/docs/reference/standard-sql/remote-functions
https://towardsdatascience.com/remote-functions-in-bigquery-af9921498438
BigQuery Reservations enables you to switch between on-demand pricing and flat-rate pricing. With flat-rate pricing, you purchase dedicated query processing capacity. You can allocate this capacity across your organization, by reserving pools of capacity for different projects or different parts of your organization. You can also combine the two billing models, taking advantage of both on-demand and flat-rate pricing.
https://cloud.google.com/bigquery/docs/reservations-intro
https://cloud.google.com/bigquery/docs/scheduling-queries
https://jimbeepbeep.medium.com/getting-started-with-bigquery-scripting-45bdd968010c
https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting
https://cloud.google.com/blog/products/data-analytics/bigquery-audit-logs-pipelines-analysis
https://cloud.google.com/bigquery/docs/controlling-costs
https://fares-daoud.medium.com/how-i-have-optimized-bigquery-costs-for-my-company-948df95b9f0d
Resource Hierarchy.
https://cloud.google.com/bigquery/docs/resource-hierarchy
Dataform is a platform to manage data in BigQuery, Snowflake, Redshift, and other data warehouses.
https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions
https://towardsdatascience.com/bigquery-udfs-complete-guide-181cbdaea55b
https://medium.com/codex/using-collation-in-google-bigquery-e63d34ee4799
https://tufin.medium.com/testable-bigquery-sql-61a911e35ab5
https://cloud.google.com/bigquery/docs/reference/standard-sql/transactions
https://cloud.google.com/bigquery/docs/sessions-write-queries
https://dev.to/stack-labs/bigquery-transactions-over-multiple-queries-with-sessions-2ll5
Data Fusion is built on the open source project CDAP. It is a GUI-based data integration service for building and managing data pipelines.
https://cloud.google.com/data-fusion/
https://codelabs.developers.google.com/codelabs/batch-csv-cdf-bq
Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. BigQuery is based on Dremel.
https://www.wired.com/2012/08/googles-dremel-makes-big-data-look-small/
Dremel has been in production at Google since 2006. A selection of use cases for Dremel at Google includes analysis of:
- Crawled web documents
- Spam
- Build system results
- Crash reports
Further, there are two ways to use Dremel outside of Google. The first is Google's BigQuery service, which Google provides as part of its cloud offering. The second is Apache Drill, effectively an open source re-implementation of Dremel. Cloudera Impala is also influenced by Dremel, as are Presto and Dremio. All of these address SQL-over-Hadoop issues.
http://www.goldsborough.me/distributed-systems/2019/05/18/21-09-00-a_look_at_dremel/
INFORMATION_SCHEMA is a series of views that provide access to metadata about datasets, routines, tables, views, jobs, reservations, and streaming data.
https://cloud.google.com/bigquery/docs/information-schema-intro
You can query the INFORMATION_SCHEMA.JOBS_TIMELINE_BY_* views to retrieve real-time BigQuery metadata by timeslice. These views contain currently running and completed jobs. Data is retained for 180 days.
https://cloud.google.com/bigquery/docs/information-schema-jobs-timeline
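A sketch of a slot-usage query over the jobs timeline, carried as a SQL string. The view name follows the documented pattern `` `region-<REGION>`.INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT ``; the `region-us` qualifier is an assumption for illustration:

```python
# Per-second slot consumption for jobs in the last day. period_start,
# period_slot_ms, job_id, and state are documented columns of the view.
JOBS_TIMELINE_SQL = """
SELECT period_start, job_id, state, SUM(period_slot_ms) AS slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT
WHERE period_start >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY period_start, job_id, state
ORDER BY period_start DESC
"""
```

Summing `period_slot_ms` per timeslice is a quick way to see whether queries are contending for slots during peak hours.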
https://www.youtube.com/watch?v=1gYUGv_omJA
https://www.youtube.com/watch?v=STo98QUKDS8
https://levelup.gitconnected.com/enhancing-bigquery-search-features-with-search-index-771c1eec186e
Compare and review various data warehousing solutions: BigQuery, Snowflake, and Redshift.
A star schema is a data warehouse model in which a central fact table is joined to a number of dimension tables; it is named for its star-like structure. The star schema is the simplest type of data warehouse schema. It is also known as a star join schema and is optimized for querying large data sets.
A snowflake schema is a logical arrangement of tables in a multidimensional database such that the ER diagram resembles a snowflake. It is an extension of a star schema in which the dimension tables are normalized, splitting data into additional tables.
A galaxy schema contains two or more fact tables that share dimension tables between them. It is also called a fact constellation schema; it is viewed as a collection of stars, hence the name.
https://www.guru99.com/star-snowflake-data-warehousing.html
Denormalization is a strategy used on a previously normalized database to increase performance. It improves the read performance of a database, at the expense of some write performance, by adding redundant copies of data or by grouping data.
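A toy illustration of the trade-off: joining a normalized customer table onto its orders once, at write time, produces a denormalized table whose reads need no join, at the cost of storing the customer name redundantly on every order row. The data below is invented:

```python
# Normalized source: customers keyed by id, orders referencing customer_id.
customers = {1: "Acme", 2: "Globex"}
orders = [(101, 1, 50.0), (102, 1, 75.0), (103, 2, 20.0)]

# Denormalized target: the customer name is copied onto every order row,
# so read queries never need to join back to `customers`.
denormalized = [
    {"order_id": oid, "customer_id": cid,
     "customer_name": customers[cid], "amount": amt}
    for oid, cid, amt in orders
]
```

In BigQuery this pattern is common because storage is cheap relative to compute, and nested/repeated columns (STRUCT and ARRAY) let you denormalize without fully flattening.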
https://cloud.google.com/bigquery/docs/reference/standard-sql/transactions
https://medium.com/@wojcikpawel/exactly-once-delivery-in-bigquerys-storage-write-api-67885c5c5e16
https://cloud.google.com/bigquery/public-data
https://medium.com/codex/bigquery-now-supporting-query-queues-378a65fdc9c1
BigQuery supports two different SQL dialects: standard SQL and legacy SQL. Legacy SQL may be useful if you want to test queries coming from legacy applications.
https://cloud.google.com/blog/topics/developers-practitioners/bigquery-explained-querying-your-data
BigQuery was developed as an internal product at Google and was initially built to process log records. The Dremel query engine supported a limited set of SQL operations that are now defined as legacy SQL.
https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql
https://cloud.google.com/blog/topics/developers-practitioners/shine-user-friendly-sql-bigquery
https://cloud.google.com/bigquery/docs/query-overview
https://cloud.google.com/bigquery/docs/analytics-hub-introduction
BigQuery BI Engine is a fast, in-memory analysis service that allows you to analyze data stored in BigQuery.
https://cloud.google.com/bigquery/docs/bi-engine-intro
https://cloud.google.com/blog/products/data-analytics/bigquery-bi-engine-generally-available
https://cloud.google.com/bigquery/docs/bi-engine-data-studio
https://cloud.google.com/bigquery/docs/analyze-data-looker
https://cloud.google.com/bigquery/docs/analyze-data-tableau
https://cloud.google.com/bigquery/docs/connected-sheets
https://cloud.google.com/bigquery/docs/bigquery-connector-for-excel
Data governance is a principled approach to manage data during its lifecycle — from acquisition, to use, to disposal. Your data governance program clearly outlines policies, procedures, responsibilities, and controls surrounding data activities. This program helps to ensure that information is collected, maintained, used, and disseminated in such a way that both meets your organization's data integrity and security needs, and also helps empower your employees to discover and use the data to its fullest potential.
https://cloud.google.com/bigquery/docs/data-governance
https://medium.com/@VishalBulbule/access-control-in-bigquery-d5d800654f47
https://cloud.google.com/bigquery/docs/access-control-examples
https://cloud.google.com/bigquery/docs/encryption-at-rest
https://cloud.google.com/bigquery/docs/column-level-security
https://medium.com/plumbersofdatascience/restrict-access-to-columns-on-bigquery-1550895b3356
https://medium.com/codex/google-improves-data-security-in-bigquery-195a90cc5b85
https://medium.com/plumbersofdatascience/dynamic-data-masking-on-bigquery-ae3d004b496c
With Connected Sheets, you can access, analyze, visualize, and share billions of rows of BigQuery data from your Sheets spreadsheet.
You can also do the following:
- Collaborate with partners, analysts, or other stakeholders in a familiar spreadsheet interface.
- Ensure a single source of truth for data analysis without additional spreadsheet exports.
- Streamline your reporting and dashboard workflows.
https://cloud.google.com/bigquery/docs/connected-sheets
BigQuery is a serverless data analytics platform. You don't need to provision individual instances or virtual machines to use BigQuery. Instead, BigQuery automatically allocates computing resources as you need them. You can also reserve compute capacity ahead of time in the form of slots, which represent virtual CPUs. The pricing structure of BigQuery reflects this design.
BigQuery pricing has two main components:
- Analysis pricing is the cost to process queries, including SQL queries, user-defined functions, scripts, and certain data manipulation language (DML) and data definition language (DDL) statements that scan tables.
- Storage pricing is the cost to store data that you load into BigQuery.
https://cloud.google.com/bigquery/pricing
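A back-of-envelope on-demand estimate can be sketched in a few lines. The $5-per-TiB rate below is an assumption for illustration only; actual on-demand rates vary by region and change over time, so check the pricing page:

```python
# Hypothetical rate: on-demand analysis at $5 per TiB scanned.
PRICE_PER_TIB = 5.00
TIB = 1024 ** 4  # bytes per tebibyte

def estimated_cost(bytes_scanned):
    """Rough on-demand cost for a query, under the assumed rate."""
    return bytes_scanned / TIB * PRICE_PER_TIB

# A query scanning 500 GiB:
cost = estimated_cost(500 * 1024 ** 3)
```

Pairing this arithmetic with a dry run (which reports the bytes a query would scan without executing it) gives a cost estimate before any money is spent.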
https://cloud.google.com/blog/products/data-analytics/introducing-bigquery-flex-slots
https://cloud.google.com/bigquery/pricing#flat_rate_pricing
https://cloud.google.com/blog/topics/developers-practitioners/controlling-your-bigquery-costs
https://medium.com/google-cloud/bigquery-tell-me-your-region-i-will-tell-you-your-speed-41dcf42b8cc
https://cloud.google.com/bigquery/docs/sandbox
Looker is an enterprise platform for business intelligence, data applications, and embedded analytics. Looker helps you explore, share, and visualize your company's data so that you can make better business decisions.
https://cloud.google.com/bigquery/docs/looker
BigQuery GIS uniquely combines the serverless architecture of BigQuery with native support for geospatial analysis, so you can augment your analytics workflows with location intelligence.
https://mentin.medium.com/bigquery-geospatial-query-tricks-8ebb4453ab5e
BigQuery Omni is a flexible, fully managed, multi-cloud analytics solution that allows you to analyze data across clouds such as AWS and Azure.
https://medium.com/google-cloud/bigquery-omni-is-everywhere-afa2b5f64688
BigQuery ML enables users to create and execute machine learning models in BigQuery by using SQL queries.
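Training in BigQuery ML is expressed with `CREATE MODEL` (real syntax); the dataset, columns, and model choice below are hypothetical:

```python
# Train a linear regression predicting `fare` from trip features.
# model_type and input_label_cols are documented CREATE MODEL options.
CREATE_MODEL_SQL = """
CREATE OR REPLACE MODEL `my_project.my_dataset.fare_model`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['fare']) AS
SELECT trip_miles, trip_minutes, fare
FROM `my_project.my_dataset.taxi_trips`
"""
```

Once trained, the model is queried with `ML.PREDICT` in ordinary SQL, so the whole workflow stays inside BigQuery.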
https://medium.com/paypal-tech/comparing-bigquery-processing-and-spark-dataproc-4c90c10e31ac
https://betterprogramming.pub/4-ways-big-query-metadata-can-help-you-2cdf3b899fbc
https://medium.com/@erkan.ekser/how-to-keep-metadata-of-all-tables-in-bigquery-125516742bad
https://cloud.google.com/bigquery/docs/pandas-gbq-migration
https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas
https://medium.com/codex/improved-storage-read-api-quotas-in-google-bigquery-f415a4c27bf1
https://cloud.google.com/bigquery/docs/visualize-jupyter
https://medium.com/codeshake/bigquery-101-how-to-tame-the-beast-part-3-212356720b18
https://cloud.google.com/bigquery/docs/tutorials
https://jimbeepbeep.medium.com/google-cloud-storage-gcs-to-bigquery-the-simple-way-4bb74216b8c8
https://blog.coupler.io/bigquery-tutorial/
https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-mapreduce-example
https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example
https://cloud.google.com/bigquery/docs/share-access-views
https://cloud.google.com/bigquery/docs/gis-tutorial-hurricane
https://medium.com/google-cloud/how-to-integrate-external-data-sources-with-bigquery-9e126d5751ea
https://blog.fourninecloud.com/how-to-sync-data-from-mysql-to-bigquery-9ef980ef602c
https://towardsdatascience.com/3-ways-to-query-bigquery-in-python-66838f45cb43
https://cloud.google.com/blog/products/data-analytics/genomics-data-analytics-with-cloud-pt2
https://soumendra-mishra.medium.com/bigquery-dynamic-sql-using-jinja-template-5c1332317960
https://medium.com/@suzane.gregatti/project-overview-in-bigquery-with-dynamic-sql-846350f0c343
https://medium.com/@urruchua.xabier/bike-share-chicago-case-study-72444a268dd1
https://towardsdatascience.com/know-more-about-your-app-users-through-bigquery-4c0b6d67abfa
https://github.com/ploomber/ploomber
https://medium.com/google-cloud/simplifying-data-quality-analysis-808e9fb8667f
https://medium.com/google-cloud/streaming-data-into-bigquery-using-google-cloud-run-469365a731b9
https://towardsdatascience.com/migrating-from-aws-glue-to-bigquery-for-etl-ac12980f2036
https://medium.com/@michalwesleymnach/an-introduction-to-dynamic-sql-in-bigquery-4c8bb8d6dde7
https://towardsdatascience.com/the-fastest-way-to-fetch-bigquery-tables-352e2e26c9e1
https://towardsdatascience.com/bigquery-fetching-multiprocessing-dcb79de50108
https://medium.com/teads-engineering/managing-a-bigquery-data-warehouse-at-scale-e6ec9a8406b2
https://medium.com/geekculture/mathematical-functions-you-should-know-in-bigquery-ee674109be6d
https://towardsdatascience.com/slack-alerts-from-a-sql-query-on-bigquery-f626b767304c
https://medium.com/sardineai/open-sourcing-protobuf-to-bigquery-converter-c9168046b36b
https://towardsdev.com/retrieve-your-bigquery-query-history-with-nodejs-sdk-6671dc5be503
https://medium.com/codex/collaborate-better-with-data-versioning-566c2299c435
https://conalldalydev.medium.com/why-i-built-the-python-bigquery-validator-package-3f2b32e9bc5b
https://blog.devgenius.io/cool-bigquery-features-using-standard-sql-syntax-e7a47ef9b72e
https://chriskyfung.github.io/blog/qwiklabs/Insights-from-Data-with-BigQuery-Challenge-Lab
https://briansuk.medium.com/connecting-steampipe-with-google-bigquery-ae37f258090f
https://blog.devgenius.io/evaluate-arithmetic-expressions-without-values-using-bigquery-a4abd99f0932
https://medium.com/google-cloud/google-analytics-data-transfer-to-bigquery-fad388ae646a
https://medium.com/google-cloud/connect-oracle-to-bigquery-using-db-link-6f2040336a47
https://datadice.medium.com/raw-google-analytics-4-ga4-data-in-bigquery-bq-9cb776ce1f3d
https://towardsdatascience.com/measuring-string-similarity-in-bigquery-using-sql-33c490638c89
https://medium.com/google-cloud/streaming-json-messages-into-bigquery-json-type-column-7b9702a49a36
https://www.qwiklabs.com/quests/69
Build and Optimize Data Warehouses with BigQuery
Insights from Data with BigQuery
Create ML Models with BigQuery ML
NCAA® March Madness®: Bracketology with Google Cloud