Leaf #0 – Top 10 Techniques for Google BigQuery Optimization
Dear friends, Happy International Women’s Day. The other day I unexpectedly met a friend called Lakshmi at the meetup where we get together to share what we are learning about big data on the cloud. Lakshmi said that the dataottam team is doing awesome work through blogs, meetups, and much more, and we were energized by the appreciation. As a token of thanks, we at dataottam have started an initiative called Leaf (Learning from this week), where we will bring you insights and value from our deep-dive learning of that week.
In this Leaf #0 episode let’s converse about Google BigQuery optimization. BigQuery is Google’s serverless, highly scalable, low-cost enterprise data warehouse designed to make all our data analysts productive. We can use SQL to analyze the data and find meaningful insights from large data sets. BigQuery enables us to analyze all our data by creating a logical data warehouse over managed, columnar storage; apart from that, it allows us to capture and analyze data in real time using its powerful streaming ingestion capability. And there is cheering news: BigQuery is free for up to 1 TB of data analyzed each month and 10 GB of data stored. All my free GCP trainings and projects run well without spending a penny.
When we use Google BigQuery, our data is stored on the Colossus file system, the same technology that underpins other major Google services, and all data is encrypted at rest. We can interact with BigQuery in many ways: the web console, client-side APIs (Python, Go, C#, Java, Node.js, PHP, and Ruby), a command-line interface, standard interfaces like JDBC and ODBC, and third-party tools (e.g. Matillion). Whenever we execute a SQL statement, it is parsed and optimized internally by an execution engine called Dremel. To achieve its vital aim of scalability, Google internally splits every SQL task across many worker compute nodes called slots, an approach known as MPP (Massively Parallel Processing).
- Avoid SELECT *; it reads every column of the table. Select only the specific columns you need to reduce the amount of data scanned (sketched below).
- LIMIT and WHERE clauses don’t reduce the query cost. LIMIT is a final step that restricts the amount of data shown after the full query has executed, and a WHERE clause is applied only after the data has been scanned. LIMIT is still useful with ORDER BY, since BigQuery then only has to keep the top N rows instead of sorting the full result. If the same table has to be filtered in many different ways, query it just once using a superset of all the necessary WHERE conditions, materialize an intermediate table, and then repeatedly query that intermediate table with the individual WHERE clauses (sketched below).
- A WITH clause (CTE, Common Table Expression) is not materialized at runtime; it is re-evaluated every time it is referenced, so for reused results prefer an intermediate table over a CTE (the same intermediate-table sketch below applies).
- Leverage approximate aggregation functions, for example APPROX_COUNT_DISTINCT instead of an exact COUNT(DISTINCT …), when a small estimation error is acceptable (sketched below).
- Don’t use SELECT * … LIMIT just to inspect sample rows; instead use the table preview in the web console, the `bq head` command, or the API.
- Avoid HAVING clauses where possible; filter with WHERE before aggregation instead, to reduce the amount of data being shuffled (sketched below).
- Optimize our JOINs by placing the largest table first, the smallest table next, and then the remaining tables in decreasing order of size (sketched below).
- Date partitioning a table is a great advantage: we can simply use the _PARTITIONTIME pseudo column so that a date filter fetches data only from the matching partitions, which reduces the amount of data scanned (sketched below).
- Use SELECT with a table wildcard in the FROM clause when querying manually sharded (date-suffixed) tables, so only the required shards are read (sketched below).
- Leverage analytic (window) functions such as SUM, CORR, STDDEV, and many more to compute aggregates alongside row-level data in a single pass, reducing duplicated scanning effort (sketched below).
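Here are small sketches of the tips above in BigQuery Standard SQL. All table and column names (such as `myproject.mydataset.orders`) are hypothetical, illustrative assumptions, not a real schema.

First, the SELECT * tip: only the referenced columns are read from columnar storage, so the second query scans and bills far less data.

```sql
-- Reads every column of the table (expensive on wide tables).
SELECT *
FROM `myproject.mydataset.orders`;

-- Reads only the three columns actually needed, so far less data is scanned.
SELECT order_id, customer_id, order_total
FROM `myproject.mydataset.orders`;
```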
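For the LIMIT/WHERE and WITH tips, a sketch of the intermediate-table pattern under the same hypothetical schema: scan the raw table once with a superset of the filters, materialize the result, and run the repeated queries against the smaller table instead of re-evaluating a CTE.

```sql
-- One scan of the large raw table, with a superset of all the filters we will need.
CREATE OR REPLACE TABLE `myproject.mydataset.orders_2019` AS
SELECT order_id, customer_id, country, order_total
FROM `myproject.mydataset.orders`
WHERE order_date BETWEEN DATE '2019-01-01' AND DATE '2019-12-31';

-- The repeated, narrower queries now hit the much smaller intermediate table.
SELECT COUNT(*) FROM `myproject.mydataset.orders_2019` WHERE country = 'IN';
SELECT SUM(order_total) FROM `myproject.mydataset.orders_2019` WHERE country = 'US';
```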
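A sketch of the approximate-aggregation tip on a hypothetical events table; APPROX_COUNT_DISTINCT is based on HyperLogLog++ and trades a small estimation error for much lower memory and shuffle cost.

```sql
-- Exact distinct count: accurate, but memory- and shuffle-intensive on huge tables.
SELECT COUNT(DISTINCT user_id) AS exact_users
FROM `myproject.mydataset.clickstream_events`;

-- Approximate distinct count: much cheaper, with a small statistical error.
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM `myproject.mydataset.clickstream_events`;
```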
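A sketch contrasting HAVING and WHERE for the same filter; pushing the condition into WHERE removes the unwanted rows before the GROUP BY, so less data is shuffled between slots.

```sql
-- HAVING filters after aggregation: every country is grouped first, then most are discarded.
SELECT country, SUM(order_total) AS revenue
FROM `myproject.mydataset.orders`
GROUP BY country
HAVING country IN ('IN', 'US');

-- WHERE filters before aggregation: only the two countries' rows are shuffled and grouped.
SELECT country, SUM(order_total) AS revenue
FROM `myproject.mydataset.orders`
WHERE country IN ('IN', 'US')
GROUP BY country;
```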
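A sketch of the join-ordering tip with hypothetical tables of very different sizes: the largest table comes first, the smallest second, and the rest follow in decreasing size, which lets BigQuery broadcast the small table to the slots.

```sql
SELECT e.event_id, c.country_name, u.user_segment
FROM `myproject.mydataset.clickstream_events` AS e   -- largest table first (billions of rows)
JOIN `myproject.mydataset.countries` AS c            -- smallest table second (hundreds of rows)
  ON e.country_code = c.country_code
JOIN `myproject.mydataset.users` AS u                -- remaining tables in decreasing size (millions of rows)
  ON e.user_id = u.user_id;
```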
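A sketch of the date-partitioning tip, assuming a hypothetical ingestion-time partitioned table; the _PARTITIONTIME filter prunes partitions so only one week of data is scanned.

```sql
-- Only the partitions for the requested days are scanned.
SELECT order_id, order_total
FROM `myproject.mydataset.orders_partitioned`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2019-03-01')
                         AND TIMESTAMP('2019-03-07');
```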
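A sketch of the wildcard-table tip for manually sharded, date-suffixed tables (hypothetical `events_YYYYMMDD` shards); the _TABLE_SUFFIX filter limits the query to the shards of interest.

```sql
-- Reads only the daily shards for the first week of March 2019.
SELECT event_id, user_id
FROM `myproject.mydataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20190301' AND '20190307';
```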
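Finally, a sketch of the analytic-function tip: the window SUM attaches each country's total revenue to every row in a single pass, instead of scanning the table again for a self-join against a pre-aggregated result.

```sql
SELECT
  order_id,
  country,
  order_total,
  -- Window aggregate computed alongside the row-level columns in one scan.
  SUM(order_total) OVER (PARTITION BY country) AS country_revenue
FROM `myproject.mydataset.orders`;
```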
Click here for more details… happy reading and happy sharing.
See you in the next Leaf (Learning from this week).