I was testing a few things about Bigquery arrays with my coworker Reese. The table below was created by her.
Create an array in Bigquery
Reese’s way
My way
They are essentially the same. Union all is ‘,’ and array() is []. I am just lazier to type the words, plus I do not like having select inside select. I prefer the logic to be shown in the simplest way in the code.
The output table looks like this
Array_to_string and to_json_string
The next thing we tried is to make the table shown as string, not array.
We name the table we just created data. So the simplest way Reese used is below
select name, TO_JSON_STRING(pets), sum(pageviews)
from data
group by 1,2
The output is below
Updated with Mosha Pasumansky ‘s comment
Another way to make it string format is by using
select name, format(“%t”,pets)pets,pageviews
from data
And the results look like below
Array Aggregation
Next trying to combine the arrays
select data.name, array_agg(p)
from data, unnest(pets) as p
group by 1
select data.name, array_agg(p)
, array_agg(pageviews) as page_views
from data, unnest(pets) as p
group by 1
If adding pageviews to the array_agg, the pageviews were populated for each row. So it is kind of duplicated.
If you do a sum for the pageviews, you will get the sum of all the duplicated rows.
select data.name, array_agg(p)
, sum(pageviews) as pageviews
from data, unnest(pets) as p
group by 1
So currently the only way we can think of to get the data we want is by doing to_json_string.
Query from JSON string
I was reading Bigquery’s documentation about json_extract and it has this example.
SELECT JSON_EXTRACT(json_text, '$.class.students[0]') AS first_student
FROM UNNEST([
'{"class" : {"students" : [{"name" : "Jane"}]}}',
'{"class" : {"students" : []}}',
'{"class" : {"students" : [{"name" : "John"}, {"name": "Jamie"}]}}'
]) AS json_text;+-----------------+
| first_student |
+-----------------+
| {"name":"Jane"} |
| NULL |
| {"name":"John"} |
+-----------------+
I tried the same on my pets dataset above and it did not work. Turned out it is because my json string is in a list. To make it work, the syntax should be
select pets, json_extract_scalar(pets,’$[0].name’) as name
How to keep rows with empty arrays
Reese and I recently found that if using unnest directly, the rows with empty arrays will not appear. The solution is to left join unnest. To demonstrate, first we created a table with empty arrays.
create table array_test (
some_field string,
some_other_field array<struct<a string, b string>>
) as (
select ‘a’ some_field, null some_other_field union all
select ‘b’ some_field, [] some_other_field union all
select ‘c’ some_field, [struct(‘:)’ as a, ‘:(‘ as b), struct(‘:D’ as a, ‘>:(‘ as b)]
)
Now we use unnest directly
SELECT
some_field,
a
FROM
array_test,
UNNEST(some_other_field)
It will only give you the none empty row
Now if we use left join unnest below
SELECT
some_field,
a
FROM
array_test
LEFT JOIN
UNNEST(some_other_field)
The results include the rows with empty arrays.
Resources:
Array_agg: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#array_agg
Array_to_string: https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays#converting-arrays-to-strings
Array_to_string: https://cloud.google.com/bigquery/docs/reference/standard-sql/array_functions#array_to_string
To_json_string: https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions#to_json_string
Json_extract and json_extract_scalar:
https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions#to_json_string