This notebook is an exercise in the SQL course. You can reference the tutorial at this link.

Introduction

介绍

Try writing some SELECT statements of your own to explore a large dataset of air pollution measurements.

尝试编写一些自己的 SELECT 语句来探索空气污染测量的大型数据集。

Run the cell below to set up the feedback system.

运行下面的单元格来设置反馈系统。

# Set up feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.sql.ex2 import *
print("Setup Complete")

Using Kaggle's public dataset BigQuery integration.
Setup Complete

The code cell below fetches the global_air_quality table from the openaq dataset. We also preview the first five rows of the table.

下面的代码单元从openaq数据集中获取global_air_quality表。我们还预览表的前五行。

from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "openaq" dataset
dataset_ref = client.dataset("openaq", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "global_air_quality" table
table_ref = dataset_ref.table("global_air_quality")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the "global_air_quality" table
client.list_rows(table, max_results=5).to_dataframe()

Using Kaggle's public dataset BigQuery integration.

	location	city	country	pollutant	value	timestamp	unit	source_name	latitude	longitude	averaged_over_in_hours	location_geom
0	Borówiec, ul. Drapałka	Borówiec	PL	bc	0.85217	2022-04-28 07:00:00+00:00	µg/m³	GIOS	1.0	52.276794	17.074114	POINT(52.276794 1)
1	Kraków, ul. Bulwarowa	Kraków	PL	bc	0.91284	2022-04-27 23:00:00+00:00	µg/m³	GIOS	1.0	50.069308	20.053492	POINT(50.069308 1)
2	Płock, ul. Reja	Płock	PL	bc	1.41000	2022-03-30 04:00:00+00:00	µg/m³	GIOS	1.0	52.550938	19.709791	POINT(52.550938 1)
3	Elbląg, ul. Bażyńskiego	Elbląg	PL	bc	0.33607	2022-05-03 13:00:00+00:00	µg/m³	GIOS	1.0	54.167847	19.410942	POINT(54.167847 1)
4	Piastów, ul. Pułaskiego	Piastów	PL	bc	0.51000	2022-05-11 05:00:00+00:00	µg/m³	GIOS	1.0	52.191728	20.837489	POINT(52.191728 1)

Exercises

练习

1) Units of measurement

1) 测量单位

Which countries have reported pollution levels in units of "ppm"? In the code cell below, set first_query to an SQL query that pulls the appropriate entries from the country column.

哪些国家以ppm为单位报告污染水平？在下面的代码单元中，将first_query设置为从country列中提取适当条目的 SQL 查询。

In case it's useful to see an example query, here's some code from the tutorial:

如果查看示例查询有用，请参阅本教程中的一些代码：

query = """
        SELECT city
        FROM `bigquery-public-data.openaq.global_air_quality`
        WHERE country = 'US'
        """

# Query to select countries with units of "ppm"
first_query = """
            SELECT distinct country
            FROM `bigquery-public-data.openaq.global_air_quality`
            WHERE unit = 'ppm'
            """ # Your code goes here

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
first_query_job = client.query(first_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
first_results = first_query_job.to_dataframe()

# View top few rows of results
print(first_results.head())

# Check your answer
q_1.check()

  country
0      AR
1      IL
2      TW
3      CO
4      EC

Correct

For the solution, uncomment the line below.

q_1.solution()

Solution:


first_query = """
              SELECT country
              FROM `bigquery-public-data.openaq.global_air_quality`
              WHERE unit = "ppm"
              """

# Or to get each country just once, you could use
first_query = """
              SELECT DISTINCT country
              FROM `bigquery-public-data.openaq.global_air_quality`
              WHERE unit = "ppm"
              """

2) High air quality

2) 空气质量高

Which pollution levels were reported to be exactly 0?

据报告，哪些污染水平恰好为 0？

Set zero_pollution_query to select all columns of the rows where the value column is 0.
设置zero_pollution_query以选择value列为 0 的行的所有列。
Set zero_pollution_results to a pandas DataFrame containing the query results.
将 zero_pollution_results 设置为包含查询结果的 pandas DataFrame。

# Query to select all columns where pollution levels are exactly 0
zero_pollution_query = """
            SELECT *
            FROM `bigquery-public-data.openaq.global_air_quality`
            WHERE value = 0
            """ # Your code goes here

# Set up the query
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(zero_pollution_query, job_config=safe_config)

# API request - run the query and return a pandas DataFrame
zero_pollution_results = query_job.to_dataframe() # Your code goes here

# print(zero_pollution_results.head())
zero_pollution_results.head()

	location	city	country	pollutant	timestamp	unit	source_name	latitude	longitude	averaged_over_in_hours	location_geom
0	Żary, ul. Szymanowskiego 8	Żary	PL	bc	2022-05-05 02:00:00+00:00	µg/m³	GIOS	1.0	51.642656	15.127808	POINT(51.642656 1)
1	Starachowice, ul. Złota	Starachowice	PL	bc	2022-05-08 11:00:00+00:00	µg/m³	GIOS	1.0	51.050611	21.084175	POINT(51.050611 1)
2	Kraków, ul. Bulwarowa	Kraków	PL	bc	2022-05-07 13:00:00+00:00	µg/m³	GIOS	1.0	50.069308	20.053492	POINT(50.069308 1)
3	Zielonka, Bory Tucholskie	Zielonka	PL	bc	2022-05-15 11:00:00+00:00	µg/m³	GIOS	1.0	53.662136	17.933986	POINT(53.662136 1)
4	Żagań, ul. Kochanowskiego	Żagań	PL	bc	2022-05-02 13:00:00+00:00	µg/m³	GIOS	1.0	51.615447	15.301667	POINT(51.615447 1)

# Check your answer
q_2.check()

Correct

For the solution, uncomment the line below.

对于解决方案，请取消注释下面的行。

q_2.solution()

Solution:


zero_pollution_query = """
                       SELECT *
                       FROM `bigquery-public-data.openaq.global_air_quality`
                       WHERE value = 0
                       """

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(zero_pollution_query, job_config=safe_config)

zero_pollution_results = query_job.to_dataframe()

That query wasn't too complicated, and it got the data you want. But these SELECT queries don't organizing data in a way that answers the most interesting questions. For that, we'll need the GROUP BY command.

该查询并不太复杂，并且它得到了您想要的数据。但这些 SELECT 查询不会以回答最有趣问题的方式组织数据。为此，我们需要 GROUP BY 命令。

If you know how to use groupby() in pandas, this is similar. But BigQuery works quickly with far larger datasets.

如果您知道如何在 pandas 中使用 groupby()，则这是类似的。但 BigQuery 可以快速处理更大的数据集。

Fortunately, that's next.

幸运的是，接下来就是这样了。

Keep going

继续前进

GROUP BY clauses and their extensions give you the power to pull interesting statistics out of data, rather than receiving it in just its raw format.

GROUP BY 语句及其扩展使您能够从数据中提取有趣的统计数据，而不是接收数据只是它的原始格式。

02.exercise-select-from-where【练习：SELECT、FROM和WHERE】