Flashield's Blog

Just For My Daily Diary

Flashield's Blog

Just For My Daily Diary

05.exercise-as-with【练习:AS和WITH】

This notebook is an exercise in the SQL course. You can reference the tutorial at this link.


import pandas as pd

Introduction

介绍

You are getting to the point where you can own an analysis from beginning to end. So you'll do more data exploration in this exercise than you've done before. Before you get started, run the following set-up code as usual.

您现在已经可以从头到尾进行分析了。 因此,您将在本练习中进行比以前更多的数据探索。 在开始之前,请照常运行以下设置代码。

# Set up feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.sql.ex5 import *
print("Setup Complete")
Using Kaggle's public dataset BigQuery integration.
Setup Complete

You'll work with a dataset about taxi trips in the city of Chicago. Run the cell below to fetch the chicago_taxi_trips dataset.

您将使用有关芝加哥市出租车行程的数据集。 运行下面的单元格以获取chicago_taxi_trips数据集。

from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "chicago_taxi_trips" dataset
dataset_ref = client.dataset("chicago_taxi_trips", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)
Using Kaggle's public dataset BigQuery integration.

Exercises

练习

You are curious how much slower traffic moves when traffic volume is high. This involves a few steps.

您想知道当交通流量很大时,交通速度会减慢多少。 这涉及几个步骤。

1) Find the data

1) 查找数据

Before you can access the data, you need to find the table name with the data.

在访问数据之前,您需要找到包含数据的表名。

Hint: Tab completion is helpful whenever you can't remember a command. Type client. and then hit the tab key. Don't forget the period before hitting tab.

提示:每当您忘记命令时,制表符补全都会很有帮助。 输入client.,然后按 Tab 键。 不要忘记在点击 Tab 之前需要加上句号。

# Your code here to find the table name
tables = list(client.list_tables(dataset))
for tab in tables:
    print(tab.full_table_id, tab.table_id)
bigquery-public-data:chicago_taxi_trips.taxi_trips taxi_trips
# Write the table name as a string below
# table_name = 'taxi_trips'
table_name = tab.table_id

# Check your answer
q_1.check()

Correct

For the solution, uncomment the line below.

对于解决方案,请取消注释下面的行。

q_1.solution()

Solution:


# List all the tables in the dataset
tables = list(client.list_tables(dataset))

# Print names of all tables in the dataset (there is only one!)
for table in tables:  
    print(table.table_id)

table_name = 'taxi_trips'

2) Peek at the data

2) 查看数据

Use the next code cell to peek at the top few rows of the data. Inspect the data and see if any issues with data quality are immediately obvious.

使用下一个代码单元格查看数据的前几行。 检查数据并查看数据质量问题是否很明显。

# Your code here
client.list_rows(tables[0], max_results=5).to_dataframe()
unique_key taxi_id trip_start_timestamp trip_end_timestamp trip_seconds trip_miles pickup_census_tract dropoff_census_tract pickup_community_area dropoff_community_area ... extras trip_total payment_type company pickup_latitude pickup_longitude pickup_location dropoff_latitude dropoff_longitude dropoff_location
0 89dbb2fc5b8dad6e95f3da319010d3995b9aed3a 4efb4a59d8c600a9d2a8d14362769bed234568a573dc5f... 2014-12-01 18:30:00+00:00 2014-12-01 18:30:00+00:00 420 0.8 NaN NaN NaN NaN ... 1.0 6.65 Cash None NaN NaN None NaN NaN None
1 33d91796f8cc1c00b68addf4b95b5c0dc01c95d0 ac3197af61ac119894f5d339834595e2d1e28f8d6609e7... 2014-11-28 06:30:00+00:00 2014-11-28 06:45:00+00:00 540 2.2 NaN NaN NaN NaN ... 0.0 8.25 Cash None NaN NaN None NaN NaN None
2 9759ecc279a90f20e346d0eab065a8275d28b2e0 ac3197af61ac119894f5d339834595e2d1e28f8d6609e7... 2014-11-28 08:15:00+00:00 2014-11-28 08:15:00+00:00 180 0.5 NaN NaN NaN NaN ... 0.0 4.45 Cash None NaN NaN None NaN NaN None
3 a3059e04c539b9f333116268ee80bc8790e31b16 ac3197af61ac119894f5d339834595e2d1e28f8d6609e7... 2014-11-28 05:30:00+00:00 2014-11-28 05:30:00+00:00 180 0.6 NaN NaN NaN NaN ... 0.0 6.84 Credit Card None NaN NaN None NaN NaN None
4 e9fbfd63f479544264ab62ffee1663644f089c4f ac3197af61ac119894f5d339834595e2d1e28f8d6609e7... 2014-11-28 06:15:00+00:00 2014-11-28 06:15:00+00:00 480 2.5 NaN NaN NaN NaN ... 0.0 8.25 Cash None NaN NaN None NaN NaN None

5 rows × 23 columns

After deciding whether you see any important issues, run the code cell below.

确定您是否发现任何重要问题后,运行下面的代码单元。

# Check your answer (Run this code cell to receive credit!)
q_2.solution()

Solution:
You can see the data by calling:

# Construct a reference to the "taxi_trips" table
table_ref = dataset_ref.table("taxi_trips")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the "taxi_trips" table
client.list_rows(table, max_results=5).to_dataframe()

Some location fields have values of None or NaN. That is a problem if we want to use those fields.

3) Determine when this data is from

3) 确定该数据的来源时间

If the data is sufficiently old, we might be careful before assuming the data is still relevant to traffic patterns today. Write a query that counts the number of trips in each year.

如果数据相当陈旧,我们在假设数据仍然与当今的流量模式相关之前可能要小心。 编写一个查询来计算每年的出行次数。

Your results should have two columns:

您的结果应该有两列:

  • year - the year of the trips
  • year - 旅行的年份
  • num_trips - the number of trips in that year
  • num_trips - 那一年的旅行次数

Hints:

提示:

  • When using GROUP BY and ORDER BY, you should refer to the columns by the alias year that you set at the top of the SELECT query.
  • 使用 GROUP BYORDER BY 时,您应该通过在 SELECT 查询顶部设置的别名year引用列。
  • The SQL code to SELECT the year from trip_start_timestamp is SELECT EXTRACT(YEAR FROM trip_start_timestamp)
  • trip_start_timestamp SELECT 年份的 SQL 代码是 SELECT EXTRACT(YEAR FROM trip_start_timestamp)
  • The FROM field can be a little tricky until you are used to it. The format is:
  • FROM 字段可能有点棘手,直到您习惯为止。 格式为:
    1. A backick (the symbol `).
    2. 反斜杠(符号`)。
    3. The project name. In this case it is bigquery-public-data.
    4. 项目名称。 在本例中,它是bigquery-public-data
    5. A period.
    6. 一个句点。
    7. The dataset name. In this case, it is chicago_taxi_trips.
    8. 数据集名称。 在本例中,它是chicago_taxi_trips
    9. A period.
    10. 一个句点。
    11. The table name. You used this as your answer in 1) Find the data.
    12. 表名。 您将此作为 1) 查找数据 中的答案。
    13. A backtick (the symbol `).
    14. 反引号(符号`)。
# Your code goes here
rides_per_year_query = """SELECT EXTRACT(YEAR FROM trip_start_timestamp) AS year, 
                              COUNT(1) AS num_trips
                       FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
                       GROUP BY year
                       order by year"""

# Set up the query (cancel the query if it would use too much of 
# your quota)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
rides_per_year_query_job = client.query(rides_per_year_query, job_config=safe_config) #____ # Your code goes here
# API request - run the query, and return a pandas DataFrame
rides_per_year_result = rides_per_year_query_job.to_dataframe() #____ # Your code goes here

# View results
print(rides_per_year_result)
   year  num_trips
0  2013    8057496
1  2014   14975761
2  2015   13942179
3  2016   20503949
4  2017   22837699
5  2018   18347736
6  2019    3876273
7  2022      35089
8  2023      13102
# Check your answer
q_3.check()

Correct

For a hint or the solution, uncomment the appropriate line below.

如需提示或解决方案,请取消注释下面相应的行。

# q_3.hint()
q_3.solution()

Solution:


rides_per_year_query = """
                       SELECT EXTRACT(YEAR FROM trip_start_timestamp) AS year, 
                              COUNT(1) AS num_trips
                       FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
                       GROUP BY year
                       ORDER BY year
                       """

# Set up the query (cancel the query if it would use too much of 
# your quota)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
rides_per_year_query_job = client.query(rides_per_year_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
rides_per_year_result = rides_per_year_query_job.to_dataframe()

4) Dive slightly deeper

4) 稍微深入一点

You'd like to take a closer look at rides from 2016. Copy the query you used above in rides_per_year_query into the cell below for rides_per_month_query. Then modify it in two ways:

您想仔细查看 2016 年的游乐设施。将上面在 rides_per_year_query 中使用的查询复制到下面的 rides_per_month_query 单元格中。 然后通过两种方式修改:

  1. Use a WHERE clause to limit the query to data from 2016.
  2. 使用 WHERE 子句将查询限制为 2016 年以来的数据。
  3. Modify the query to extract the month rather than the year.
  4. 修改查询以提取月份而不是年份。
# Your code goes here
rides_per_month_query = """SELECT EXTRACT(MONTH FROM trip_start_timestamp) AS month, 
                              COUNT(1) AS num_trips
                       FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
                       where EXTRACT(YEAR FROM trip_start_timestamp) = 2016
                       GROUP BY month
                       order by month""" 

# Set up the query
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
rides_per_month_query_job = client.query(rides_per_month_query, job_config=safe_config) # Your code goes here

# API request - run the query, and return a pandas DataFrame
rides_per_month_result = rides_per_month_query_job.to_dataframe() # Your code goes here

# View results
print(rides_per_month_result)

# Check your answer
q_4.check()
    month  num_trips
0       1     694951
1       2     713740
2       3     784291
3       4    1387172
4       5    1215852
5       6    1235555
6       7    2209304
7       8    2533202
8       9    2488305
9      10    2681787
10     11    2315812
11     12    2243978

Correct

For a hint or the solution, uncomment the appropriate line below.

如需提示或解决方案,请取消注释下面相应的行。

#q_4.hint()
q_4.solution()

Solution:


rides_per_month_query = """
                        SELECT EXTRACT(MONTH FROM trip_start_timestamp) AS month, 
                               COUNT(1) AS num_trips
                        FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
                        WHERE EXTRACT(YEAR FROM trip_start_timestamp) = 2016
                        GROUP BY month
                        ORDER BY month
                        """

# Set up the query (cancel the query if it would use too much of 
# your quota)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
rides_per_month_query_job = client.query(rides_per_month_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
rides_per_month_result = rides_per_month_query_job.to_dataframe()

5) Write the query

5) 编写查询

It's time to step up the sophistication of your queries. Write a query that shows, for each hour of the day in the dataset, the corresponding number of trips and average speed.

是时候提高查询的复杂性了。 编写一个查询,显示数据集中一天内每个小时相应的出行次数和平均速度。

Your results should have three columns:

您的结果应包含三列:

  • hour_of_day - sort by this column, which holds the result of extracting the hour from trip_start_timestamp.
  • hour_of_day - 按此列排序,该列保存从 trip_start_timestamp 中提取小时的结果。
  • num_trips - the count of the total number of trips in each hour of the day (e.g. how many trips were started between 6AM and 7AM, independent of which day it occurred on).
  • num_trips - 一天中每个小时的行程总数(例如,早上 6 点到上午 7 点之间开始的行程次数,与发生在哪一天无关)。
  • avg_mph - the average speed, measured in miles per hour, for trips that started in that hour of the day. Average speed in miles per hour is calculated as 3600 * SUM(trip_miles) / SUM(trip_seconds). (The value 3600 is used to convert from seconds to hours.)
  • avg_mph - 当天该小时开始的行程的平均速度,以英里每小时为单位。 平均速度(以英里/小时为单位)的计算方式为3600 * SUM(trip_miles) / SUM(trip_seconds)。 (值 3600 用于从秒转换为小时。)

Restrict your query to data meeting the following criteria:

将查询限制为满足以下条件的数据:

  • a trip_start_timestamp between 2016-01-01 and 2016-04-01
  • 2016-01-012016-04-01 之间的 trip_start_timestamp
  • trip_seconds > 0 and trip_miles > 0
  • trip_seconds > 0 和 trip_miles > 0

You will use a common table expression (CTE) to select just the relevant rides. Because this dataset is very big, this CTE should select only the columns you'll need to create the final output (though you won't actually create those in the CTE -- instead you'll create those in the later SELECT statement below the CTE).

您将使用通用表表达式 (CTE) 来选择相关游乐设施。 因为这个数据集非常大,所以这个 CTE 应该只选择创建最终输出所需的列(尽管您实际上不会在 CTE 中创建这些列 - 取而代之,您将CTE 下方的 SELECT 语句中创建这些列 )。

This is a much harder query than anything you've written so far. Good luck!

这是一个比您迄今为止编写的任何查询都要困难得多的查询。 祝你好运!

# Your code goes here
speeds_query = """
               WITH RelevantRides AS
               (
                   SELECT EXTRACT(HOUR FROM trip_start_timestamp) AS hour_of_day, 
                          trip_miles, 
                          trip_seconds
                   FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
                   WHERE trip_start_timestamp > '2016-01-01' AND 
                         trip_start_timestamp < '2016-04-01' AND 
                         trip_seconds > 0 AND 
                         trip_miles > 0
               )
               SELECT hour_of_day, 
                      COUNT(1) AS num_trips, 
                      3600 * SUM(trip_miles) / SUM(trip_seconds) AS avg_mph
               FROM RelevantRides
               GROUP BY hour_of_day
               ORDER BY hour_of_day
               """

# Set up the query
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
speeds_query_job = client.query(speeds_query, safe_config) # Your code here

# API request - run the query, and return a pandas DataFrame
speeds_result = speeds_query_job.to_dataframe() # Your code here

# View results
print(speeds_result)

# Check your answer
q_5.check()
    hour_of_day  num_trips    avg_mph
0             0      55346  24.217470
1             1      47918  21.583542
2             2      36281  22.022014
3             3      25443  23.573634
4             4      19259  37.167362
5             5      23802  41.129593
6             6      38221  34.517589
7             7      74513  22.413419
8             8     106311  20.154420
9             9      99494  25.419499
10           10      89949  26.275635
11           11     102920  25.496273
12           12     105822  24.385969
13           13     107502  24.925393
14           14     104854  23.887590
15           15     111775  22.742421
16           16     123901  21.252329
17           17     141727  17.366909
18           18     149813  20.053263
19           19     133832  23.953912
20           20     105990  26.816491
21           21      98439  29.798311
22           22      85482  26.649130
23           23      66243  24.840900

Correct

For the solution, uncomment the appropriate line below.

对于解决方案,请取消注释下面相应的行。

q_5.solution()

Solution:


speeds_query = """
               WITH RelevantRides AS
               (
                   SELECT EXTRACT(HOUR FROM trip_start_timestamp) AS hour_of_day, 
                          trip_miles, 
                          trip_seconds
                   FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
                   WHERE trip_start_timestamp > '2016-01-01' AND 
                         trip_start_timestamp < '2016-04-01' AND 
                         trip_seconds > 0 AND 
                         trip_miles > 0
               )
               SELECT hour_of_day, 
                      COUNT(1) AS num_trips, 
                      3600 * SUM(trip_miles) / SUM(trip_seconds) AS avg_mph
               FROM RelevantRides
               GROUP BY hour_of_day
               ORDER BY hour_of_day
               """

# Set up the query (cancel the query if it would use too much of 
# your quota)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
speeds_query_job = client.query(speeds_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
speeds_result = speeds_query_job.to_dataframe()

# View results
print(speeds_result)

That's a hard query. If you made good progress towards the solution, congratulations!

这是一个很难的查询。 如果您在解决方案方面取得了良好进展,那么恭喜您!

Keep going

继续前进

You can write very complex queries now with a single data source. But nothing expands the horizons of SQL as much as the ability to combine or JOIN tables.

您现在可以使用单个数据源编写非常复杂的查询。 但没有什么比组合或 JOIN 表的能力更能扩展 SQL 的视野了。

Click here to start the last lesson in the Intro to SQL micro-course.

单击此处开始 SQL 简介微课程中的最后一课。

05.exercise-as-with【练习:AS和WITH】

Leave a Reply

Your email address will not be published. Required fields are marked *

Scroll to top