Flashield's Blog

Just For My Daily Diary

Flashield's Blog

Just For My Daily Diary

04.exercise-order-by【练习:ORDER By】

This notebook is an exercise in the SQL course. You can reference the tutorial at this link.


Introduction

介绍

You've built up your SQL skills enough that the remaining hands-on exercises will use different datasets than you see in the explanations. If you need to get to know a new dataset, you can run a couple of SELECT queries to extract and review the data you need.

您已经拥有了足够的 SQL 技能,剩下的实践练习将使用与您在解释中看到的不同的数据集。 如果您需要了解新的数据集,可以运行几个 SELECT 查询来提取和查看所需的数据。

The next exercises are also more challenging than what you've done so far. Don't worry, you are ready for it!

接下来的练习也比您到目前为止所做的更具挑战性。 别担心,你已经准备好了!

Run the code in the following cell to get everything set up:

运行以下单元格中的代码以完成所有设置:

# Set up feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.sql.ex4 import *
print("Setup Complete")
Using Kaggle's public dataset BigQuery integration.
Setup Complete

The World Bank has made tons of interesting education data available through BigQuery. Run the following cell to see the first few rows of the international_education table from the world_bank_intl_education dataset.

世界银行通过 BigQuery 提供了大量有趣的教育数据。 运行以下单元格以查看world_bank_intl_education数据集中international_education表的前几行。

from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "world_bank_intl_education" dataset
dataset_ref = client.dataset("world_bank_intl_education", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "international_education" table
table_ref = dataset_ref.table("international_education")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the "international_education" table
client.list_rows(table, max_results=5).to_dataframe()
Using Kaggle's public dataset BigQuery integration.
country_name country_code indicator_name indicator_code value year
0 Chad TCD Enrolment in lower secondary education, both s... UIS.E.2 321921.0 2012
1 Chad TCD Enrolment in upper secondary education, both s... UIS.E.3 68809.0 2006
2 Chad TCD Enrolment in upper secondary education, both s... UIS.E.3 30551.0 1999
3 Chad TCD Enrolment in upper secondary education, both s... UIS.E.3 79784.0 2007
4 Chad TCD Repeaters in primary education, all grades, bo... UIS.R.1 282699.0 2006

Exercises

练习

The value in the indicator_code column describes what type of data is shown in a given row.

indicator_code列中的值描述了给定行中显示的数据类型。

One interesting indicator code is SE.XPD.TOTL.GD.ZS, which corresponds to "Government expenditure on education as % of GDP (%)".

一个有趣的指标代码是SE.XPD.TOTL.GD.ZS,它对应于政府教育支出占 GDP 的百分比 (%)

1) Government expenditure on education

1) 政府教育支出

Which countries spend the largest fraction of GDP on education?

哪些国家在教育上的支出占国内生产总值的比例最大?

To answer this question, consider only the rows in the dataset corresponding to indicator code SE.XPD.TOTL.GD.ZS, and write a query that returns the average value in the value column for each country in the dataset between the years 2010-2017 (including 2010 and 2017 in the average).

要回答此问题,请仅考虑数据集中与指标代码SE.XPD.TOTL.GD.ZS对应的行,并编写一个查询,返回数据集中2010-2017年(包括2010年和2017年的平均值)每个国家/地区的value列的平均值 。

Requirements:

要求:

  • Your results should have the country name rather than the country code. You will have one row for each country.
  • 您的结果应包含国家/地区名称而不是国家/地区代码。 每个国家/地区都有一行。
  • The aggregate function for average is AVG(). Use the name avg_ed_spending_pct for the column created by this aggregation.
  • 平均值的聚合函数是AVG()。 使用名称avg_ed_spending_pct作为此聚合函数创建的列。
  • Order the results so the countries that spend the largest fraction of GDP on education show up first.
  • 对结果进行排序,以便教育支出占 GDP 比例最大的国家首先出现。

In case it's useful to see a sample query, here's a query you saw in the tutorial (using a different dataset):

查看案例中的示例会有所帮助,这里是您在教程中看到的查询(使用不同的数据集):

# Query to find out the number of accidents for each day of the week
query = """
        SELECT COUNT(consecutive_number) AS num_accidents, 
               EXTRACT(DAYOFWEEK FROM timestamp_of_crash) AS day_of_week
        FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2015`
        GROUP BY day_of_week
        ORDER BY num_accidents DESC
        """
# Your code goes here
country_spend_pct_query = """
                          SELECT country_name, AVG(value) as avg_ed_spending_pct
                          FROM `bigquery-public-data.world_bank_intl_education.international_education`
                          WHERE indicator_code='SE.XPD.TOTL.GD.ZS' and year between 2010 and 2017
                          GROUP BY country_name 
                          ORDER BY AVG(value) desc
                          """

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 1 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
country_spend_pct_query_job = client.query(country_spend_pct_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
country_spending_results = country_spend_pct_query_job.to_dataframe()

# View top few rows of results
print(country_spending_results.head())

# Check your answer
q_1.check()
            country_name  avg_ed_spending_pct
0                   Cuba            12.837270
1  Micronesia, Fed. Sts.            12.467750
2        Solomon Islands            10.001080
3                Moldova             8.372153
4                Namibia             8.349610

Correct

For a hint or the solution, uncomment the appropriate line below.

如需提示或解决方案,请取消注释下面相应的行。

# q_1.hint()
q_1.solution()

Solution:


country_spend_pct_query = """
                          SELECT country_name, AVG(value) AS avg_ed_spending_pct
                          FROM `bigquery-public-data.world_bank_intl_education.international_education`
                          WHERE indicator_code = 'SE.XPD.TOTL.GD.ZS' and year >= 2010 and year <= 2017
                          GROUP BY country_name
                          ORDER BY avg_ed_spending_pct DESC
                          """

2) Identify interesting codes to explore

2) 确定要探索的有趣代码

The last question started by telling you to focus on rows with the code SE.XPD.TOTL.GD.ZS. But how would you find more interesting indicator codes to explore?

最后一个问题首先告诉您关注代码为SE.XPD.TOTL.GD.ZS的行。 但如何找到更有趣的指标代码来探索呢?

There are 1000s of codes in the dataset, so it would be time consuming to review them all. But many codes are available for only a few countries. When browsing the options for different codes, you might restrict yourself to codes that are reported by many countries.

数据集中有数千个代码,因此检查所有代码将非常耗时。 但许多代码仅适用于少数国家/地区。 在浏览不同代码的选项时,您可能会将自己限制使用许多国家/地区都有的代码。

Write a query below that selects the indicator code and indicator name for all codes with at least 175 rows in the year 2016.

在下面编写一个查询,查找在 2016 年中所有代码里至少有 175 行数据的指标代码和指标名称。

Requirements:

要求:

  • You should have one row for each indicator code.
  • 每个指标代码应该占一行。
  • The columns in your results should be called indicator_code, indicator_name, and num_rows.
  • 结果中的列应称为indicator_codeindicator_namenum_rows
  • Only select codes with 175 or more rows in the raw database (exactly 175 rows would be included).
  • 仅选择原始数据库中具有 175 行或更多行的代码(包含 175 行)。
  • To get both the indicator_code and indicator_name in your resulting DataFrame, you need to include both in your SELECT statement (in addition to a COUNT() aggregation). This requires you to include both in your GROUP BY clause.
  • 要在生成的 DataFrame 中同时获取indicator_codeindicator_name,您需要将两者包含在 SELECT 语句中(除了 COUNT() 聚合之外)。 这要求您将两者都包含在 GROUP BY 子句中。
  • Order from results most frequent to least frequent.
  • 按最频繁到最不频繁的结果排序。
# Your code goes here
code_count_query = """
                  SELECT indicator_code, indicator_name, count(1) as num_rows
                  FROM `bigquery-public-data.world_bank_intl_education.international_education`
                  WHERE year = 2016
                  GROUP BY indicator_code, indicator_name
                  having count(1) >= 175
                  ORDER BY count(1) desc
                  """

# Set up the query
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
code_count_query_job = client.query(code_count_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
code_count_results = code_count_query_job.to_dataframe()

# View top few rows of results
print(code_count_results.head())

# Check your answer
q_2.check()
      indicator_code                       indicator_name  num_rows
0        SP.POP.TOTL                    Population, total       232
1        SP.POP.GROW         Population growth (annual %)       232
2     IT.NET.USER.P2      Internet users (per 100 people)       223
3  SP.POP.TOTL.FE.ZS      Population, female (% of total)       213
4        SH.DYN.MORT  Mortality rate, under-5 (per 1,000)       213

Correct

For a hint or the solution, uncomment the appropriate line below.

如需提示或解决方案,请取消注释下面相应的行。

# q_2.hint()
q_2.solution()

Solution:


code_count_query = """
                   SELECT indicator_code, indicator_name, COUNT(1) AS num_rows
                   FROM `bigquery-public-data.world_bank_intl_education.international_education`
                   WHERE year = 2016
                   GROUP BY indicator_name, indicator_code
                   HAVING COUNT(1) >= 175
                   ORDER BY COUNT(1) DESC
                   """

Keep Going

继续前进

Click here to learn how to use AS and WITH to clean up your code and help you construct more complex queries.

点击此处 了解如何使用 ASWITH 来清理您的代码并帮助您构建 更复杂的查询。

04.exercise-order-by【练习:ORDER By】

Leave a Reply

Your email address will not be published. Required fields are marked *

Scroll to top