This notebook is an exercise in the SQL course. You can reference the tutorial at this link.
Introduction
介绍
Queries with GROUP BY can be powerful. There are many small things that can trip you up (like the order of the clauses), but it will start to feel natural once you've done it a few times. Here, you'll write queries using GROUP BY to answer questions from the Hacker News dataset.
使用 GROUP BY 进行的查询非常强大。 有很多小事情可能会让你陷入困境(比如子句的顺序),但一旦你做了几次,就会开始感觉很自然。 在这里,您将使用 GROUP BY 编写查询来回答 Hacker News 数据集中的问题。
Before you get started, run the following cell to set everything up:
在开始之前,请运行以下单元格来设置所有内容:
# Set up feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.sql.ex3 import *
print("Setup Complete")
Using Kaggle's public dataset BigQuery integration.
Setup Complete
The code cell below fetches the full
table from the hacker_news
dataset. We also preview the first five rows of the table.
下面的代码单元格从hacker_news
数据集中获取full
表。 我们还预览表的前五行。
from google.cloud import bigquery
# Create a "Client" object
client = bigquery.Client()
# Construct a reference to the "hacker_news" dataset
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")
# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)
# Construct a reference to the "full" table
table_ref = dataset_ref.table("full")
# API request - fetch the table
table = client.get_table(table_ref)
# Preview the first five lines of the table
client.list_rows(table, max_results=5).to_dataframe()
Using Kaggle's public dataset BigQuery integration.
title | url | text | dead | by | score | time | timestamp | type | id | parent | descendants | ranking | deleted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | None | None | If the crocodile looked him up on Google, we b... | None | raxxorrax | NaN | 1633421535 | 2021-10-05 08:12:15+00:00 | comment | 28756662 | 28750122 | NaN | NaN | None |
1 | None | None | What exactly are you looking for? I think Pyto... | None | abiro | NaN | 1569141387 | 2019-09-22 08:36:27+00:00 | comment | 21040311 | 21040141 | NaN | NaN | None |
2 | None | None | Ironically, this very project might help out w... | None | mjevans | NaN | 1505769703 | 2017-09-18 21:21:43+00:00 | comment | 15279716 | 15276626 | NaN | NaN | None |
3 | None | None | As you start to gain some experience it can be... | None | every_other | NaN | 1538575027 | 2018-10-03 13:57:07+00:00 | comment | 18130207 | 18128477 | NaN | NaN | None |
4 | None | None | That’s what I was referring to, yes. I heard o... | None | manmal | NaN | 1615664155 | 2021-03-13 19:35:55+00:00 | comment | 26449260 | 26449237 | NaN | NaN | None |
Exercises
练习
1) Prolific commenters
1) 多产的评论者
Hacker News would like to send awards to everyone who has written more than 10,000 posts. Write a query that returns all authors with more than 10,000 posts as well as their post counts. Call the column with post counts NumPosts
.
Hacker News 希望向所有发表超过 10,000 篇帖子的人颁发奖项。 编写一个查询,返回所有帖子超过 10,000 篇的作者及其帖子计数。 将包含帖子计数的列称为NumPosts
。
In case sample query is helpful, here is a query you saw in the tutorial to answer a similar question:
如果示例查询有帮助,以下是您在教程中看到的用于回答类似问题的查询:
query = """
SELECT parent, COUNT(1) AS NumPosts
FROM `bigquery-public-data.hacker_news.full`
GROUP BY parent
HAVING COUNT(1) > 10
"""
# Query to select prolific commenters and post counts
# prolific_commenters_query = """____""" # Your code goes here
prolific_commenters_query = """
SELECT `by` as author, COUNT(1) AS NumPosts
FROM `bigquery-public-data.hacker_news.full`
GROUP BY `by`
HAVING COUNT(1) > 10000
ORDER BY COUNT(1) DESC
"""
# Set up the query (cancel the query if it would use too much of
# your quota, with the limit set to 1 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(prolific_commenters_query, job_config=safe_config)
# API request - run the query, and return a pandas DataFrame
prolific_commenters = query_job.to_dataframe()
# View top few rows of results
print(prolific_commenters.head())
# Check your answer
q_1.check()
author NumPosts
0 None 883263
1 dang 52083
2 tptacek 51845
3 jacquesm 45023
4 pjmlp 43943
Correct
For the solution, uncomment the line below.
对于解决方案,请取消注释下面的行。
q_1.solution()
Solution:
prolific_commenters_query = """
SELECT `by` AS author, COUNT(1) AS NumPosts
FROM `bigquery-public-data.hacker_news.full`
GROUP BY author
HAVING COUNT(1) > 10000
"""
2) Deleted comments
2) 删除评论
How many comments have been deleted? (If a comment was deleted, the deleted
column in the table will have the value True
.)
有多少评论被删除了? (如果评论被删除,表中的已删除
列将具有值True
。)
# Write your query here and figure out the answer
num_deleted_query = """
SELECT COUNT(1) AS NumPosts
FROM `bigquery-public-data.hacker_news.full`
WHERE deleted = True
"""
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(num_deleted_query, job_config=safe_config)
# API request - run the query, and return a pandas DataFrame
num_deleted = query_job.to_dataframe()
# View top few rows of results
print(num_deleted)
NumPosts
0 859401
# num_deleted_posts = ____ # Put your answer here
num_deleted_posts = num_deleted['NumPosts'][0]
# Check your answer
q_2.check()
For the solution, uncomment the line below.
对于解决方案,请取消注释下面的行。
q_2.solution()
Keep Going
继续前进
Click here to move on and learn about the ORDER BY clause.
点击此处 继续了解 ORDER BY 子句。