This notebook is an exercise in the SQL course. You can reference the tutorial at this link.

Introduction

介绍

Queries with GROUP BY can be powerful. There are many small things that can trip you up (like the order of the clauses), but it will start to feel natural once you've done it a few times. Here, you'll write queries using GROUP BY to answer questions from the Hacker News dataset.

使用 GROUP BY 进行的查询非常强大。有很多小事情可能会让你陷入困境（比如子句的顺序），但一旦你做了几次，就会开始感觉很自然。在这里，您将使用 GROUP BY 编写查询来回答 Hacker News 数据集中的问题。

Before you get started, run the following cell to set everything up:

在开始之前，请运行以下单元格来设置所有内容：

# Set up feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.sql.ex3 import *
print("Setup Complete")

Using Kaggle's public dataset BigQuery integration.
Setup Complete

The code cell below fetches the full table from the hacker_news dataset. We also preview the first five rows of the table.

下面的代码单元格从hacker_news数据集中获取full表。我们还预览表的前五行。

from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "hacker_news" dataset
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "full" table
table_ref = dataset_ref.table("full")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the table
client.list_rows(table, max_results=5).to_dataframe()

Using Kaggle's public dataset BigQuery integration.

	title	url	text	dead	by	score	time	timestamp	type	id	parent	descendants	ranking	deleted
0	None	None	If the crocodile looked him up on Google, we b...	None	raxxorrax	NaN	1633421535	2021-10-05 08:12:15+00:00	comment	28756662	28750122	NaN	NaN	None
1	None	None	What exactly are you looking for? I think Pyto...	None	abiro	NaN	1569141387	2019-09-22 08:36:27+00:00	comment	21040311	21040141	NaN	NaN	None
2	None	None	Ironically, this very project might help out w...	None	mjevans	NaN	1505769703	2017-09-18 21:21:43+00:00	comment	15279716	15276626	NaN	NaN	None
3	None	None	As you start to gain some experience it can be...	None	every_other	NaN	1538575027	2018-10-03 13:57:07+00:00	comment	18130207	18128477	NaN	NaN	None
4	None	None	That’s what I was referring to, yes. I heard o...	None	manmal	NaN	1615664155	2021-03-13 19:35:55+00:00	comment	26449260	26449237	NaN	NaN	None

Exercises

练习

1) Prolific commenters

1) 多产的评论者

Hacker News would like to send awards to everyone who has written more than 10,000 posts. Write a query that returns all authors with more than 10,000 posts as well as their post counts. Call the column with post counts NumPosts.

Hacker News 希望向所有发表超过 10,000 篇帖子的人颁发奖项。编写一个查询，返回所有帖子超过 10,000 篇的作者及其帖子计数。将包含帖子计数的列称为NumPosts。

In case sample query is helpful, here is a query you saw in the tutorial to answer a similar question:

如果示例查询有帮助，以下是您在教程中看到的用于回答类似问题的查询：

query = """
        SELECT parent, COUNT(1) AS NumPosts
        FROM `bigquery-public-data.hacker_news.full`
        GROUP BY parent
        HAVING COUNT(1) > 10
        """

# Query to select prolific commenters and post counts
# prolific_commenters_query = """____""" # Your code goes here

prolific_commenters_query = """
                            SELECT `by` as author, COUNT(1) AS NumPosts
                            FROM `bigquery-public-data.hacker_news.full`
                            GROUP BY `by`
                            HAVING COUNT(1) > 10000
                            ORDER BY COUNT(1) DESC
                            """

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 1 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(prolific_commenters_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
prolific_commenters = query_job.to_dataframe()

# View top few rows of results
print(prolific_commenters.head())

# Check your answer
q_1.check()

     author  NumPosts
0      None    883263
1      dang     52083
2   tptacek     51845
3  jacquesm     45023
4     pjmlp     43943

Correct

For the solution, uncomment the line below.

对于解决方案，请取消注释下面的行。

q_1.solution()

Solution:


prolific_commenters_query = """
                            SELECT `by` AS author, COUNT(1) AS NumPosts
                            FROM `bigquery-public-data.hacker_news.full`
                            GROUP BY author
                            HAVING COUNT(1) > 10000
                            """

2) Deleted comments

2) 删除评论

How many comments have been deleted? (If a comment was deleted, the deleted column in the table will have the value True.)

有多少评论被删除了？（如果评论被删除，表中的已删除列将具有值True。）

# Write your query here and figure out the answer
num_deleted_query = """
                    SELECT COUNT(1) AS NumPosts
                    FROM `bigquery-public-data.hacker_news.full`
                    WHERE deleted = True
                    """

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(num_deleted_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
num_deleted = query_job.to_dataframe()

# View top few rows of results
print(num_deleted)

   NumPosts
0    859401

# num_deleted_posts = ____ # Put your answer here
num_deleted_posts = num_deleted['NumPosts'][0]

# Check your answer
q_2.check()

For the solution, uncomment the line below.

对于解决方案，请取消注释下面的行。

q_2.solution()

Keep Going

继续前进

Click here to move on and learn about the ORDER BY clause.

点击此处 继续了解 ORDER BY 子句。

03.exercise-group-by-having-count【练习：GROUP By、Having和COUNT】