03.exercise-group-by-having-count [Exercise: GROUP BY, HAVING, and COUNT]

This notebook is an exercise in the SQL course. You can reference the tutorial at this link.


Introduction

Queries with GROUP BY can be powerful. There are many small things that can trip you up (like the order of the clauses), but it will start to feel natural once you've done it a few times. Here, you'll write queries using GROUP BY to answer questions from the Hacker News dataset.
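
As a quick reference for the clause order mentioned above, the sketch below shows the shape a typical query takes; the table and column names here are placeholders for illustration, not part of this exercise:

example_query = """
                SELECT column_a, COUNT(1) AS num_rows
                FROM `project.dataset.table`
                WHERE some_condition
                GROUP BY column_a
                HAVING COUNT(1) > 10
                ORDER BY num_rows DESC
                """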

Before you get started, run the following cell to set everything up:

# Set up feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.sql.ex3 import *
print("Setup Complete")
Using Kaggle's public dataset BigQuery integration.
Setup Complete

The code cell below fetches the full table from the hacker_news dataset. We also preview the first five rows of the table.

from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "hacker_news" dataset
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "full" table
table_ref = dataset_ref.table("full")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the table
client.list_rows(table, max_results=5).to_dataframe()
Using Kaggle's public dataset BigQuery integration.
title url text dead by score time timestamp type id parent descendants ranking deleted
0 None None If the crocodile looked him up on Google, we b... None raxxorrax NaN 1633421535 2021-10-05 08:12:15+00:00 comment 28756662 28750122 NaN NaN None
1 None None What exactly are you looking for? I think Pyto... None abiro NaN 1569141387 2019-09-22 08:36:27+00:00 comment 21040311 21040141 NaN NaN None
2 None None Ironically, this very project might help out w... None mjevans NaN 1505769703 2017-09-18 21:21:43+00:00 comment 15279716 15276626 NaN NaN None
3 None None As you start to gain some experience it can be... None every_other NaN 1538575027 2018-10-03 13:57:07+00:00 comment 18130207 18128477 NaN NaN None
4 None None That’s what I was referring to, yes. I heard o... None manmal NaN 1615664155 2021-03-13 19:35:55+00:00 comment 26449260 26449237 NaN NaN None
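
If you just want the column names and types rather than sample rows, the table object fetched above also carries the schema; a minimal sketch using the same client and table:

# Print each column's name and BigQuery type from the table schema
for field in table.schema:
    print(field.name, field.field_type)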

Exercises

1) Prolific commenters

Hacker News would like to send awards to everyone who has written more than 10,000 posts. Write a query that returns all authors with more than 10,000 posts as well as their post counts. Call the column with post counts NumPosts.

In case a sample query is helpful, here is a query you saw in the tutorial that answers a similar question:

query = """
        SELECT parent, COUNT(1) AS NumPosts
        FROM `bigquery-public-data.hacker_news.full`
        GROUP BY parent
        HAVING COUNT(1) > 10
        """
# Query to select prolific commenters and post counts
# prolific_commenters_query = """____""" # Your code goes here

prolific_commenters_query = """
                            SELECT `by` as author, COUNT(1) AS NumPosts
                            FROM `bigquery-public-data.hacker_news.full`
                            GROUP BY `by`
                            HAVING COUNT(1) > 10000
                            ORDER BY COUNT(1) DESC
                            """

# Set up the query (cancel the query if it would use too much of
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(prolific_commenters_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
prolific_commenters = query_job.to_dataframe()

# View top few rows of results
print(prolific_commenters.head())

# Check your answer
q_1.check()
     author  NumPosts
0      None    883263
1      dang     52083
2   tptacek     51845
3  jacquesm     45023
4     pjmlp     43943

Correct
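
If you are watching your quota, BigQuery can also estimate how much data a query would scan without actually running it; a minimal dry-run sketch against the query defined above (a dry run bills no bytes):

# Estimate the bytes this query would process, without executing it
dry_run_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
dry_run_job = client.query(prolific_commenters_query, job_config=dry_run_config)
print("This query will process {:.2f} GB.".format(dry_run_job.total_bytes_processed / 10**9))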

For the solution, uncomment the line below.

q_1.solution()

Solution:


prolific_commenters_query = """
                            SELECT `by` AS author, COUNT(1) AS NumPosts
                            FROM `bigquery-public-data.hacker_news.full`
                            GROUP BY author
                            HAVING COUNT(1) > 10000
                            """

2) Deleted comments

How many comments have been deleted? (If a comment was deleted, the deleted column in the table will have the value True.)

# Write your query here and figure out the answer
num_deleted_query = """
                    SELECT COUNT(1) AS NumPosts
                    FROM `bigquery-public-data.hacker_news.full`
                    WHERE deleted = True
                    """

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(num_deleted_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
num_deleted = query_job.to_dataframe()

# View top few rows of results
print(num_deleted)
   NumPosts
0    859401
# num_deleted_posts = ____ # Put your answer here
num_deleted_posts = num_deleted['NumPosts'][0]

# Check your answer
q_2.check()
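
As a variation, you could count deleted and non-deleted posts in a single query by grouping on the deleted column itself; this is only an alternative sketch, not the answer checked above:

# Count posts for each value of the deleted column (True vs. NULL)
deleted_breakdown_query = """
                          SELECT deleted, COUNT(1) AS NumPosts
                          FROM `bigquery-public-data.hacker_news.full`
                          GROUP BY deleted
                          """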

For the solution, uncomment the line below.

q_2.solution()

Keep Going

Click here to move on and learn about the ORDER BY clause.
