Flashield's Blog

Just For My Daily Diary

06.exercise-joining-data [Exercise: Joining Data]

This notebook is an exercise in the SQL course. You can reference the tutorial at this link.


Introduction

Stack Overflow is a widely beloved question and answer site for technical questions. You'll probably use it yourself as you keep using SQL (or any programming language).

Their data is publicly available. What cool things do you think it would be useful for?

Here's one idea:

You could set up a service that identifies the Stack Overflow users who have demonstrated expertise with a specific technology by answering related questions about it, so someone could hire those experts for in-depth help.

In this exercise, you'll write the SQL queries that might serve as the foundation for this type of service.

As usual, run the following cell to set up our feedback system before moving on.

# Set up feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.sql.ex6 import *
print("Setup Complete")
Using Kaggle's public dataset BigQuery integration.
Setup Complete

Run the next cell to fetch the stackoverflow dataset.

from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "stackoverflow" dataset
dataset_ref = client.dataset("stackoverflow", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)
Using Kaggle's public dataset BigQuery integration.

Exercises

1) Explore the data

Before writing queries or JOIN clauses, you'll want to see what tables are available.

Hint: Tab completion is helpful whenever you can't remember a command. Type client. and then hit the tab key. Don't forget the period before hitting tab.

# Get a list of available tables 
tables = list(client.list_tables(dataset))
list_of_tables = [table.table_id for table in tables]  # Your code here

# Print your answer
print(list_of_tables)

# Check your answer
q_1.check()
['badges', 'comments', 'post_history', 'post_links', 'posts_answers', 'posts_moderator_nomination', 'posts_orphaned_tag_wiki', 'posts_privilege_wiki', 'posts_questions', 'posts_tag_wiki', 'posts_tag_wiki_excerpt', 'posts_wiki_placeholder', 'stackoverflow_posts', 'tags', 'users', 'votes']

Correct

For the solution, uncomment the line below.

q_1.solution()

Solution:


# Get a list of available tables 
tables = list(client.list_tables(dataset))
list_of_tables = [table.table_id for table in tables] 

2) Review relevant tables

If you are interested in people who answer questions on a given topic, the posts_answers table is a natural place to look. Run the following cell, and look at the output.

# Construct a reference to the "posts_answers" table
answers_table_ref = dataset_ref.table("posts_answers")

# API request - fetch the table
answers_table = client.get_table(answers_table_ref)

# Preview the first five lines of the "posts_answers" table
client.list_rows(answers_table, max_results=5).to_dataframe()
id title body accepted_answer_id answer_count comment_count community_owned_date creation_date favorite_count last_activity_date last_edit_date last_editor_display_name last_editor_user_id owner_display_name owner_user_id parent_id post_type_id score tags view_count
0 18 None <p>For a table like this:</p>\n\n<pre><code>CR... None None 2 NaT 2008-08-01 05:12:44.193000+00:00 None 2016-06-02 05:56:26.060000+00:00 2016-06-02 05:56:26.060000+00:00 Jeff Atwood 126039 phpguy NaN 17 2 59 None None
1 165 None <p>You can use a <a href="http://sharpdevelop.... None None 0 NaT 2008-08-01 18:04:25.023000+00:00 None 2019-04-06 14:03:51.080000+00:00 2019-04-06 14:03:51.080000+00:00 None 1721793 user2189331 NaN 145 2 10 None None
2 1028 None <p>The VB code looks something like this:</p>\... None None 0 NaT 2008-08-04 04:58:40.300000+00:00 None 2013-02-07 13:22:14.680000+00:00 2013-02-07 13:22:14.680000+00:00 None 395659 user2189331 NaN 947 2 8 None None
3 1073 None <p>My first choice would be a dedicated heap t... None None 0 NaT 2008-08-04 07:51:02.997000+00:00 None 2015-09-01 17:32:32.120000+00:00 2015-09-01 17:32:32.120000+00:00 None 45459 user2189331 NaN 1069 2 29 None None
4 1260 None <p>I found the answer. all you have to do is a... None None 0 NaT 2008-08-04 14:06:02.863000+00:00 None 2016-12-20 08:38:48.867000+00:00 2016-12-20 08:38:48.867000+00:00 None 1221571 Jin NaN 1229 2 1 None None

It isn't clear yet how to find users who answered questions on any given topic. But posts_answers has a parent_id column. If you are familiar with the Stack Overflow site, you might figure out that parent_id identifies the question that each post is answering.

Look at posts_questions using the cell below.

# Construct a reference to the "posts_questions" table
questions_table_ref = dataset_ref.table("posts_questions")

# API request - fetch the table
questions_table = client.get_table(questions_table_ref)

# Preview the first five lines of the "posts_questions" table
client.list_rows(questions_table, max_results=5).to_dataframe()
id title body accepted_answer_id answer_count comment_count community_owned_date creation_date favorite_count last_activity_date last_edit_date last_editor_display_name last_editor_user_id owner_display_name owner_user_id parent_id post_type_id score tags view_count
0 320268 Html.ActionLink doesn’t render # properly <p>When using Html.ActionLink passing a string... NaN 0 0 NaT 2008-11-26 10:42:37.477000+00:00 0 2009-02-06 20:13:54.370000+00:00 NaT None NaN Paulo NaN None 1 0 asp.net-mvc 390
1 324003 Primitive recursion <p>how will i define the function 'simplify' ... NaN 0 0 NaT 2008-11-27 15:12:37.497000+00:00 0 2012-09-25 19:54:40.597000+00:00 2012-09-25 19:54:40.597000+00:00 Marcin 1288.0 None 41000.0 None 1 0 haskell|lambda|functional-programming|lambda-c... 497
2 390605 While vs. Do While <p>I've seen both the blocks of code in use se... 390608.0 0 0 NaT 2008-12-24 01:49:54.230000+00:00 2 2008-12-24 03:08:55.897000+00:00 NaT None NaN Unkwntech 115.0 None 1 0 language-agnostic|loops 11262
3 413246 Protect ASP.NET Source code <p>Im currently doing some research in how to ... NaN 0 0 NaT 2009-01-05 14:23:51.040000+00:00 0 2009-03-24 21:30:22.370000+00:00 2009-01-05 14:42:28.257000+00:00 Tom Anderson 13502.0 Velnias NaN None 1 0 asp.net|deployment|obfuscation 4823
4 454921 Difference between "int[] myArray" and "int my... <blockquote>\n <p><strong>Possible Duplicate:... 454928.0 0 0 NaT 2009-01-18 10:22:52.177000+00:00 0 2009-01-18 10:30:50.930000+00:00 2017-05-23 11:49:26.567000+00:00 None -1.0 Evan Fosmark 49701.0 None 1 0 java|arrays 798

Are there any fields that identify what topic or technology each question is about? If so, how could you find the IDs of users who answered questions about a specific topic?

Think about it, and then check the solution by running the code in the next cell.

# Check your answer (Run this code cell to receive credit!)
q_2.solution()

Solution:
posts_questions has a column called tags which lists the topics/technologies each question is about.

posts_answers has a column called parent_id which identifies the ID of the question each answer is responding to.
posts_answers also has an owner_user_id column which specifies the ID of the user who answered the question.

You can join these two tables to:

  • determine the tags for each answer, and then
  • select the owner_user_id of the answers on the desired tag.

This is exactly what you will do over the next few questions.

3) Selecting the right questions

A lot of this data is text.

We'll explore one last technique in this course which you can apply to this text.

A WHERE clause can limit your results to rows containing certain text by using the LIKE operator. For example, to select just the third row of the pets table from the tutorial (a cat named 'Ripley'), we could match the name exactly with WHERE Name = 'Ripley'.

You can also use % as a "wildcard" for any number of characters. So you can also get the third row with:

query = """
        SELECT * 
        FROM `bigquery-public-data.pet_records.pets` 
        WHERE Name LIKE '%ipl%'
        """

Try this yourself. Write a query that selects the id, title and owner_user_id columns from the posts_questions table.

  • Restrict the results to rows that contain the word "bigquery" in the tags column.
  • Include rows where there is other text in addition to the word "bigquery" (e.g., if a row has a tag "bigquery-sql", your results should include that too).
# Your code here
questions_query = """
                  SELECT id, title, owner_user_id
                  FROM `bigquery-public-data.stackoverflow.posts_questions`
                  WHERE tags LIKE '%bigquery%'
                  """

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
questions_query_job = client.query(questions_query, job_config=safe_config) # Your code goes here
# API request - run the query, and return a pandas DataFrame
questions_results = questions_query_job.to_dataframe() # Your code goes here

# Preview results
print(questions_results.head())
         id                                              title  owner_user_id
0  64345717  Loop by array and union looped result in BigQuery     13304769.0
1  64610766  BigQuery Transfer jobs from S3 stuck pending o...     14549617.0
2  64383871  How to get sum of values in days intervals usi...     12472644.0
3  64251948                BigQuery get row above empty column      4572124.0
4  64323398  SQL: Remove part of string that is in another ...      6089137.0
# Check your answer
q_3.check()

Correct

For a hint or the solution, uncomment the appropriate line below.

q_3.hint()
q_3.solution()

Hint: Your WHERE clause should be WHERE tags LIKE '%bigquery%'.

Solution:


questions_query = """
                  SELECT id, title, owner_user_id
                  FROM `bigquery-public-data.stackoverflow.posts_questions`
                  WHERE tags LIKE '%bigquery%'
                  """

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
questions_query_job = client.query(questions_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
questions_results = questions_query_job.to_dataframe()

4) Your first join

Now that you have a query to select questions on any given topic (in this case, you chose "bigquery"), you can find the answers to those questions with a JOIN.

Write a query that returns the id, body and owner_user_id columns from the posts_answers table for answers to "bigquery"-related questions.

  • You should have one row in your results for each answer to a question that has "bigquery" in the tags.
  • Remember you can get the tags for a question from the tags column in the posts_questions table.

Here's a reminder of what a JOIN looked like in the tutorial:

query = """
        SELECT p.Name AS Pet_Name, o.Name AS Owner_Name
        FROM `bigquery-public-data.pet_records.pets` as p
        INNER JOIN `bigquery-public-data.pet_records.owners` as o 
            ON p.ID = o.Pet_ID
        """

It may be useful to scroll up and review the first several rows of the posts_answers and posts_questions tables.

answers_query = """
                SELECT a.id, a.body, a.owner_user_id
                FROM `bigquery-public-data.stackoverflow.posts_questions` AS q 
                INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                    ON q.id = a.parent_id
                WHERE q.tags LIKE '%bigquery%'
                """
# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
answers_query_job = client.query(answers_query, job_config=safe_config)
# API request - run the query, and return a pandas DataFrame
answers_results = answers_query_job.to_dataframe()
# Preview results
print(answers_results.head())
         id                                               body  owner_user_id
0  57032546  Another solution - same logic, just a diffe...      1391685.0
1  57032790  Recommendation :\n\nselect\...                     10369599.0
2  57036857  Basically, this is the answer\n\n<...              11506172.0
3  57042242  first, I hope that year+week & year+day...           2929192.0
4  57049923  I think the solutions mentioned by others s...       1391685.0

# Check your answer
q_4.check()

Correct

For a hint or the solution, uncomment the appropriate line below.

# q_4.hint()
q_4.solution()

Solution:


answers_query = """
                SELECT a.id, a.body, a.owner_user_id
                FROM `bigquery-public-data.stackoverflow.posts_questions` AS q 
                INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                    ON q.id = a.parent_id
                WHERE q.tags LIKE '%bigquery%'
                """

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 270 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=27*10**10)
answers_query_job = client.query(answers_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
answers_results = answers_query_job.to_dataframe()

5) Answer the question

You have the merge you need. But you want a list of users who have answered many questions... which requires more work beyond your previous result.

Write a new query that has a single row for each user who answered at least one question with a tag that includes the string "bigquery". Your results should have two columns:

  • user_id - contains the owner_user_id column from the posts_answers table
  • number_of_answers - contains the number of answers the user has written to "bigquery"-related questions
# Your code here
bigquery_experts_query = """
                            SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
                            FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
                            INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                                ON q.id = a.parent_id
                            WHERE q.tags LIKE '%bigquery%'
                            GROUP BY a.owner_user_id
                            ORDER BY number_of_answers DESC
                            """

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
bigquery_experts_query_job = client.query(bigquery_experts_query, job_config=safe_config) # Your code goes here
# API request - run the query, and return a pandas DataFrame
bigquery_experts_results = bigquery_experts_query_job.to_dataframe() # Your code goes here

# Preview results
print(bigquery_experts_results.head())
     user_id  number_of_answers
0  5221944.0               5203
1  1144035.0               1634
2   132438.0                898
3  6253347.0                737
4  1366527.0                620
# Check your answer
q_5.check()

Correct

For a hint or the solution, uncomment the appropriate line below.

#q_5.hint()
q_5.solution()

Solution:


bigquery_experts_query = """
                         SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
                         FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
                         INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                             ON q.id = a.parent_id
                         WHERE q.tags LIKE '%bigquery%'
                         GROUP BY a.owner_user_id
                         """

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
bigquery_experts_query_job = client.query(bigquery_experts_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
bigquery_experts_results = bigquery_experts_query_job.to_dataframe()
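The join-and-aggregate shape of this solution can be sanity-checked locally. The sketch below mirrors the query with Python's built-in sqlite3; the column names follow the Stack Overflow schema, but the rows are invented for illustration, and an ORDER BY is added so the output is deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts_questions (id INTEGER, tags TEXT)")
conn.execute("CREATE TABLE posts_answers (id INTEGER, parent_id INTEGER, owner_user_id INTEGER)")

# Two bigquery-tagged questions and one unrelated question
conn.executemany("INSERT INTO posts_questions VALUES (?, ?)",
                 [(1, "sql|bigquery"), (2, "bigquery-sql"), (3, "python")])
# User 10 answers both bigquery questions; user 20 answers one;
# user 30 answers only the python question
conn.executemany("INSERT INTO posts_answers VALUES (?, ?, ?)",
                 [(100, 1, 10), (101, 2, 10), (102, 1, 20), (103, 3, 30)])

rows = conn.execute("""
    SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
    FROM posts_questions AS q
    INNER JOIN posts_answers AS a ON q.id = a.parent_id
    WHERE q.tags LIKE '%bigquery%'
    GROUP BY a.owner_user_id
    ORDER BY number_of_answers DESC
""").fetchall()
print(rows)  # [(10, 2), (20, 1)]
```

User 30 drops out because the WHERE clause filters the joined rows before the GROUP BY counts them.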

6) Building a more generally useful service

How could you convert what you've done to a general function a website could call on the backend to get experts on any topic?

Think about it and then check the solution below.

# Check your answer (Run this code cell to receive credit!)
q_6.solution()

Solution:


def expert_finder(topic, client):
    '''
    Returns a DataFrame with the user IDs who have written Stack Overflow answers on a topic.

    Inputs:
        topic: A string with the topic of interest
        client: A Client object that specifies the connection to the Stack Overflow dataset

    Outputs:
        results: A DataFrame with columns for user_id and number_of_answers. Follows similar logic to bigquery_experts_results shown above.
    '''
    my_query = """
               SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
               FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
               INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                   ON q.id = a.parent_Id
               WHERE q.tags like '%{topic}%'
               GROUP BY a.owner_user_id
               """

    # Set up the query (a real service would have good error handling for 
    # queries that scan too much data)
    safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)      
    my_query_job = client.query(my_query, job_config=safe_config)

    # API request - run the query, and return a pandas DataFrame
    results = my_query_job.to_dataframe()

    return results
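One caveat: formatting topic directly into the SQL text, as this function sketches, is vulnerable to SQL injection if the topic ever comes from end users. A safer pattern is to bind it as a query parameter. Here is a minimal local sketch with Python's sqlite3 and its ? placeholders (BigQuery supports the same idea through its own query parameters):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts_questions (id INTEGER, tags TEXT)")
conn.execute("INSERT INTO posts_questions VALUES (1, 'sql|bigquery')")

topic = "bigquery"
# Bind the topic as a parameter instead of formatting it into the SQL text;
# '||' concatenates the wildcards around the bound value
rows = conn.execute(
    "SELECT id FROM posts_questions WHERE tags LIKE '%' || ? || '%'",
    (topic,),
).fetchall()
print(rows)  # [(1,)]
```

The database driver escapes the bound value, so a malicious topic string cannot change the structure of the query.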

Congratulations!

You know all the key components to use BigQuery and SQL effectively. Your SQL skills are sufficient to unlock many of the world's largest datasets.

Want to go play with your new powers? Kaggle has BigQuery datasets available here.
