This notebook is an exercise in the SQL course. You can reference the tutorial at this link.
Introduction
介绍
Stack Overflow is a widely beloved question and answer site for technical questions. You'll probably use it yourself as you keep using SQL (or any programming language).
Stack Overflow 是一个广受欢迎的技术问题问答网站。 当您继续使用 SQL(或任何编程语言)时,您可能会使用它。
Their data is publicly available. What cool things do you think it would be useful for?
他们的数据是公开的。 您认为它可以用于哪些很酷的事情?
Here's one idea:
有一个想法:
You could set up a service that identifies the Stack Overflow users who have demonstrated expertise with a specific technology by answering related questions about it, so someone could hire those experts for in-depth help.
您可以设置一项服务,通过回答相关问题来识别具有特定技术专业知识的 Stack Overflow 用户,这样有人就可以聘请这些专家来提供深入的帮助。
In this exercise, you'll write the SQL queries that might serve as the foundation for this type of service.
在本练习中,您将编写可作为此类服务基础的 SQL 查询。
As usual, run the following cell to set up our feedback system before moving on.
像往常一样,在继续之前运行以下单元格来设置我们的反馈系统。
# Set up feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.sql.ex6 import *
print("Setup Complete")
Using Kaggle's public dataset BigQuery integration.
Setup Complete
Run the next cell to fetch the stackoverflow
dataset.
运行下一个单元格以获取stackoverflow
数据集。
from google.cloud import bigquery
# Create a "Client" object
client = bigquery.Client()
# Construct a reference to the "stackoverflow" dataset
dataset_ref = client.dataset("stackoverflow", project="bigquery-public-data")
# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)
Using Kaggle's public dataset BigQuery integration.
Exercises
练习
1) Explore the data
1) 探索数据
Before writing queries or JOIN clauses, you'll want to see what tables are available.
在编写查询或 JOIN 子句之前,您需要查看哪些表可用。
Hint: Tab completion is helpful whenever you can't remember a command. Type client.
and then hit the tab key. Don't forget the period before hitting tab.
提示:每当您忘记命令时,制表符补全都会很有帮助。 输入client.
,然后按 Tab 键。 不要忘记在点击 Tab 之前需要加上句点。
tables = list(client.list_tables(dataset))
# Get a list of available tables
list_of_tables = [table.table_id for table in tables] # Your code here
# Print your answer
print(list_of_tables)
# Check your answer
q_1.check()
['badges', 'comments', 'post_history', 'post_links', 'posts_answers', 'posts_moderator_nomination', 'posts_orphaned_tag_wiki', 'posts_privilege_wiki', 'posts_questions', 'posts_tag_wiki', 'posts_tag_wiki_excerpt', 'posts_wiki_placeholder', 'stackoverflow_posts', 'tags', 'users', 'votes']
Correct
For the solution, uncomment the line below.
对于解决方案,请取消注释下面的行。
q_1.solution()
Solution:
# Get a list of available tables
tables = list(client.list_tables(dataset))
list_of_tables = [table.table_id for table in tables]
2) Review relevant tables
2) 查看相关表格
If you are interested in people who answer questions on a given topic, the posts_answers
table is a natural place to look. Run the following cell, and look at the output.
如果您对回答特定主题问题的人感兴趣,那么posts_answers
表是一个自然的查找位置。 运行以下单元格,然后查看输出。
# Construct a reference to the "posts_answers" table
answers_table_ref = dataset_ref.table("posts_answers")
# API request - fetch the table
answers_table = client.get_table(answers_table_ref)
# Preview the first five lines of the "posts_answers" table
client.list_rows(answers_table, max_results=5).to_dataframe()
id | title | body | accepted_answer_id | answer_count | comment_count | community_owned_date | creation_date | favorite_count | last_activity_date | last_edit_date | last_editor_display_name | last_editor_user_id | owner_display_name | owner_user_id | parent_id | post_type_id | score | tags | view_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 18 | None | <p>For a table like this:</p>\n\n<pre><code>CR... | None | None | 2 | NaT | 2008-08-01 05:12:44.193000+00:00 | None | 2016-06-02 05:56:26.060000+00:00 | 2016-06-02 05:56:26.060000+00:00 | Jeff Atwood | 126039 | phpguy | NaN | 17 | 2 | 59 | None | None |
1 | 165 | None | <p>You can use a <a href="http://sharpdevelop.... | None | None | 0 | NaT | 2008-08-01 18:04:25.023000+00:00 | None | 2019-04-06 14:03:51.080000+00:00 | 2019-04-06 14:03:51.080000+00:00 | None | 1721793 | user2189331 | NaN | 145 | 2 | 10 | None | None |
2 | 1028 | None | <p>The VB code looks something like this:</p>\... | None | None | 0 | NaT | 2008-08-04 04:58:40.300000+00:00 | None | 2013-02-07 13:22:14.680000+00:00 | 2013-02-07 13:22:14.680000+00:00 | None | 395659 | user2189331 | NaN | 947 | 2 | 8 | None | None |
3 | 1073 | None | <p>My first choice would be a dedicated heap t... | None | None | 0 | NaT | 2008-08-04 07:51:02.997000+00:00 | None | 2015-09-01 17:32:32.120000+00:00 | 2015-09-01 17:32:32.120000+00:00 | None | 45459 | user2189331 | NaN | 1069 | 2 | 29 | None | None |
4 | 1260 | None | <p>I found the answer. all you have to do is a... | None | None | 0 | NaT | 2008-08-04 14:06:02.863000+00:00 | None | 2016-12-20 08:38:48.867000+00:00 | 2016-12-20 08:38:48.867000+00:00 | None | 1221571 | Jin | NaN | 1229 | 2 | 1 | None | None |
It isn't clear yet how to find users who answered questions on any given topic. But posts_answers
has a parent_id
column. If you are familiar with the Stack Overflow site, you might figure out that the parent_id
is the question each post is answering.
目前尚不清楚如何找到回答过任何给定主题问题的用户。 但是 posts_answers
有一个 parent_id
列。 如果您熟悉 Stack Overflow 站点,您可能会发现parent_id
是每个帖子正在回答的问题。
Look at posts_questions
using the cell below.
使用下面的单元格查看posts_questions
。
# Construct a reference to the "posts_questions" table
questions_table_ref = dataset_ref.table("posts_questions")
# API request - fetch the table
questions_table = client.get_table(questions_table_ref)
# Preview the first five lines of the "posts_questions" table
client.list_rows(questions_table, max_results=5).to_dataframe()
id | title | body | accepted_answer_id | answer_count | comment_count | community_owned_date | creation_date | favorite_count | last_activity_date | last_edit_date | last_editor_display_name | last_editor_user_id | owner_display_name | owner_user_id | parent_id | post_type_id | score | tags | view_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 320268 | Html.ActionLink doesn’t render # properly | <p>When using Html.ActionLink passing a string... | NaN | 0 | 0 | NaT | 2008-11-26 10:42:37.477000+00:00 | 0 | 2009-02-06 20:13:54.370000+00:00 | NaT | None | NaN | Paulo | NaN | None | 1 | 0 | asp.net-mvc | 390 |
1 | 324003 | Primitive recursion | <p>how will i define the function 'simplify' ... | NaN | 0 | 0 | NaT | 2008-11-27 15:12:37.497000+00:00 | 0 | 2012-09-25 19:54:40.597000+00:00 | 2012-09-25 19:54:40.597000+00:00 | Marcin | 1288.0 | None | 41000.0 | None | 1 | 0 | haskell|lambda|functional-programming|lambda-c... | 497 |
2 | 390605 | While vs. Do While | <p>I've seen both the blocks of code in use se... | 390608.0 | 0 | 0 | NaT | 2008-12-24 01:49:54.230000+00:00 | 2 | 2008-12-24 03:08:55.897000+00:00 | NaT | None | NaN | Unkwntech | 115.0 | None | 1 | 0 | language-agnostic|loops | 11262 |
3 | 413246 | Protect ASP.NET Source code | <p>Im currently doing some research in how to ... | NaN | 0 | 0 | NaT | 2009-01-05 14:23:51.040000+00:00 | 0 | 2009-03-24 21:30:22.370000+00:00 | 2009-01-05 14:42:28.257000+00:00 | Tom Anderson | 13502.0 | Velnias | NaN | None | 1 | 0 | asp.net|deployment|obfuscation | 4823 |
4 | 454921 | Difference between "int[] myArray" and "int my... | <blockquote>\n <p><strong>Possible Duplicate:... | 454928.0 | 0 | 0 | NaT | 2009-01-18 10:22:52.177000+00:00 | 0 | 2009-01-18 10:30:50.930000+00:00 | 2017-05-23 11:49:26.567000+00:00 | None | -1.0 | Evan Fosmark | 49701.0 | None | 1 | 0 | java|arrays | 798 |
Are there any fields that identify what topic or technology each question is about? If so, how could you find the IDs of users who answered questions about a specific topic?
是否有任何字段可以确定每个问题所涉及的主题或技术? 如果是这样,您如何找到回答特定主题问题的用户的 ID?
Think about it, and then check the solution by running the code in the next cell.
想一想,然后通过运行下一个单元格中的代码来检查解决方案。
# Check your answer (Run this code cell to receive credit!)
q_2.solution()
Solution:
posts_questions
has a column called tags
which lists the topics/technologies each question is about.
posts_answers
has a column called parent_id
which identifies the ID of the question each answer is responding to.
posts_answers
also has an owner_user_id
column which specifies the ID of the user who answered the question.
You can join these two tables to:
- determine the
tags
for each answer, and then - select the
owner_user_id
of the answers on the desired tag.
This is exactly what you will do over the next few questions.
3) Selecting the right questions
3) 选择正确的问题
A lot of this data is text.
这些数据很多都是文本。
We'll explore one last technique in this course which you can apply to this text.
我们将在本课程中探讨最后一项技术,您可以将其应用于本文。
A WHERE clause can limit your results to rows with certain text using the LIKE feature. For example, to select just the third row of the pets
table from the tutorial, we could use the query in the picture below.
WHERE 子句可以使用 LIKE 功能将结果限制为包含某些文本的行。 例如,要从教程中仅选择pets
表的第三行,我们可以使用下图中的查询。
You can also use %
as a "wildcard" for any number of characters. So you can also get the third row with:
您还可以使用%
作为任意数量字符的通配符
。 所以你还可以通过以下方式获得第三行:
query = """
SELECT *
FROM `bigquery-public-data.pet_records.pets`
WHERE Name LIKE '%ipl%'
"""
Try this yourself. Write a query that selects the id
, title
and owner_user_id
columns from the posts_questions
table.
你自己试试。 编写一个查询,从posts_questions
表中选择id
、title
和owner_user_id
列。
- Restrict the results to rows that contain the word "bigquery" in the
tags
column. - 将结果限制为
tags
列中包含单词bigquery
的行。 - Include rows where there is other text in addition to the word "bigquery" (e.g., if a row has a tag "bigquery-sql", your results should include that too).
- 包括除
bigquery
一词之外还有其他文本的行(例如,如果一行具有标签bigquery-sql
,则您的结果也应包括该标签)。
# Your code here
questions_query = """
SELECT id, title, owner_user_id
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE tags like '%bigquery%'
"""
# Set up the query (cancel the query if it would use too much of
# your quota, with the limit set to 1 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
questions_query_job = client.query(questions_query, safe_config) # Your code goes here
# API request - run the query, and return a pandas DataFrame
questions_results = questions_query_job.to_dataframe() # Your code goes here
# Preview results
print(questions_results.head())
id title owner_user_id
0 64345717 Loop by array and union looped result in BigQuery 13304769.0
1 64610766 BigQuery Transfer jobs from S3 stuck pending o... 14549617.0
2 64383871 How to get sum of values in days intervals usi... 12472644.0
3 64251948 BigQuery get row above empty column 4572124.0
4 64323398 SQL: Remove part of string that is in another ... 6089137.0
# Check your answer
q_3.check()
Correct
For a hint or the solution, uncomment the appropriate line below.
q_3.hint()
q_3.solution()
Hint: Your WHERE clause should be WHERE tags LIKE '%bigquery%'
.
Solution:
questions_query = """
SELECT id, title, owner_user_id
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE tags LIKE '%bigquery%'
"""
# Set up the query (cancel the query if it would use too much of
# your quota, with the limit set to 1 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
questions_query_job = client.query(questions_query, job_config=safe_config)
# API request - run the query, and return a pandas DataFrame
questions_results = questions_query_job.to_dataframe()
4) Your first join
4) 你的第一个JOIN
Now that you have a query to select questions on any given topic (in this case, you chose "bigquery"), you can find the answers to those questions with a JOIN.
现在您有一个查询来选择任何给定主题的问题(在本例中,您选择了bigquery
),您可以使用 JOIN 找到这些问题的答案。
Write a query that returns the id
, body
and owner_user_id
columns from the posts_answers
table for answers to "bigquery"-related questions.
编写一个查询,返回posts_answers
表中的id
、body
和owner_user_id
列,以获取bigquery
相关问题的答案。
- You should have one row in your results for each answer to a question that has "bigquery" in the tags.
- 对于标签中包含
bigquery
的问题,您的结果中应该有一行。 - Remember you can get the tags for a question from the
tags
column in theposts_questions
table. - 请记住,您可以从
posts_questions
表中的tags
列获取问题的标签。
Here's a reminder of what a JOIN looked like in the tutorial:
下面提醒一下本教程中 JOIN 的样子:
query = """
SELECT p.Name AS Pet_Name, o.Name AS Owner_Name
FROM `bigquery-public-data.pet_records.pets` as p
INNER JOIN `bigquery-public-data.pet_records.owners` as o
ON p.ID = o.Pet_ID
"""
It may be useful to scroll up and review the first several rows of the posts_answers
and posts_questions
tables.
向上滚动并查看posts_answers
和posts_questions
表的前几行可能会很有用。
answers_query = """
SELECT a.id, a.body, a.owner_user_id
FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
ON q.id = a.parent_id
WHERE q.tags LIKE '%bigquery%'
"""
# Set up the query (cancel the query if it would use too much of
# your quota, with the limit set to 1 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
answers_query_job = client.query(answers_query, job_config=safe_config)
# API request - run the query, and return a pandas DataFrame
answers_results = answers_query_job.to_dataframe()
# Preview results
print(answers_results.head())
id body owner_user_id
0 57032546 Another solution - same logic, just a diffe... 1391685.0
1 57032790
Recommendation :
\n\nselect\... 10369599.0
2 57036857 Basically, this is the answer
\n\n<... 11506172.0
3 57042242 first, I hope that year+week & year+day... 2929192.0
4 57049923
I think the solutions mentioned by others s... 1391685.0
# Check your answer
q_4.check()
/opt/conda/lib/python3.10/site-packages/learntools/sql/ex6.py:129: FutureWarning: Calling int on a single element Series is deprecated and will raise a TypeError in the future. Use int(ser.iloc[0]) instead
correct_owner_user_id = int(answers_answer[answers_answer["id"] == first_id]["owner_user_id"])
/opt/conda/lib/python3.10/site-packages/learntools/sql/ex6.py:130: FutureWarning: Calling int on a single element Series is deprecated and will raise a TypeError in the future. Use int(ser.iloc[0]) instead
submitted_owner_user_id = int(results[results["id"] == first_id]["owner_user_id"])
Correct
For a hint or the solution, uncomment the appropriate line below.
如需提示或解决方案,请取消注释下面相应的行。
# q_4.hint()
q_4.solution()
Solution:
answers_query = """
SELECT a.id, a.body, a.owner_user_id
FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
ON q.id = a.parent_id
WHERE q.tags LIKE '%bigquery%'
"""
# Set up the query (cancel the query if it would use too much of
# your quota, with the limit set to 27 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=27*10**10)
answers_query_job = client.query(answers_query, job_config=safe_config)
# API request - run the query, and return a pandas DataFrame
answers_results = answers_query_job.to_dataframe()
5) Answer the question
5) 回答问题
You have the merge you need. But you want a list of users who have answered many questions... which requires more work beyond your previous result.
您拥有所需的合并。 但是您想要一个回答了许多问题的用户列表...这需要比您之前的结果更多的工作。
Write a new query that has a single row for each user who answered at least one question with a tag that includes the string "bigquery". Your results should have two columns:
编写一个新查询,该查询对于每个回答至少一个问题(该问题的标签包含字符串bigquery
)的用户。 您的结果应该有两列:
user_id
- contains the owner_user_id
column from the posts_answers
table
user_id
- 包含 posts_answers
表中的 owner_user_id
列
number_of_answers
- contains the number of answers the user has written to "bigquery"-related questions
number_of_answers
- 包含用户写入bigquery
相关问题的答案数量
# Your code here
bigquery_experts_query = """
SELECT a.owner_user_id as user_id, count(1) as number_of_answers
FROM `bigquery-public-data.stackoverflow.posts_questions` as q
INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` as a
ON q.id = a.parent_id
WHERE q.tags like '%bigquery%'
group by a.owner_user_id
order by number_of_answers desc
"""
# Set up the query
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
bigquery_experts_query_job = client.query(bigquery_experts_query,safe_config) # Your code goes here
# API request - run the query, and return a pandas DataFrame
bigquery_experts_results = bigquery_experts_query_job.to_dataframe() # Your code goes here
# Preview results
print(bigquery_experts_results.head())
user_id number_of_answers
0 5221944.0 5203
1 1144035.0 1634
2 132438.0 898
3 6253347.0 737
4 1366527.0 620
# Check your answer
q_5.check()
/opt/conda/lib/python3.10/site-packages/learntools/sql/ex6.py:199: FutureWarning: Calling int on a single element Series is deprecated and will raise a TypeError in the future. Use int(ser.iloc[0]) instead
correct_num = int(bigquery_experts_answer[bigquery_experts_answer["user_id"] == first_id]["number_of_answers"])
/opt/conda/lib/python3.10/site-packages/learntools/sql/ex6.py:200: FutureWarning: Calling int on a single element Series is deprecated and will raise a TypeError in the future. Use int(ser.iloc[0]) instead
submitted_num = int(results[results["user_id"] == first_id]["number_of_answers"])
Correct
For a hint or the solution, uncomment the appropriate line below.
如需提示或解决方案,请取消注释下面相应的行。
#q_5.hint()
q_5.solution()
Solution:
bigquery_experts_query = """
SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
ON q.id = a.parent_Id
WHERE q.tags LIKE '%bigquery%'
GROUP BY a.owner_user_id
"""
# Set up the query (cancel the query if it would use too much of
# your quota, with the limit set to 1 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
bigquery_experts_query_job = client.query(bigquery_experts_query, job_config=safe_config)
# API request - run the query, and return a pandas DataFrame
bigquery_experts_results = bigquery_experts_query_job.to_dataframe()
6) Building a more generally useful service
6) 构建更普遍有用的服务
How could you convert what you've done to a general function a website could call on the backend to get experts on any topic?
如何将您所做的工作转换为网站可以在后端调用的通用函数,以获取任何主题的专家?
Think about it and then check the solution below.
想一想,然后检查下面的解决方案。
# Check your answer (Run this code cell to receive credit!)
q_6.solution()
Solution:
def expert_finder(topic, client):
'''
Returns a DataFrame with the user IDs who have written Stack Overflow answers on a topic.
Inputs:
topic: A string with the topic of interest
client: A Client object that specifies the connection to the Stack Overflow dataset
Outputs:
results: A DataFrame with columns for user_id and number_of_answers. Follows similar logic to bigquery_experts_results shown above.
'''
my_query = """
SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
ON q.id = a.parent_Id
WHERE q.tags like '%{topic}%'
GROUP BY a.owner_user_id
"""
# Set up the query (a real service would have good error handling for
# queries that scan too much data)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
my_query_job = client.query(my_query, job_config=safe_config)
# API request - run the query, and return a pandas DataFrame
results = my_query_job.to_dataframe()
return results
Congratulations!
恭喜!
You know all the key components to use BigQuery and SQL effectively. Your SQL skills are sufficient to unlock many of the world's largest datasets.
您了解有效使用 BigQuery 和 SQL 的所有关键组件。 您的 SQL 技能足以解锁许多世界上最大的数据集。
Want to go play with your new powers? Kaggle has BigQuery datasets available here.
想发挥你的新能力吗? Kaggle 在此处提供了 BigQuery 数据集。
06.exercise-joining-data【练习:关联数据】