Flashield's Blog

Just For My Daily Diary

06.exercise-joining-data [Exercise: Joining Data]

This notebook is an exercise in the SQL course. You can reference the tutorial at this link.


Introduction

Stack Overflow is a widely beloved question and answer site for technical questions. You'll probably use it yourself as you keep using SQL (or any programming language).

Their data is publicly available. What cool things do you think it would be useful for?

Here's one idea:

You could set up a service that identifies the Stack Overflow users who have demonstrated expertise with a specific technology by answering related questions about it, so someone could hire those experts for in-depth help.

In this exercise, you'll write the SQL queries that might serve as the foundation for this type of service.

As usual, run the following cell to set up our feedback system before moving on.

# Set up feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.sql.ex6 import *
print("Setup Complete")
Using Kaggle's public dataset BigQuery integration.
Setup Complete

Run the next cell to fetch the stackoverflow dataset.

from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "stackoverflow" dataset
dataset_ref = client.dataset("stackoverflow", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)
Using Kaggle's public dataset BigQuery integration.

Exercises

1) Explore the data

Before writing queries or JOIN clauses, you'll want to see what tables are available.

Hint: Tab completion is helpful whenever you can't remember a command. Type client. and then hit the tab key. Don't forget the period before hitting tab.

# Get a list of available tables 
tables = list(client.list_tables(dataset))
list_of_tables = [table.table_id for table in tables]  # Your code here

# Print your answer
print(list_of_tables)

# Check your answer
q_1.check()
['badges', 'comments', 'post_history', 'post_links', 'posts_answers', 'posts_moderator_nomination', 'posts_orphaned_tag_wiki', 'posts_privilege_wiki', 'posts_questions', 'posts_tag_wiki', 'posts_tag_wiki_excerpt', 'posts_wiki_placeholder', 'stackoverflow_posts', 'tags', 'users', 'votes']

Correct

For the solution, uncomment the line below.

q_1.solution()

Solution:


# Get a list of available tables 
tables = list(client.list_tables(dataset))
list_of_tables = [table.table_id for table in tables] 

2) Review relevant tables

If you are interested in people who answer questions on a given topic, the posts_answers table is a natural place to look. Run the following cell, and look at the output.

# Construct a reference to the "posts_answers" table
answers_table_ref = dataset_ref.table("posts_answers")

# API request - fetch the table
answers_table = client.get_table(answers_table_ref)

# Preview the first five lines of the "posts_answers" table
client.list_rows(answers_table, max_results=5).to_dataframe()
id title body accepted_answer_id answer_count comment_count community_owned_date creation_date favorite_count last_activity_date last_edit_date last_editor_display_name last_editor_user_id owner_display_name owner_user_id parent_id post_type_id score tags view_count
0 18 None <p>For a table like this:</p>\n\n<pre><code>CR... None None 2 NaT 2008-08-01 05:12:44.193000+00:00 None 2016-06-02 05:56:26.060000+00:00 2016-06-02 05:56:26.060000+00:00 Jeff Atwood 126039 phpguy NaN 17 2 59 None None
1 165 None <p>You can use a <a href="http://sharpdevelop.... None None 0 NaT 2008-08-01 18:04:25.023000+00:00 None 2019-04-06 14:03:51.080000+00:00 2019-04-06 14:03:51.080000+00:00 None 1721793 user2189331 NaN 145 2 10 None None
2 1028 None <p>The VB code looks something like this:</p>\... None None 0 NaT 2008-08-04 04:58:40.300000+00:00 None 2013-02-07 13:22:14.680000+00:00 2013-02-07 13:22:14.680000+00:00 None 395659 user2189331 NaN 947 2 8 None None
3 1073 None <p>My first choice would be a dedicated heap t... None None 0 NaT 2008-08-04 07:51:02.997000+00:00 None 2015-09-01 17:32:32.120000+00:00 2015-09-01 17:32:32.120000+00:00 None 45459 user2189331 NaN 1069 2 29 None None
4 1260 None <p>I found the answer. all you have to do is a... None None 0 NaT 2008-08-04 14:06:02.863000+00:00 None 2016-12-20 08:38:48.867000+00:00 2016-12-20 08:38:48.867000+00:00 None 1221571 Jin NaN 1229 2 1 None None

It isn't clear yet how to find users who answered questions on any given topic. But posts_answers has a parent_id column. If you are familiar with the Stack Overflow site, you might figure out that parent_id identifies the question that each post is answering.

Look at posts_questions using the cell below.

# Construct a reference to the "posts_questions" table
questions_table_ref = dataset_ref.table("posts_questions")

# API request - fetch the table
questions_table = client.get_table(questions_table_ref)

# Preview the first five lines of the "posts_questions" table
client.list_rows(questions_table, max_results=5).to_dataframe()
id title body accepted_answer_id answer_count comment_count community_owned_date creation_date favorite_count last_activity_date last_edit_date last_editor_display_name last_editor_user_id owner_display_name owner_user_id parent_id post_type_id score tags view_count
0 320268 Html.ActionLink doesn’t render # properly <p>When using Html.ActionLink passing a string... NaN 0 0 NaT 2008-11-26 10:42:37.477000+00:00 0 2009-02-06 20:13:54.370000+00:00 NaT None NaN Paulo NaN None 1 0 asp.net-mvc 390
1 324003 Primitive recursion <p>how will i define the function 'simplify' ... NaN 0 0 NaT 2008-11-27 15:12:37.497000+00:00 0 2012-09-25 19:54:40.597000+00:00 2012-09-25 19:54:40.597000+00:00 Marcin 1288.0 None 41000.0 None 1 0 haskell|lambda|functional-programming|lambda-c... 497
2 390605 While vs. Do While <p>I've seen both the blocks of code in use se... 390608.0 0 0 NaT 2008-12-24 01:49:54.230000+00:00 2 2008-12-24 03:08:55.897000+00:00 NaT None NaN Unkwntech 115.0 None 1 0 language-agnostic|loops 11262
3 413246 Protect ASP.NET Source code <p>Im currently doing some research in how to ... NaN 0 0 NaT 2009-01-05 14:23:51.040000+00:00 0 2009-03-24 21:30:22.370000+00:00 2009-01-05 14:42:28.257000+00:00 Tom Anderson 13502.0 Velnias NaN None 1 0 asp.net|deployment|obfuscation 4823
4 454921 Difference between "int[] myArray" and "int my... <blockquote>\n <p><strong>Possible Duplicate:... 454928.0 0 0 NaT 2009-01-18 10:22:52.177000+00:00 0 2009-01-18 10:30:50.930000+00:00 2017-05-23 11:49:26.567000+00:00 None -1.0 Evan Fosmark 49701.0 None 1 0 java|arrays 798

Are there any fields that identify what topic or technology each question is about? If so, how could you find the IDs of users who answered questions about a specific topic?

Think about it, and then check the solution by running the code in the next cell.

# Check your answer (Run this code cell to receive credit!)
q_2.solution()

Solution:
posts_questions has a column called tags which lists the topics/technologies each question is about.

posts_answers has a column called parent_id which identifies the ID of the question each answer is responding to.
posts_answers also has an owner_user_id column which specifies the ID of the user who answered the question.

You can join these two tables to:

  • determine the tags for each answer, and then
  • select the owner_user_id of the answers on the desired tag.

This is exactly what you will do over the next few questions.

3) Selecting the right questions

A lot of this data is text.

We'll explore one last technique in this course which you can apply to this text.

A WHERE clause can limit your results to rows containing certain text by using the LIKE operator. For example, to select just the third row of the pets table from the tutorial (a cat named 'Ripley'), we could match the name exactly with WHERE Name = 'Ripley'.

You can also use % as a "wildcard" for any number of characters. So you can also get the third row with:

query = """
        SELECT * 
        FROM `bigquery-public-data.pet_records.pets` 
        WHERE Name LIKE '%ipl%'
        """

Try this yourself. Write a query that selects the id, title and owner_user_id columns from the posts_questions table.

  • Restrict the results to rows that contain the word "bigquery" in the tags column.
  • Include rows where there is other text in addition to the word "bigquery" (e.g., if a row has a tag "bigquery-sql", your results should include that too).
# Your code here
questions_query = """
                  SELECT id, title, owner_user_id
                  FROM `bigquery-public-data.stackoverflow.posts_questions`
                  WHERE tags LIKE '%bigquery%'
                  """

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
questions_query_job = client.query(questions_query, job_config=safe_config) # Your code goes here
# API request - run the query, and return a pandas DataFrame
questions_results = questions_query_job.to_dataframe() # Your code goes here

# Preview results
print(questions_results.head())
         id                                              title  owner_user_id
0  64345717  Loop by array and union looped result in BigQuery     13304769.0
1  64610766  BigQuery Transfer jobs from S3 stuck pending o...     14549617.0
2  64383871  How to get sum of values in days intervals usi...     12472644.0
3  64251948                BigQuery get row above empty column      4572124.0
4  64323398  SQL: Remove part of string that is in another ...      6089137.0
# Check your answer
q_3.check()

Correct

For a hint or the solution, uncomment the appropriate line below.

q_3.hint()
q_3.solution()

Hint: Your WHERE clause should be WHERE tags LIKE '%bigquery%'.

Solution:


questions_query = """
                  SELECT id, title, owner_user_id
                  FROM `bigquery-public-data.stackoverflow.posts_questions`
                  WHERE tags LIKE '%bigquery%'
                  """

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
questions_query_job = client.query(questions_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
questions_results = questions_query_job.to_dataframe()

4) Your first join

Now that you have a query to select questions on any given topic (in this case, you chose "bigquery"), you can find the answers to those questions with a JOIN.

Write a query that returns the id, body and owner_user_id columns from the posts_answers table for answers to "bigquery"-related questions.

  • You should have one row in your results for each answer to a question that has "bigquery" in the tags.
  • Remember you can get the tags for a question from the tags column in the posts_questions table.

Here's a reminder of what a JOIN looked like in the tutorial:

query = """
        SELECT p.Name AS Pet_Name, o.Name AS Owner_Name
        FROM `bigquery-public-data.pet_records.pets` as p
        INNER JOIN `bigquery-public-data.pet_records.owners` as o 
            ON p.ID = o.Pet_ID
        """

It may be useful to scroll up and review the first several rows of the posts_answers and posts_questions tables.

answers_query = """
                SELECT a.id, a.body, a.owner_user_id
                FROM `bigquery-public-data.stackoverflow.posts_questions` AS q 
                INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                    ON q.id = a.parent_id
                WHERE q.tags LIKE '%bigquery%'
                """
# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
answers_query_job = client.query(answers_query, job_config=safe_config)
# API request - run the query, and return a pandas DataFrame
answers_results = answers_query_job.to_dataframe()
# Preview results
print(answers_results.head())
         id                                               body  owner_user_id
0  57032546  Another solution - same logic, just a diffe...      1391685.0
1  57032790  Recommendation :\n\nselect\...                     10369599.0
2  57036857  Basically, this is the answer\n\n<...              11506172.0
3  57042242  first, I hope that year+week & year+day...           2929192.0
4  57049923  I think the solutions mentioned by others s...       1391685.0

# Check your answer
q_4.check()

Correct

For a hint or the solution, uncomment the appropriate line below.

# q_4.hint()
q_4.solution()

Solution:


answers_query = """
                SELECT a.id, a.body, a.owner_user_id
                FROM `bigquery-public-data.stackoverflow.posts_questions` AS q 
                INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                    ON q.id = a.parent_id
                WHERE q.tags LIKE '%bigquery%'
                """

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 270 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=27*10**10)
answers_query_job = client.query(answers_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
answers_results = answers_query_job.to_dataframe()

5) Answer the question

You have the merge you need. But you want a list of users who have answered many questions... which requires more work beyond your previous result.

Write a new query that has a single row for each user who answered at least one question with a tag that includes the string "bigquery". Your results should have two columns:

  • user_id - contains the owner_user_id column from the posts_answers table
  • number_of_answers - contains the number of answers the user has written to "bigquery"-related questions
# Your code here
bigquery_experts_query = """
                            SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
                            FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
                            INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                                ON q.id = a.parent_id
                            WHERE q.tags LIKE '%bigquery%'
                            GROUP BY a.owner_user_id
                            ORDER BY number_of_answers DESC
                            """

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
bigquery_experts_query_job = client.query(bigquery_experts_query, job_config=safe_config) # Your code goes here
# API request - run the query, and return a pandas DataFrame
bigquery_experts_results = bigquery_experts_query_job.to_dataframe() # Your code goes here

# Preview results
print(bigquery_experts_results.head())
     user_id  number_of_answers
0  5221944.0               5203
1  1144035.0               1634
2   132438.0                898
3  6253347.0                737
4  1366527.0                620
# Check your answer
q_5.check()

Correct

For a hint or the solution, uncomment the appropriate line below.

#q_5.hint()
q_5.solution()

Solution:


bigquery_experts_query = """
                         SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
                         FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
                         INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                             ON q.id = a.parent_id
                         WHERE q.tags LIKE '%bigquery%'
                         GROUP BY a.owner_user_id
                         """

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
bigquery_experts_query_job = client.query(bigquery_experts_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
bigquery_experts_results = bigquery_experts_query_job.to_dataframe()
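The join-and-aggregate shape of this solution can be sanity-checked locally. The sketch below mirrors the query with Python's built-in sqlite3; the column names follow the Stack Overflow schema, but the rows are invented for illustration, and an ORDER BY is added so the output is deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts_questions (id INTEGER, tags TEXT)")
conn.execute("CREATE TABLE posts_answers (id INTEGER, parent_id INTEGER, owner_user_id INTEGER)")

# Two bigquery-tagged questions and one unrelated question
conn.executemany("INSERT INTO posts_questions VALUES (?, ?)",
                 [(1, "sql|bigquery"), (2, "bigquery-sql"), (3, "python")])
# User 10 answers both bigquery questions; user 20 answers one;
# user 30 answers only the python question
conn.executemany("INSERT INTO posts_answers VALUES (?, ?, ?)",
                 [(100, 1, 10), (101, 2, 10), (102, 1, 20), (103, 3, 30)])

rows = conn.execute("""
    SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
    FROM posts_questions AS q
    INNER JOIN posts_answers AS a ON q.id = a.parent_id
    WHERE q.tags LIKE '%bigquery%'
    GROUP BY a.owner_user_id
    ORDER BY number_of_answers DESC
""").fetchall()
print(rows)  # [(10, 2), (20, 1)]
```

User 30 drops out because the WHERE clause filters the joined rows before the GROUP BY counts them.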

6) Building a more generally useful service

How could you convert what you've done to a general function a website could call on the backend to get experts on any topic?

Think about it and then check the solution below.

# Check your answer (Run this code cell to receive credit!)
q_6.solution()

Solution:


def expert_finder(topic, client):
    '''
    Returns a DataFrame with the user IDs who have written Stack Overflow answers on a topic.

    Inputs:
        topic: A string with the topic of interest
        client: A Client object that specifies the connection to the Stack Overflow dataset

    Outputs:
        results: A DataFrame with columns for user_id and number_of_answers. Follows similar logic to bigquery_experts_results shown above.
    '''
    my_query = """
               SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
               FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
               INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                   ON q.id = a.parent_Id
               WHERE q.tags like '%{topic}%'
               GROUP BY a.owner_user_id
               """

    # Set up the query (a real service would have good error handling for 
    # queries that scan too much data)
    safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)      
    my_query_job = client.query(my_query, job_config=safe_config)

    # API request - run the query, and return a pandas DataFrame
    results = my_query_job.to_dataframe()

    return results
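One caveat: formatting topic directly into the SQL text, as this function sketches, is vulnerable to SQL injection if the topic ever comes from end users. A safer pattern is to bind it as a query parameter. Here is a minimal local sketch with Python's sqlite3 and its ? placeholders (BigQuery supports the same idea through its own query parameters):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts_questions (id INTEGER, tags TEXT)")
conn.execute("INSERT INTO posts_questions VALUES (1, 'sql|bigquery')")

topic = "bigquery"
# Bind the topic as a parameter instead of formatting it into the SQL text;
# '||' concatenates the wildcards around the bound value
rows = conn.execute(
    "SELECT id FROM posts_questions WHERE tags LIKE '%' || ? || '%'",
    (topic,),
).fetchall()
print(rows)  # [(1,)]
```

The database driver escapes the bound value, so a malicious topic string cannot change the structure of the query.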

Congratulations!

You know all the key components to use BigQuery and SQL effectively. Your SQL skills are sufficient to unlock many of the world's largest datasets.

Want to go play with your new powers? Kaggle has BigQuery datasets available here.
