This notebook is an exercise in the SQL course. You can reference the tutorial at this link.
Introduction
介绍
The first test of your new data exploration skills uses data describing crime in the city of Chicago.
对您的新数据探索技能的首次测试,使用描述芝加哥市犯罪的数据。
Before you get started, run the following cell. It sets up the automated feedback system to review your answers.
在开始之前,请运行以下单元格。 它设置自动反馈系统来审查您的答案。
# Set up feedack system
from learntools.core import binder
binder.bind(globals())
from learntools.sql.ex1 import *
print("Setup Complete")
Using Kaggle's public dataset BigQuery integration.
Setup Complete
Use the next code cell to fetch the dataset.
使用下一个代码单元来获取数据集。
from google.cloud import bigquery
# Create a "Client" object
client = bigquery.Client()
# Construct a reference to the "chicago_crime" dataset
# 构建对“chicago_crime”数据集的引用
dataset_ref = client.dataset("chicago_crime", project="bigquery-public-data")
# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)
Using Kaggle's public dataset BigQuery integration.
Exercises
练习
1) Count tables in the dataset
1) 统计数据集中的表数
How many tables are in the Chicago Crime dataset?
芝加哥犯罪数据集中有多少张表?
# Write the code you need here to figure out the answer
tables = list(client.list_tables(dataset))
# Print names of all tables in the dataset (there are four!)
for table in tables:
print(table.table_id)
crime
num_tables = 1 # Store the answer as num_tables and then run this cell
# Check your answer
q_1.check()
Correct
For a hint or the solution, uncomment the appropriate line below.
# q_1.hint()
q_1.solution()
Solution:
# List all the tables in the "chicago_crime" dataset
tables = list(client.list_tables(dataset))
# Print number of tables in the dataset
print(len(tables))
num_tables = 1
2) Explore the table schema
2) 探索表模式
How many columns in the crime
table have TIMESTAMP
data?
crime
表中有多少列有TIMESTAMP
数据?
# Write the code to figure out the answer
table_ref = dataset_ref.table("crime")
table = client.get_table(table_ref)
field_count = 0
for schema in table.schema:
if schema.field_type == 'TIMESTAMP':
field_count += 1
print(field_count)
2
num_timestamp_fields = 2 # Put your answer here
# Check your answer
q_2.check()
Correct
For a hint or the solution, uncomment the appropriate line below.
如需提示或解决方案,请取消注释下面相应的行。
# q_2.hint()
q_2.solution()
Solution:
# Construct a reference to the "crime" table
table_ref = dataset_ref.table("crime")
# API request - fetch the table
table = client.get_table(table_ref)
# Print information on all the columns in the "crime" table in the "chicago_crime" dataset
print(table.schema)
num_timestamp_fields = 2
3) Create a crime map
3) 创建犯罪地图
If you wanted to create a map with a dot at the location of each crime, what are the names of the two fields you likely need to pull out of the crime
table to plot the crimes on a map?
如果您想创建一个在每个犯罪地点都有一个点的地图,那么您可能需要从犯罪
表中提取两个字段的名称来在地图上绘制犯罪情况?
# Write the code here to explore the data so you can find the answer
table.schema
[SchemaField('unique_key', 'INTEGER', 'REQUIRED', None, (), None),
SchemaField('case_number', 'STRING', 'NULLABLE', None, (), None),
SchemaField('date', 'TIMESTAMP', 'NULLABLE', None, (), None),
SchemaField('block', 'STRING', 'NULLABLE', None, (), None),
SchemaField('iucr', 'STRING', 'NULLABLE', None, (), None),
SchemaField('primary_type', 'STRING', 'NULLABLE', None, (), None),
SchemaField('description', 'STRING', 'NULLABLE', None, (), None),
SchemaField('location_description', 'STRING', 'NULLABLE', None, (), None),
SchemaField('arrest', 'BOOLEAN', 'NULLABLE', None, (), None),
SchemaField('domestic', 'BOOLEAN', 'NULLABLE', None, (), None),
SchemaField('beat', 'INTEGER', 'NULLABLE', None, (), None),
SchemaField('district', 'INTEGER', 'NULLABLE', None, (), None),
SchemaField('ward', 'INTEGER', 'NULLABLE', None, (), None),
SchemaField('community_area', 'INTEGER', 'NULLABLE', None, (), None),
SchemaField('fbi_code', 'STRING', 'NULLABLE', None, (), None),
SchemaField('x_coordinate', 'FLOAT', 'NULLABLE', None, (), None),
SchemaField('y_coordinate', 'FLOAT', 'NULLABLE', None, (), None),
SchemaField('year', 'INTEGER', 'NULLABLE', None, (), None),
SchemaField('updated_on', 'TIMESTAMP', 'NULLABLE', None, (), None),
SchemaField('latitude', 'FLOAT', 'NULLABLE', None, (), None),
SchemaField('longitude', 'FLOAT', 'NULLABLE', None, (), None),
SchemaField('location', 'STRING', 'NULLABLE', None, (), None)]
fields_for_plotting = ['latitude', 'longitude'] # Put your answers here
# Check your answer
q_3.check()
Correct
For a hint or the solution, uncomment the appropriate line below.
# q_3.hint()
q_3.solution()
Solution:
fields_for_plotting = ['latitude', 'longitude']
Thinking about the question above, there are a few columns that appear to have geographic data. Look at a few values (with the list_rows()
command) to see if you can determine their relationship. Two columns will still be hard to interpret. But it should be obvious how the location
column relates to latitude
and longitude
.
思考上面的问题,有几列似乎有地理数据。 查看几个值(使用list_rows()
命令),看看是否可以确定它们的关系。 两列仍然很难解释。 但位置
列与纬度
和经度
的关系应该是显而易见的。
# Scratch space for your code
client.list_rows(table, max_results=10).to_dataframe()[['latitude', 'longitude','location']]
latitude | longitude | location | |
---|---|---|---|
0 | 41.885879 | -87.630979 | (41.885879253, -87.630979425) |
1 | 41.882478 | -87.637028 | (41.882477553, -87.637028112) |
2 | 41.886013 | -87.629505 | (41.886012928, -87.629505318) |
3 | 41.885908 | -87.626289 | (41.885908101, -87.626289429) |
4 | 41.883375 | -87.629428 | (41.883375237, -87.629427772) |
5 | 41.885280 | -87.626267 | (41.885279532, -87.626266981) |
6 | 41.878864 | -87.638503 | (41.878864086, -87.638503099) |
7 | 41.885482 | -87.627936 | (41.885481891, -87.627935689) |
8 | 41.885347 | -87.627930 | (41.885347387, -87.627929824) |
9 | 41.878931 | -87.639586 | (41.878931004, -87.639585621) |
Keep going
继续前进
You've looked at the schema, but you haven't yet done anything exciting with the data itself. Things get more interesting when you get to the data, so keep going to write your first SQL query.
您已经查看了架构,但还没有对数据本身做任何令人兴奋的事情。 当您获取数据时,事情会变得更加有趣,因此请继续编写您的第一个 SQL 查询。