This notebook is an exercise in the Data Cleaning course. You can reference the tutorial at this link.
In this exercise, you'll apply what you learned in the Handling missing values tutorial.
Setup
The questions below will give you feedback on your work. Run the following cell to set up the feedback system.
from learntools.core import binder
binder.bind(globals())
from learntools.data_cleaning.ex1 import *
print("Setup Complete")
/opt/conda/lib/python3.10/site-packages/learntools/data_cleaning/ex1.py:6: DtypeWarning: Columns (22,32) have mixed types. Specify dtype option on import or set low_memory=False.
sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")
/tmp/ipykernel_61/3419995878.py:3: DeprecationWarning: product is deprecated as of NumPy 1.25.0, and will be removed in NumPy 2.0. Please use prod instead.
from learntools.data_cleaning.ex1 import *
/opt/conda/lib/python3.10/site-packages/learntools/data_cleaning/ex1.py:69: FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
_expected = sf_permits.fillna(method='bfill', axis=0).fillna(0)
Setup Complete
1) Take a first look at the data
Run the next code cell to load in the libraries and dataset you'll use to complete the exercise.
# modules we'll use
import pandas as pd
import numpy as np
# read in all our data
sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")
# set seed for reproducibility
np.random.seed(0)
/tmp/ipykernel_61/3534875831.py:6: DtypeWarning: Columns (22,32) have mixed types. Specify dtype option on import or set low_memory=False.
sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")
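The DtypeWarning above names two possible remedies: specify the dtype on import, or set low_memory=False. The sketch below tries both on a tiny in-memory CSV; the column names and values are made up for illustration and are not taken from Building_Permits.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for a column that mixes numbers and text
csv_text = "id,unit\n1,12\n2,A\n3,7\n"

# Option 1: declare the dtype up front so the column is read as text throughout
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"unit": str})

# Option 2: let pandas look at the whole file before inferring column types
whole_file = pd.read_csv(io.StringIO(csv_text), low_memory=False)

print(as_text["unit"].tolist())
print(whole_file.dtypes)
```

Declaring the dtype is the more explicit choice when you already know a column should be treated as text.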
Use the code cell below to print the first five rows of the sf_permits DataFrame.
# TODO: Your code here!
sf_permits.head()
 | Permit Number | Permit Type | Permit Type Definition | Permit Creation Date | Block | Lot | Street Number | Street Number Suffix | Street Name | Street Suffix | ... | Existing Construction Type | Existing Construction Type Description | Proposed Construction Type | Proposed Construction Type Description | Site Permit | Supervisor District | Neighborhoods - Analysis Boundaries | Zipcode | Location | Record ID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 201505065519 | 4 | sign - erect | 05/06/2015 | 0326 | 023 | 140 | NaN | Ellis | St | ... | 3.0 | constr type 3 | NaN | NaN | NaN | 3.0 | Tenderloin | 94102.0 | (37.785719256680785, -122.40852313194863) | 1380611233945 |
1 | 201604195146 | 4 | sign - erect | 04/19/2016 | 0306 | 007 | 440 | NaN | Geary | St | ... | 3.0 | constr type 3 | NaN | NaN | NaN | 3.0 | Tenderloin | 94102.0 | (37.78733980600732, -122.41063199757738) | 1420164406718 |
2 | 201605278609 | 3 | additions alterations or repairs | 05/27/2016 | 0595 | 203 | 1647 | NaN | Pacific | Av | ... | 1.0 | constr type 1 | 1.0 | constr type 1 | NaN | 3.0 | Russian Hill | 94109.0 | (37.7946573324287, -122.42232562979227) | 1424856504716 |
3 | 201611072166 | 8 | otc alterations permit | 11/07/2016 | 0156 | 011 | 1230 | NaN | Pacific | Av | ... | 5.0 | wood frame (5) | 5.0 | wood frame (5) | NaN | 3.0 | Nob Hill | 94109.0 | (37.79595867909168, -122.41557405519474) | 1443574295566 |
4 | 201611283529 | 6 | demolitions | 11/28/2016 | 0342 | 001 | 950 | NaN | Market | St | ... | 3.0 | constr type 3 | NaN | NaN | NaN | 6.0 | Tenderloin | 94102.0 | (37.78315261897309, -122.40950883997789) | 144548169992 |
5 rows × 43 columns
Does the dataset have any missing values? Once you have an answer, run the code cell below to get credit for your work.
# Check your answer (Run this code cell to receive credit!)
q1.check()
Correct:
The first five rows of the data do show that several columns have missing values. You can see this in the "Street Number Suffix", "Proposed Construction Type" and "Site Permit" columns, among others.
# Line below will give you a hint
#q1.hint()
2) How many missing data points do we have?
What percentage of the values in the dataset are missing? Your answer should be a number between 0 and 100. (If 1/4 of the values in the dataset are missing, the answer is 25.)
# TODO: Your code here!
percent_missing = sf_permits.isnull().sum().sum() / (sf_permits.count().sum() + sf_permits.isnull().sum().sum()) * 100
# Check your answer
q2.check()
Correct
# Lines below will give you a hint or solution code
# q2.hint()
q2.solution()
Solution:
# get the number of missing data points per column
missing_values_count = sf_permits.isnull().sum()
# how many total missing values do we have?
total_cells = np.prod(sf_permits.shape)
total_missing = missing_values_count.sum()
# percent of data that is missing
percent_missing = (total_missing/total_cells) * 100
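As a quick sanity check of the same calculation, here is a toy DataFrame in which exactly a quarter of the cells are missing, matching the "1/4 missing means 25" example in the question. The data is made up, and DataFrame.size is used here as one alternative way to count cells.

```python
import numpy as np
import pandas as pd

# Toy data: 3 of the 12 cells are missing
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, "x", "y", "z"],
    "c": [1.0, 2.0, np.nan, 4.0],
})

total_missing = toy.isnull().sum().sum()  # missing cells across all columns
total_cells = toy.size                    # rows * columns
percent_missing = total_missing / total_cells * 100

print(percent_missing)  # 25.0
```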
3) Figure out why the data is missing
Look at the columns "Street Number Suffix" and "Zipcode" from the San Francisco Building Permits dataset. Both of these contain missing values.
- Which, if either, are missing because they don't exist?
- Which, if either, are missing because they weren't recorded?
Once you have an answer, run the code cell below.
sf_permits[["Zipcode", "Street Number Suffix"]].isna().sum()
Zipcode 1716
Street Number Suffix 196684
dtype: int64
# Check your answer (Run this code cell to receive credit!)
q3.check()
Correct:
If a value in the "Street Number Suffix" column is missing, it is likely because it does not exist. If a value in the "Zipcode" column is missing, it was not recorded.
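One rough way to support this kind of reasoning is to compare the share of missing values per column: a field that is empty in most rows is more likely to be optional, while a field that is empty only occasionally looks more like a recording gap. This is only a heuristic and still needs checking against what the column means. The sketch below uses made-up rows that merely mimic the pattern in sf_permits.

```python
import numpy as np
import pandas as pd

# Made-up rows that only mimic the pattern seen in sf_permits
toy = pd.DataFrame({
    "Street Number Suffix": [np.nan, np.nan, "A", np.nan, np.nan],
    "Zipcode": [94102.0, 94109.0, np.nan, 94102.0, 94109.0],
})

# Share of missing values per column, as a percentage
missing_share = toy.isna().mean() * 100
print(missing_share)
```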
# Line below will give you a hint
# q3.hint()
4) Drop missing values: rows
If you removed all of the rows of sf_permits with missing values, how many rows are left?
Note: Do not change the value of sf_permits when checking this.
# TODO: Your code here!
sf_permits.dropna().shape
(0, 43)
Once you have an answer, run the code cell below.
# Check your answer (Run this code cell to receive credit!)
q4.check()
Correct:
There are no rows remaining in the dataset!
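The same effect is easy to reproduce on a small scale. In the made-up frame below, every row has at least one NaN, so dropping rows removes everything; dropna() returns a new DataFrame, so the original is left unchanged.

```python
import numpy as np
import pandas as pd

# Toy frame in which every row has at least one NaN
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 2.0, np.nan],
})

rows_left = toy.dropna().shape[0]

print(rows_left)  # 0
print(toy.shape)  # (3, 2): the original frame is untouched
```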
# Line below will give you a hint
q4.hint()
Hint: Use sf_permits.dropna() to drop all missing rows.
5) Drop missing values: columns
Now try removing all the columns with empty values.
- Create a new DataFrame called sf_permits_with_na_dropped that has all of the columns with empty values removed.
- How many columns were removed from the original sf_permits DataFrame? Use this number to set the value of the dropped_columns variable below.
# TODO: Your code here
sf_permits_with_na_dropped = sf_permits.dropna(axis=1)
dropped_columns = len(set(sf_permits.columns)- set(sf_permits_with_na_dropped.columns))
# Check your answer
q5.check()
Correct
# Lines below will give you a hint or solution code
# q5.hint()
q5.solution()
Solution:
# remove all columns with at least one missing value
sf_permits_with_na_dropped = sf_permits.dropna(axis=1)
# calculate number of dropped columns
cols_in_original_dataset = sf_permits.shape[1]
cols_in_na_dropped = sf_permits_with_na_dropped.shape[1]
dropped_columns = cols_in_original_dataset - cols_in_na_dropped
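To see the column-wise drop on a small scale, the sketch below applies the same steps to a made-up three-column frame. Note that a single NaN is enough for a column to be removed.

```python
import numpy as np
import pandas as pd

# Toy frame: one complete column, two with missing values
toy = pd.DataFrame({
    "complete": [1, 2, 3],
    "partly_missing": [1.0, np.nan, 3.0],
    "all_missing": [np.nan, np.nan, np.nan],
})

# axis=1 drops columns (not rows) that contain any NaN
toy_dropped = toy.dropna(axis=1)
dropped = toy.shape[1] - toy_dropped.shape[1]

print(list(toy_dropped.columns))  # ['complete']
print(dropped)                    # 2
```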
6) Fill in missing values automatically
Try replacing each NaN in the sf_permits data with the value that comes directly after it in the same column, and then replacing any remaining NaNs with 0. Set the result to a new DataFrame sf_permits_with_na_imputed.
# TODO: Your code here
sf_permits_with_na_imputed = sf_permits.bfill().fillna(value=0)
# Check your answer
q6.check()
Correct
# Lines below will give you a hint or solution code
# q6.hint()
q6.solution()
Solution:
sf_permits_with_na_imputed = sf_permits.bfill(axis=0).fillna(0)
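A small made-up frame shows why the second step is needed: a backward fill has nothing to copy into a NaN in the last row, so that value is still missing until fillna(0) replaces it. The sketch uses the bfill() method that the FutureWarning during setup recommends.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [np.nan, 2.0, np.nan, 4.0],
    "b": [1.0, np.nan, 3.0, np.nan],  # trailing NaN has no later value
})

# Fill each NaN with the next value in its column, then zero out the rest
imputed = toy.bfill().fillna(0)

print(imputed["a"].tolist())  # [2.0, 2.0, 4.0, 4.0]
print(imputed["b"].tolist())  # [1.0, 3.0, 3.0, 0.0]
```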
More practice
If you're looking for more practice handling missing values:
- Check out this notebook on handling missing values using scikit-learn's imputer.
- Look back at the "Zipcode" column in the sf_permits dataset, which has some missing values. How would you go about figuring out what the actual zipcode of each address should be? (You might try using another dataset. You can search for datasets about San Francisco on the Datasets listing.)
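One possible shape for the zipcode idea is sketched below: look each address up in a second table and fill only the gaps. The street_to_zip table is hypothetical and the rows are made up for illustration. A street name alone rarely pins down a zipcode, so a real lookup would probably need the full address or the "Location" coordinates.

```python
import numpy as np
import pandas as pd

# Made-up permits, two of them with a missing zipcode
permits = pd.DataFrame({
    "Street Name": ["Ellis", "Geary", "Ellis"],
    "Zipcode": [94102.0, np.nan, np.nan],
})

# Hypothetical reference table built from a second dataset
street_to_zip = pd.Series({"Ellis": 94102.0, "Geary": 94102.0})

# Fill only the gaps; recorded zipcodes are kept as they are
looked_up = permits["Street Name"].map(street_to_zip)
permits["Zipcode"] = permits["Zipcode"].fillna(looked_up)

print(permits)
```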
Keep going
In the next lesson, learn how to apply scaling and normalization to transform your data.