This notebook is an exercise in the Data Cleaning course. You can reference the tutorial at this link.

In this exercise, you'll apply what you learned in the Handling missing values tutorial.

Setup

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

from learntools.core import binder
binder.bind(globals())
from learntools.data_cleaning.ex1 import *
print("Setup Complete")

/opt/conda/lib/python3.10/site-packages/learntools/data_cleaning/ex1.py:6: DtypeWarning: Columns (22,32) have mixed types. Specify dtype option on import or set low_memory=False.
  sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")
/tmp/ipykernel_61/3419995878.py:3: DeprecationWarning: product is deprecated as of NumPy 1.25.0, and will be removed in NumPy 2.0. Please use prod instead.
  from learntools.data_cleaning.ex1 import *
/opt/conda/lib/python3.10/site-packages/learntools/data_cleaning/ex1.py:69: FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
  _expected = sf_permits.fillna(method='bfill', axis=0).fillna(0)

Setup Complete

1) Take a first look at the data

Run the next code cell to load in the libraries and dataset you'll use to complete the exercise.

# modules we'll use
import pandas as pd
import numpy as np

# read in all our data
sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")

# set seed for reproducibility
np.random.seed(0)

/tmp/ipykernel_61/3534875831.py:6: DtypeWarning: Columns (22,32) have mixed types. Specify dtype option on import or set low_memory=False.
  sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")
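
The DtypeWarning above says pandas found more than one type in columns 22 and 32 while inferring types. The sketch below shows the two remedies the warning itself names, `low_memory=False` and an explicit `dtype`, on a small in-memory CSV. The CSV text and the column names `permit` and `suffix` are made up for illustration and are not taken from the permits file.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV whose second column mixes numbers and text.
csv_text = "permit,suffix\n1,7\n2,A\n3,\n"

# Option 1: parse the file in a single pass instead of in chunks,
# so each column gets one inferred type.
df_whole = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Option 2: state the type up front; the mixed column is read as strings.
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"suffix": str})

print(df_typed["suffix"].tolist())
print(df_typed["suffix"].isna().sum())
```

With an explicit string type the value `7` is kept as the text `"7"`, and the empty field is still treated as missing.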

Use the code cell below to print the first five rows of the sf_permits DataFrame.

# TODO: Your code here!
sf_permits.head()

	Permit Number	Permit Type	Permit Type Definition	Permit Creation Date	Block	Lot	Street Number	Street Number Suffix	Street Name	Street Suffix	...	Existing Construction Type	Existing Construction Type Description	Proposed Construction Type	Proposed Construction Type Description	Site Permit	Supervisor District	Neighborhoods - Analysis Boundaries	Zipcode	Location	Record ID
0	201505065519	4	sign - erect	05/06/2015	0326	023	140	NaN	Ellis	St	...	3.0	constr type 3	NaN	NaN	NaN	3.0	Tenderloin	94102.0	(37.785719256680785, -122.40852313194863)	1380611233945
1	201604195146	4	sign - erect	04/19/2016	0306	007	440	NaN	Geary	St	...	3.0	constr type 3	NaN	NaN	NaN	3.0	Tenderloin	94102.0	(37.78733980600732, -122.41063199757738)	1420164406718
2	201605278609	3	additions alterations or repairs	05/27/2016	0595	203	1647	NaN	Pacific	Av	...	1.0	constr type 1	1.0	constr type 1	NaN	3.0	Russian Hill	94109.0	(37.7946573324287, -122.42232562979227)	1424856504716
3	201611072166	8	otc alterations permit	11/07/2016	0156	011	1230	NaN	Pacific	Av	...	5.0	wood frame (5)	5.0	wood frame (5)	NaN	3.0	Nob Hill	94109.0	(37.79595867909168, -122.41557405519474)	1443574295566
4	201611283529	6	demolitions	11/28/2016	0342	001	950	NaN	Market	St	...	3.0	constr type 3	NaN	NaN	NaN	6.0	Tenderloin	94102.0	(37.78315261897309, -122.40950883997789)	144548169992

5 rows × 43 columns

Does the dataset have any missing values? Once you have an answer, run the code cell below to get credit for your work.

# Check your answer (Run this code cell to receive credit!)
q1.check()

Correct:

The first five rows of the data do show that several columns have missing values. You can see this in the "Street Number Suffix", "Proposed Construction Type" and "Site Permit" columns, among others.
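
Looking at five rows only reveals missing values that happen to fall in those rows. As a rough sketch, the same question can be answered for a whole DataFrame with `isnull()`. The toy data below is invented and only borrows three column names from the dataset.

```python
import numpy as np
import pandas as pd

# Invented sample rows, not real permit records.
toy = pd.DataFrame({
    "Street Name": ["Ellis", "Geary", "Pacific"],
    "Street Number Suffix": [np.nan, np.nan, "A"],
    "Zipcode": [94102.0, np.nan, 94109.0],
})

# True if at least one cell anywhere in the frame is missing
has_missing = bool(toy.isnull().values.any())

# Names of the columns that contain at least one missing value
cols_with_missing = toy.columns[toy.isnull().any()].tolist()

print(has_missing)
print(cols_with_missing)
```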

# Line below will give you a hint
#q1.hint()

2) How many missing data points do we have?

What percentage of the values in the dataset are missing? Your answer should be a number between 0 and 100. (If 1/4 of the values in the dataset are missing, the answer is 25.)

# TODO: Your code here!
percent_missing = sf_permits.isnull().sum().sum() / sf_permits.size * 100

# Check your answer
q2.check()

Correct

# Lines below will give you a hint or solution code
# q2.hint()
q2.solution()

Solution:

# get the number of missing data points per column
missing_values_count = sf_permits.isnull().sum()

# how many total missing values do we have?
total_cells = np.prod(sf_permits.shape)
total_missing = missing_values_count.sum()

# percent of data that is missing
percent_missing = (total_missing/total_cells) * 100
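
As a sanity check on the formula, here is a sketch on a toy DataFrame small enough to count by hand. The data is invented. It uses `DataFrame.size`, which equals the number of rows times the number of columns, rather than multiplying the shape.

```python
import numpy as np
import pandas as pd

# 3 rows x 4 columns = 12 cells, 3 of them missing
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", "y", None],
    "c": [np.nan, 2.0, 3.0],
    "d": [1, 2, 3],
})

total_cells = toy.size                    # rows * columns
total_missing = toy.isnull().sum().sum()  # missing cells across all columns
percent_missing = total_missing / total_cells * 100

print(percent_missing)  # 25.0
```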

3) Figure out why the data is missing

Look at the columns "Street Number Suffix" and "Zipcode" from the San Francisco Building Permits dataset. Both of these contain missing values.

Which, if either, is missing because the value doesn't exist?
Which, if either, is missing because the value wasn't recorded?

Once you have an answer, run the code cell below.

sf_permits[["Zipcode", "Street Number Suffix"]].isna().sum()

Zipcode                   1716
Street Number Suffix    196684
dtype: int64

# Check your answer (Run this code cell to receive credit!)
q3.check()

Correct:

If a value in the "Street Number Suffix" column is missing, it is likely because the address has no suffix, so the value does not exist. If a value in the "Zipcode" column is missing, it was most likely not recorded, since every address should have a zip code.
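
One rough heuristic that supports this reasoning is the share of rows missing in each column: a field that is empty in nearly every row is more likely optional than unrecorded. The sketch below applies it to invented data. The fractions are a hint rather than proof, and knowledge of what the column means still decides.

```python
import numpy as np
import pandas as pd

# Invented sample rows, not real permit records.
toy = pd.DataFrame({
    "Street Number Suffix": [np.nan, np.nan, np.nan, "A", np.nan],
    "Zipcode": [94102.0, 94102.0, np.nan, 94109.0, 94109.0],
})

# isnull() gives True/False per cell; the mean of booleans is the fraction of True
missing_share = toy.isnull().mean()

print(missing_share)
```

In the real dataset the counts printed above (196684 missing suffixes against 1716 missing zip codes) point the same way.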

# Line below will give you a hint
# q3.hint()

4) Drop missing values: rows

If you removed every row of sf_permits that contains at least one missing value, how many rows would be left?

Note: Do not change the value of sf_permits when checking this.

# TODO: Your code here!
sf_permits.dropna().shape

(0, 43)

Once you have an answer, run the code cell below.

# Check your answer (Run this code cell to receive credit!)
q4.check()

Correct:

There are no rows remaining in the dataset!

# Line below will give you a hint
q4.hint()

Hint: Use sf_permits.dropna() to drop all rows that contain missing values.
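
No rows survive because `dropna()` with default arguments removes a row if any single cell in it is missing. Below is a sketch on invented data. It also shows the `subset` parameter of `DataFrame.dropna`, which is one way to restrict the check to chosen columns. Neither call modifies the original DataFrame.

```python
import numpy as np
import pandas as pd

# Invented sample rows: every row has a gap in at least one column.
toy = pd.DataFrame({
    "Permit Number": [1, 2, 3],
    "Street Number Suffix": [np.nan, "A", np.nan],
    "Zipcode": [94102.0, np.nan, 94109.0],
})

# Default: drop a row if ANY column is missing in it
rows_left_default = toy.dropna().shape[0]

# Only drop rows where "Zipcode" is missing
rows_left_zip_only = toy.dropna(subset=["Zipcode"]).shape[0]

print(rows_left_default, rows_left_zip_only, len(toy))
```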

5) Drop missing values: columns

Now try removing all the columns with empty values.

Create a new DataFrame called sf_permits_with_na_dropped that has all of the columns with empty values removed.
How many columns were removed from the original sf_permits DataFrame? Use this number to set the value of the dropped_columns variable below.

# TODO: Your code here
sf_permits_with_na_dropped = sf_permits.dropna(axis=1)

dropped_columns = len(set(sf_permits.columns) - set(sf_permits_with_na_dropped.columns))

# Check your answer
q5.check()

Correct

# Lines below will give you a hint or solution code
# q5.hint()
q5.solution()

Solution:

# remove all columns with at least one missing value
sf_permits_with_na_dropped = sf_permits.dropna(axis=1)

# calculate number of dropped columns
cols_in_original_dataset = sf_permits.shape[1]
cols_in_na_dropped = sf_permits_with_na_dropped.shape[1]
dropped_columns = cols_in_original_dataset - cols_in_na_dropped
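
The same subtraction can be sketched on a toy DataFrame (invented data). It also uses `Index.difference`, which names the columns that were removed rather than only counting them.

```python
import numpy as np
import pandas as pd

# Invented sample rows: two of the four columns contain a missing value.
toy = pd.DataFrame({
    "Permit Number": [1, 2, 3],
    "Street Number Suffix": [np.nan, "A", np.nan],
    "Zipcode": [94102.0, np.nan, 94109.0],
    "Record ID": [10, 11, 12],
})

# axis=1 drops every column that has at least one missing value
toy_cols_dropped = toy.dropna(axis=1)

dropped_count = toy.shape[1] - toy_cols_dropped.shape[1]
dropped_names = toy.columns.difference(toy_cols_dropped.columns).tolist()

print(dropped_count)
print(dropped_names)
```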

6) Fill in missing values automatically

Try replacing each NaN in the sf_permits data with the value that comes directly after it in the same column, and then replacing any remaining NaN's with 0. Set the result to a new DataFrame sf_permits_with_na_imputed.

# TODO: Your code here
sf_permits_with_na_imputed = sf_permits.bfill().fillna(value=0)

# Check your answer
q6.check()

Correct

# Lines below will give you a hint or solution code
# q6.hint()
q6.solution()

Solution:

sf_permits_with_na_imputed = sf_permits.bfill(axis=0).fillna(0)
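
To see what the two steps do, here is a sketch on a tiny invented DataFrame. Backward fill copies the next valid value in the same column upward. A NaN in the last row has nothing after it, so only the final `fillna(0)` can fill it.

```python
import numpy as np
import pandas as pd

# Invented data with a gap at the top, in the middle, and at the bottom.
toy = pd.DataFrame({
    "a": [np.nan, 2.0, np.nan],
    "b": [1.0, np.nan, 3.0],
})

# Step 1: fill each NaN with the next valid value below it in the same column
# Step 2: anything still missing (a trailing NaN) becomes 0
filled = toy.bfill().fillna(0)

print(filled)
```

Column `a` becomes 2.0, 2.0, 0.0 and column `b` becomes 1.0, 3.0, 3.0.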

More practice

Keep going

In the next lesson, learn how to apply scaling and normalization to transform your data.
