Flashield's Blog

Just For My Daily Diary

02.exercise-missing-values (Exercise: Missing Values)

This notebook is an exercise in the Intermediate Machine Learning course. You can reference the tutorial at this link.


Now it's your turn to test your new knowledge of missing-value handling. You'll probably find it makes a big difference.

Setup

The questions will give you feedback on your work. Run the following cell to set up the feedback system.

# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv") 
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex2 import *
print("Setup Complete")
Setup Complete

In this exercise, you will work with data from the Housing Prices Competition for Kaggle Learn Users.

[Image: Ames Housing dataset]

Run the next code cell without changes to load the training and validation sets in X_train, X_valid, y_train, and y_valid. The test set is loaded in X_test.

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X_full = pd.read_csv('../input/train.csv', index_col='Id')
X_test_full = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)

# To keep things simple, we'll use only numerical predictors
X = X_full.select_dtypes(exclude=['object'])
X_test = X_test_full.select_dtypes(exclude=['object'])

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

Use the next code cell to print the first five rows of the data.

X_train.head()
Id   MSSubClass  LotFrontage  LotArea  OverallQual  OverallCond  YearBuilt  YearRemodAdd  MasVnrArea  BsmtFinSF1  BsmtFinSF2  ...  GarageArea  WoodDeckSF  OpenPorchSF  EnclosedPorch  3SsnPorch  ScreenPorch  PoolArea  MiscVal  MoSold  YrSold
619  20          90.0         11694    9            5            2007       2007          452.0       48          0           ...  774         0           108          0              0          260          0         0        7       2007
871  20          60.0         6600     5            5            1962       1962          0.0         0           0           ...  308         0           0            0              0          0            0         0        8       2009
93   30          80.0         13360    5            7            1921       2006          0.0         713         0           ...  432         0           0            44             0          0            0         0        8       2009
818  20          NaN          13265    8            5            2002       2002          148.0       1218        0           ...  857         150         59           0              0          0            0         0        7       2008
303  20          118.0        13704    7            5            2001       2002          150.0       0           0           ...  843         468         81           0              0          0            0         0        1       2006

5 rows × 36 columns

You can already see missing values in the first several rows (for example, LotFrontage is NaN for Id 818). In the next step, you'll obtain a more comprehensive understanding of the missing values in the dataset.

Step 1: Preliminary investigation

Run the code cell below without changes.

# Shape of training data (num_rows, num_columns)
print(X_train.shape)

# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])
(1168, 36)
LotFrontage    212
MasVnrArea       6
GarageYrBlt     58
dtype: int64
missing_val_count_by_column[missing_val_count_by_column > 0].sum()
276

Part A

Use the above output to answer the questions below.

# Fill in the line below: How many rows are in the training data?
# num_rows = 1168

num_rows = X_train.shape[0]

# Fill in the line below: How many columns in the training data
# have missing values?
# num_cols_with_missing = 3
num_cols_with_missing = len(missing_val_count_by_column[missing_val_count_by_column > 0])

# Fill in the line below: How many missing entries are contained in 
# all of the training data?
# tot_missing = 276

tot_missing = missing_val_count_by_column[missing_val_count_by_column > 0].sum()

# Check your answers
step_1.a.check()

Correct

# Lines below will give you a hint or solution code
# step_1.a.hint()
step_1.a.solution()

Solution:

# How many rows are in the training data?
num_rows = 1168

# How many columns in the training data have missing values?
num_cols_with_missing = 3

# How many missing entries are contained in all of the training data?
tot_missing = 212 + 6 + 58

Part B

Considering your answers above, what do you think is likely the best approach to dealing with the missing values?

# Check your answer (Run this code cell to receive credit!)
step_1.b.check()

Correct:

Since there are relatively few missing entries in the data (the column with the greatest percentage of missing values is missing less than 20% of its entries), we can expect that dropping columns is unlikely to yield good results. This is because we'd be throwing away a lot of valuable data, and so imputation will likely perform better.

step_1.b.hint()

Hint: Does the dataset have a lot of missing values, or just a few? Would we lose much information if we completely ignored the columns with missing entries?
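
The "less than 20%" figure in the feedback above can be checked directly from the Step 1 counts. A minimal sketch, assuming only the numbers printed earlier (1168 rows, 36 columns, and the three per-column missing counts); the variable names missing_pct and overall_pct are introduced here for illustration:

```python
import pandas as pd

# Counts copied from the Step 1 output above.
num_rows = 1168
num_cols = 36
missing_counts = pd.Series({"LotFrontage": 212, "MasVnrArea": 6, "GarageYrBlt": 58})

# Percentage of rows that are missing in each affected column.
missing_pct = missing_counts / num_rows * 100
print(missing_pct.round(1))

# Percentage of all cells (rows x columns) that are missing.
overall_pct = missing_counts.sum() / (num_rows * num_cols) * 100
print(f"{overall_pct:.2f}% of all entries are missing")
```

LotFrontage is the worst column at roughly 18% missing, and well under 1% of all cells are missing overall, which is why dropping whole columns discards far more information than it needs to.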

To compare different approaches to dealing with missing values, you'll use the same score_dataset() function as in the tutorial. This function reports the mean absolute error (MAE) from a random forest model.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
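
For reference, MAE is the average of the absolute differences between the true and predicted values, so lower is better. A small sketch with made-up prices (y_true and y_pred are hypothetical and not taken from the dataset) showing that the hand computation agrees with mean_absolute_error:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical sale prices and predictions, for illustration only.
y_true = np.array([200000, 150000, 320000])
y_pred = np.array([210000, 140000, 300000])

# MAE is the mean of the absolute differences between truth and prediction.
mae_by_hand = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_by_hand, mae_sklearn)
```

Because the error is in the same units as the target, the MAE values reported in this exercise can be read directly as dollars of sale price.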

Step 2: Drop columns with missing values

In this step, you'll preprocess the data in X_train and X_valid to remove columns with missing values. Set the preprocessed DataFrames to reduced_X_train and reduced_X_valid, respectively.

missing_val_count_by_column[missing_val_count_by_column > 0].index
Index(['LotFrontage', 'MasVnrArea', 'GarageYrBlt'], dtype='object')
# Fill in the line below: get names of columns with missing values
cols_with_missing = missing_val_count_by_column[missing_val_count_by_column > 0].index

# Fill in the lines below: drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

# Check your answers
step_2.check()

Correct

# Lines below will give you a hint or solution code
#step_2.hint()
step_2.solution()

Solution:

# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
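
Note that the columns to drop are chosen from the training data and then removed from both sets, so the two keep the same features. A minimal sketch on hypothetical toy data (toy_train and toy_valid are stand-ins, not part of the exercise):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for X_train and X_valid.
toy_train = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, np.nan, 68.0],
    "YearBuilt": [2003, 1976, 2001],
})
toy_valid = pd.DataFrame({
    "LotArea": [9550, 14260],
    "LotFrontage": [60.0, 84.0],
    "YearBuilt": [1915, 2000],
})

# Decide which columns to drop by looking at the training data only.
cols_with_missing = [col for col in toy_train.columns
                     if toy_train[col].isnull().any()]

# Drop the same columns from both sets so their features still match.
reduced_train = toy_train.drop(cols_with_missing, axis=1)
reduced_valid = toy_valid.drop(cols_with_missing, axis=1)

print(cols_with_missing)
print(list(reduced_valid.columns))
```

This approach would not catch a column that is complete in training but has gaps in validation, so it is worth checking the validation set separately if that could happen.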

Run the next code cell without changes to obtain the MAE for this approach.

print("MAE (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))
MAE (Drop columns with missing values):
17837.82570776256

Step 3: Imputation

Part A

Use the next code cell to impute missing values with the mean value along each column. Set the preprocessed DataFrames to imputed_X_train and imputed_X_valid. Make sure that the column names match those in X_train and X_valid.

from sklearn.impute import SimpleImputer

# Fill in the lines below: imputation
set_impute = SimpleImputer(strategy='mean') # Your code here
set_impute.fit(X_train)
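
After fitting, the remaining work is to transform both sets and restore the column names, as the instructions for this part ask. A minimal sketch on hypothetical toy data (toy_train and toy_valid stand in for X_train and X_valid and are not part of the exercise), which may differ from the course's own solution:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for X_train and X_valid.
toy_train = pd.DataFrame(
    {"LotFrontage": [60.0, np.nan, 80.0], "LotArea": [8000, 9000, 10000]},
    index=[619, 871, 93],
)
toy_valid = pd.DataFrame(
    {"LotFrontage": [np.nan, 50.0], "LotArea": [7000, 12000]},
    index=[818, 303],
)

# Learn each column's mean from the training data only.
imputer = SimpleImputer(strategy="mean")
imputer.fit(toy_train)

# By default transform() returns a NumPy array, so rebuild DataFrames and
# restore the original column names and index.
imputed_train = pd.DataFrame(imputer.transform(toy_train),
                             columns=toy_train.columns, index=toy_train.index)
imputed_valid = pd.DataFrame(imputer.transform(toy_valid),
                             columns=toy_valid.columns, index=toy_valid.index)

print(imputed_train)
print(imputed_valid)
```

The validation set is filled with the training means rather than its own, so no information from the validation data leaks into preprocessing.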