Welcome to the Data Cleaning course on Kaggle Learn!
Data cleaning is a key part of data science, but it can be deeply frustrating. Why are some of your text fields garbled? What should you do about those missing values? Why aren’t your dates formatted correctly? How can you quickly clean up inconsistent data entry? In this course, you'll learn why you've run into these problems and, more importantly, how to fix them!
In this course, you’ll learn how to tackle some of the most common data cleaning problems so you can get to actually analyzing your data faster. You’ll work through five hands-on exercises with real, messy data and answer some of your most commonly-asked data cleaning questions.
In this notebook, we'll look at how to deal with missing values.
Take a first look at the data
The first thing we'll need to do is load in the libraries and dataset we'll be using.
For demonstration, we'll use a dataset of events that occurred in American Football games. In the following exercise, you'll apply your new skills to a dataset of building permits issued in San Francisco.
# modules we'll use
import pandas as pd
import numpy as np
# read in all our data
nfl_data = pd.read_csv("../00 datasets/maxhorowitz/nflplaybyplay2009to2016/NFL Play by Play 2009-2017 (v4).csv")
# set seed for reproducibility
np.random.seed(0)
/tmp/ipykernel_13494/4062007822.py:6: DtypeWarning: Columns (25,51) have mixed types. Specify dtype option on import or set low_memory=False.
nfl_data = pd.read_csv("../00 datasets/maxhorowitz/nflplaybyplay2009to2016/NFL Play by Play 2009-2017 (v4).csv")
The first thing to do when you get a new dataset is take a look at some of it. This lets you see that it all read in correctly and gives you an idea of what's going on with the data. In this case, let's see if there are any missing values, which will be represented as `NaN` or `None`.
# look at the first five rows of the nfl_data file.
# I can see a handful of missing data already!
nfl_data.head()
Date | GameID | Drive | qtr | down | time | TimeUnder | TimeSecs | PlayTimeDiff | SideofField | ... | yacEPA | Home_WP_pre | Away_WP_pre | Home_WP_post | Away_WP_post | Win_Prob | WPA | airWPA | yacWPA | Season | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2009-09-10 | 2009091000 | 1 | 1 | NaN | 15:00 | 15 | 3600.0 | 0.0 | TEN | ... | NaN | 0.485675 | 0.514325 | 0.546433 | 0.453567 | 0.485675 | 0.060758 | NaN | NaN | 2009 |
1 | 2009-09-10 | 2009091000 | 1 | 1 | 1.0 | 14:53 | 15 | 3593.0 | 7.0 | PIT | ... | 1.146076 | 0.546433 | 0.453567 | 0.551088 | 0.448912 | 0.546433 | 0.004655 | -0.032244 | 0.036899 | 2009 |
2 | 2009-09-10 | 2009091000 | 1 | 1 | 2.0 | 14:16 | 15 | 3556.0 | 37.0 | PIT | ... | NaN | 0.551088 | 0.448912 | 0.510793 | 0.489207 | 0.551088 | -0.040295 | NaN | NaN | 2009 |
3 | 2009-09-10 | 2009091000 | 1 | 1 | 3.0 | 13:35 | 14 | 3515.0 | 41.0 | PIT | ... | -5.031425 | 0.510793 | 0.489207 | 0.461217 | 0.538783 | 0.510793 | -0.049576 | 0.106663 | -0.156239 | 2009 |
4 | 2009-09-10 | 2009091000 | 1 | 1 | 4.0 | 13:27 | 14 | 3507.0 | 8.0 | PIT | ... | NaN | 0.461217 | 0.538783 | 0.558929 | 0.441071 | 0.461217 | 0.097712 | NaN | NaN | 2009 |
5 rows × 102 columns
Yep, it looks like there are some missing values.
How many missing data points do we have?
Ok, now we know that we do have some missing values. Let's see how many we have in each column.
# get the number of missing data points per column
missing_values_count = nfl_data.isnull().sum()
# look at the # of missing points in the first ten columns
missing_values_count[0:10]
Date 0
GameID 0
Drive 0
qtr 0
down 61154
time 224
TimeUnder 0
TimeSecs 224
PlayTimeDiff 444
SideofField 528
dtype: int64
That seems like a lot! It might be helpful to see what percentage of the values in our dataset were missing to give us a better sense of the scale of this problem:
# how many total missing values do we have?
total_cells = np.prod(nfl_data.shape)
total_missing = missing_values_count.sum()
# percent of data that is missing
percent_missing = (total_missing/total_cells) * 100
print(percent_missing)
27.66722370547874
Wow, over a quarter of the cells in this dataset are empty! In the next step, we're going to take a closer look at some of the columns with missing values and try to figure out what might be going on with them.
Figure out why the data is missing
This is the point at which we get into the part of data science that I like to call "data intuition", by which I mean "really looking at your data and trying to figure out why it is the way it is and how that will affect your analysis". It can be a frustrating part of data science, especially if you're newer to the field and don't have a lot of experience. For dealing with missing values, you'll need to use your intuition to figure out why the value is missing. One of the most important questions you can ask yourself to help figure this out is this:
Is this value missing because it wasn't recorded or because it doesn't exist?
If a value is missing because it doesn't exist (like the height of the oldest child of someone who doesn't have any children), then it doesn't make sense to try and guess what it might be. These values you probably do want to keep as `NaN`. On the other hand, if a value is missing because it wasn't recorded, then you can try to guess what it might have been based on the other values in that column and row. This is called imputation, and we'll learn how to do it next! 🙂
Let's work through an example. Looking at the number of missing values in the `nfl_data` dataframe, I notice that the column "TimeSecs" has a lot of missing values in it:
# look at all the columns with at least one missing value
missing_values_count[missing_values_count>0]
down 61154
time 224
TimeSecs 224
PlayTimeDiff 444
SideofField 528
...
Away_WP_post 26587
Win_Prob 25009
WPA 5541
airWPA 248501
yacWPA 248762
Length: 65, dtype: int64
By looking at the documentation, I can see that this column has information on the number of seconds left in the game when the play was made. This means that these values are probably missing because they were not recorded, rather than because they don't exist. So, it would make sense for us to try and guess what they should be rather than just leaving them as NA's.
On the other hand, there are other fields, like "PenalizedTeam", that also have a lot of missing values. In this case, though, the field is missing because if there was no penalty then it doesn't make sense to say which team was penalized. For this column, it would make more sense to either leave it empty or to add a third value like "neither" and use that to replace the NA's.
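That idea can be sketched on a tiny toy dataframe (a stand-in for the full dataset; the "PenalizedTeam" column name comes from the dataset documentation, but the values here are made up):

```python
import numpy as np
import pandas as pd

# a tiny stand-in for nfl_data: "PenalizedTeam" is NaN when no penalty occurred
plays = pd.DataFrame({
    "PlayType": ["Pass", "Run", "Pass"],
    "PenalizedTeam": ["PIT", np.nan, np.nan],
})

# these NaN's mean "no penalty", so encode that as an explicit third category
plays["PenalizedTeam"] = plays["PenalizedTeam"].fillna("neither")
print(plays["PenalizedTeam"].tolist())  # ['PIT', 'neither', 'neither']
```

This keeps the "doesn't exist" cases distinguishable from genuinely unrecorded values instead of mixing both into `NaN`.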
Tip: This is a great place to read over the dataset documentation if you haven't already! If you're working with a dataset that you've gotten from another person, you can also try reaching out to them to get more information.
If you're doing very careful data analysis, this is the point at which you'd look at each column individually to figure out the best strategy for filling those missing values. For the rest of this notebook, we'll cover some "quick and dirty" techniques that can help you with missing values but will probably also end up removing some useful information or adding some noise to your data.
Drop missing values
If you're in a hurry or don't have a reason to figure out why your values are missing, one option is to just remove any rows or columns that contain missing values. (Note: I don't generally recommend this approach for important projects! It's usually worth taking the time to go through your data and really look at all the columns with missing values one by one to get to know your dataset.)
If you're sure you want to drop rows with missing values, pandas does have a handy function, `dropna()`, to help you do this. Let's try it out on our NFL dataset!
# remove all the rows that contain a missing value
nfl_data.dropna()
Date | GameID | Drive | qtr | down | time | TimeUnder | TimeSecs | PlayTimeDiff | SideofField | ... | yacEPA | Home_WP_pre | Away_WP_pre | Home_WP_post | Away_WP_post | Win_Prob | WPA | airWPA | yacWPA | Season |
---|
0 rows × 102 columns
Oh dear, it looks like that's removed all our data! 😱 This is because every row in our dataset had at least one missing value. We might have better luck removing all the columns that have at least one missing value instead.
# remove all columns with at least one missing value
columns_with_na_dropped = nfl_data.dropna(axis=1)
columns_with_na_dropped.head()
Date | GameID | Drive | qtr | TimeUnder | ydstogo | ydsnet | PlayAttempted | Yards.Gained | sp | ... | AwayTeam | Timeout_Indicator | posteam_timeouts_pre | HomeTimeouts_Remaining_Pre | AwayTimeouts_Remaining_Pre | HomeTimeouts_Remaining_Post | AwayTimeouts_Remaining_Post | ExPoint_Prob | TwoPoint_Prob | Season | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2009-09-10 | 2009091000 | 1 | 1 | 15 | 0 | 0 | 1 | 39 | 0 | ... | TEN | 0 | 3 | 3 | 3 | 3 | 3 | 0.0 | 0.0 | 2009 |
1 | 2009-09-10 | 2009091000 | 1 | 1 | 15 | 10 | 5 | 1 | 5 | 0 | ... | TEN | 0 | 3 | 3 | 3 | 3 | 3 | 0.0 | 0.0 | 2009 |
2 | 2009-09-10 | 2009091000 | 1 | 1 | 15 | 5 | 2 | 1 | -3 | 0 | ... | TEN | 0 | 3 | 3 | 3 | 3 | 3 | 0.0 | 0.0 | 2009 |
3 | 2009-09-10 | 2009091000 | 1 | 1 | 14 | 8 | 2 | 1 | 0 | 0 | ... | TEN | 0 | 3 | 3 | 3 | 3 | 3 | 0.0 | 0.0 | 2009 |
4 | 2009-09-10 | 2009091000 | 1 | 1 | 14 | 8 | 2 | 1 | 0 | 0 | ... | TEN | 0 | 3 | 3 | 3 | 3 | 3 | 0.0 | 0.0 | 2009 |
5 rows × 37 columns
# just how much data did we lose?
print("Columns in original dataset: %d \n" % nfl_data.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])
Columns in original dataset: 102
Columns with na's dropped: 37
We've lost quite a bit of data, but at this point we have successfully removed all the `NaN`s from our data.
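Dropping whole rows or columns is all-or-nothing; `dropna()` also takes `thresh` and `subset` arguments that offer a middle ground. A small sketch on a toy dataframe (not the NFL data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

# keep only rows that have at least 2 non-missing values
rows_kept = df.dropna(thresh=2)

# drop a row only when one specific, important column is missing
rows_by_subset = df.dropna(subset=["a"])

print(len(rows_kept), len(rows_by_subset))  # 2 2
```

On a dataset like ours, where every row has at least one `NaN`, these options let you discard only the rows (or columns) that are missing too much to be useful.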
Filling in missing values automatically
Another option is to try and fill in the missing values. For this next bit, I'm getting a small sub-section of the NFL data so that it will print well.
# get a small subset of the NFL dataset
subset_nfl_data = nfl_data.loc[:, 'EPA':'Season'].head()
subset_nfl_data
EPA | airEPA | yacEPA | Home_WP_pre | Away_WP_pre | Home_WP_post | Away_WP_post | Win_Prob | WPA | airWPA | yacWPA | Season | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2.014474 | NaN | NaN | 0.485675 | 0.514325 | 0.546433 | 0.453567 | 0.485675 | 0.060758 | NaN | NaN | 2009 |
1 | 0.077907 | -1.068169 | 1.146076 | 0.546433 | 0.453567 | 0.551088 | 0.448912 | 0.546433 | 0.004655 | -0.032244 | 0.036899 | 2009 |
2 | -1.402760 | NaN | NaN | 0.551088 | 0.448912 | 0.510793 | 0.489207 | 0.551088 | -0.040295 | NaN | NaN | 2009 |
3 | -1.712583 | 3.318841 | -5.031425 | 0.510793 | 0.489207 | 0.461217 | 0.538783 | 0.510793 | -0.049576 | 0.106663 | -0.156239 | 2009 |
4 | 2.097796 | NaN | NaN | 0.461217 | 0.538783 | 0.558929 | 0.441071 | 0.461217 | 0.097712 | NaN | NaN | 2009 |
We can use the pandas `fillna()` function to fill in missing values in a dataframe for us. One option we have is to specify what we want the `NaN` values to be replaced with. Here, I'm saying that I would like to replace all the `NaN` values with 0.
# replace all NA's with 0
subset_nfl_data.fillna(0)
EPA | airEPA | yacEPA | Home_WP_pre | Away_WP_pre | Home_WP_post | Away_WP_post | Win_Prob | WPA | airWPA | yacWPA | Season | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2.014474 | 0.000000 | 0.000000 | 0.485675 | 0.514325 | 0.546433 | 0.453567 | 0.485675 | 0.060758 | 0.000000 | 0.000000 | 2009 |
1 | 0.077907 | -1.068169 | 1.146076 | 0.546433 | 0.453567 | 0.551088 | 0.448912 | 0.546433 | 0.004655 | -0.032244 | 0.036899 | 2009 |
2 | -1.402760 | 0.000000 | 0.000000 | 0.551088 | 0.448912 | 0.510793 | 0.489207 | 0.551088 | -0.040295 | 0.000000 | 0.000000 | 2009 |
3 | -1.712583 | 3.318841 | -5.031425 | 0.510793 | 0.489207 | 0.461217 | 0.538783 | 0.510793 | -0.049576 | 0.106663 | -0.156239 | 2009 |
4 | 2.097796 | 0.000000 | 0.000000 | 0.461217 | 0.538783 | 0.558929 | 0.441071 | 0.461217 | 0.097712 | 0.000000 | 0.000000 | 2009 |
I could also be a bit more savvy and replace missing values with whatever value comes directly after it in the same column. (This makes a lot of sense for datasets where the observations have some sort of logical order to them.)
# replace each NA with the value that comes directly after it in the same column,
# then replace any remaining NA's with 0
# (fillna(method='bfill') did the same thing in older pandas, but the method
# argument is deprecated in favor of bfill())
subset_nfl_data.bfill(axis=0).fillna(0)
EPA | airEPA | yacEPA | Home_WP_pre | Away_WP_pre | Home_WP_post | Away_WP_post | Win_Prob | WPA | airWPA | yacWPA | Season | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2.014474 | -1.068169 | 1.146076 | 0.485675 | 0.514325 | 0.546433 | 0.453567 | 0.485675 | 0.060758 | -0.032244 | 0.036899 | 2009 |
1 | 0.077907 | -1.068169 | 1.146076 | 0.546433 | 0.453567 | 0.551088 | 0.448912 | 0.546433 | 0.004655 | -0.032244 | 0.036899 | 2009 |
2 | -1.402760 | 3.318841 | -5.031425 | 0.551088 | 0.448912 | 0.510793 | 0.489207 | 0.551088 | -0.040295 | 0.106663 | -0.156239 | 2009 |
3 | -1.712583 | 3.318841 | -5.031425 | 0.510793 | 0.489207 | 0.461217 | 0.538783 | 0.510793 | -0.049576 | 0.106663 | -0.156239 | 2009 |
4 | 2.097796 | 0.000000 | 0.000000 | 0.461217 | 0.538783 | 0.558929 | 0.441071 | 0.461217 | 0.097712 | 0.000000 | 0.000000 | 2009 |
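Backfilling is just one choice. Another common quick fix for numeric columns whose values simply weren't recorded is to fill each column's `NaN`s with that column's mean, a sketch of which is below on a toy column (careful imputation deserves more thought than this, and per-column strategies usually beat a blanket rule):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "airEPA": [np.nan, -1.0, np.nan, 3.0],
    "Season": [2009, 2009, 2009, 2009],
})

# fill each numeric column's NaN's with that column's mean
filled = df.fillna(df.mean(numeric_only=True))
print(filled["airEPA"].tolist())  # [1.0, -1.0, 1.0, 3.0]
```

Mean imputation keeps the column average unchanged, but it shrinks the variance and can distort relationships between columns, so treat it as a quick-and-dirty default rather than a final answer.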
Your turn
Write your own code to deal with missing values in a dataset of building permits issued in San Francisco.