Introduction
In this tutorial, you'll learn how to investigate data types within a DataFrame or Series. You'll also learn how to find and replace entries.
To start the exercise for this topic, please click here.
Dtypes
The data type for a column in a DataFrame or a Series is known as the dtype.
You can use the dtype property to get the type of a specific column. For instance, we can get the dtype of the price column in the reviews DataFrame:
import pandas as pd
reviews = pd.read_csv("../00 datasets/zynicide/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option('display.max_rows', 5)
reviews.price.dtype
dtype('float64')
Alternatively, the dtypes property returns the dtype of every column in the DataFrame:
reviews.dtypes
country object
description object
...
variety object
winery object
Length: 13, dtype: object
Data types tell us something about how pandas is storing the data internally. float64 means that it's using a 64-bit floating point number; int64 means a similarly sized integer instead, and so on.
One peculiarity to keep in mind (and on display very clearly here) is that columns consisting entirely of strings do not get their own type; they are instead given the object type.
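As a small self-contained illustration of dtype and dtypes, the sketch below builds a hypothetical three-row DataFrame (the values are invented and merely mimic a few columns of reviews). The numeric columns should come out as int64 and float64; the string column is typically reported as object, although newer pandas releases may show a dedicated string dtype instead.

```python
import pandas as pd

# Hypothetical toy data standing in for a few columns of the reviews DataFrame.
toy = pd.DataFrame({
    "points": [87, 90, 85],                 # whole numbers
    "price": [15.0, 32.5, 20.0],            # decimal numbers
    "country": ["Italy", "US", "France"],   # strings
})

# dtype works on a single column (a Series).
points_dtype = toy.points.dtype
price_dtype = toy.price.dtype
print(points_dtype)
print(price_dtype)

# dtypes works on the whole DataFrame and returns one entry per column.
print(toy.dtypes)
```

Printing toy.dtypes is a quick first check after loading any file, since a numeric column that was read as text usually signals stray characters in the data.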
It's possible to convert a column of one type into another, wherever such a conversion makes sense, by using the astype() function. For example, we may transform the points column from its existing int64 data type into a float64 data type:
reviews.points.astype('float64')
0 87.0
1 87.0
...
129969 90.0
129970 90.0
Name: points, Length: 129971, dtype: float64
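To make "wherever such a conversion makes sense" concrete, here is a minimal sketch on invented data. It converts an integer Series to float64, then attempts a conversion that does not make sense: casting a column containing NaN to int64. The Series names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Integer to float always works: every int64 value has a float64 counterpart.
points = pd.Series([87, 90, 85], name="points")
as_float = points.astype("float64")
print(as_float.dtype)

# Float to integer fails when a value is missing, because NaN has no
# integer representation. pandas raises a ValueError (or a subclass of it).
with_gap = pd.Series([15.0, np.nan, 20.0], name="price")
try:
    with_gap.astype("int64")
    conversion_failed = False
except ValueError as err:
    conversion_failed = True
    print("astype failed:", err)
```

Note that astype() returns a new Series and leaves the original untouched, so the result has to be assigned back if the change should persist.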
A DataFrame or Series index has its own dtype, too:
reviews.index.dtype
dtype('int64')
Pandas also supports more exotic data types, such as categorical data and time series data. Because these data types are used more rarely, we will omit them until a much later section of this tutorial.
Missing data
Entries with missing values are given the value NaN, short for "Not a Number". For technical reasons these NaN values are always of the float64 dtype.
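One visible consequence of NaN being a floating point value can be sketched with two invented Series: the same whole numbers are stored as int64 when complete, but as float64 as soon as one entry is missing.

```python
import numpy as np
import pandas as pd

# All values present: pandas stores the data as integers.
complete = pd.Series([87, 90, 85])
print(complete.dtype)

# One value missing: NaN is a float, so the whole Series becomes float64.
with_gap = pd.Series([87, np.nan, 85])
print(with_gap.dtype)
print(with_gap)
```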
Pandas provides some methods specific to missing data. To select NaN entries you can use pd.isnull() (or its companion pd.notnull()). This is meant to be used like this:
reviews[pd.isnull(reviews.country)]
| | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
913 | NaN | Amber in color, this wine has aromas of peach ... | Asureti Valley | 87 | 30.0 | NaN | NaN | NaN | Mike DeSimone | @worldwineguys | Gotsa Family Wines 2014 Asureti Valley Chinuri | Chinuri | Gotsa Family Wines |
3131 | NaN | Soft, fruity and juicy, this is a pleasant, si... | Partager | 83 | NaN | NaN | NaN | NaN | Roger Voss | @vossroger | Barton & Guestier NV Partager Red | Red Blend | Barton & Guestier |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129590 | NaN | A blend of 60% Syrah, 30% Cabernet Sauvignon a... | Shah | 90 | 30.0 | NaN | NaN | NaN | Mike DeSimone | @worldwineguys | Büyülübağ 2012 Shah Red | Red Blend | Büyülübağ |
129900 | NaN | This wine offers a delightful bouquet of black... | NaN | 91 | 32.0 | NaN | NaN | NaN | Mike DeSimone | @worldwineguys | Psagot 2014 Merlot | Merlot | Psagot |
63 rows × 13 columns
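The same selection pattern can be tried on a small invented DataFrame, which makes it easier to see which rows each mask keeps. The column values here are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing countries.
toy = pd.DataFrame({
    "country": ["Italy", np.nan, "US", np.nan],
    "points": [87, 83, 90, 91],
})

# pd.isnull() returns a boolean Series; indexing with it keeps the True rows.
missing_country = toy[pd.isnull(toy.country)]
print(missing_country)

# pd.notnull() is the complement: it keeps the rows that do have a country.
known_country = toy[pd.notnull(toy.country)]
print(known_country)
```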
Replacing missing values is a common operation. Pandas provides a really handy method for this problem: fillna(). fillna() provides a few different strategies for handling such data. For example, we can simply replace each NaN with an "Unknown":
reviews.region_2.fillna("Unknown")
0 Unknown
1 Unknown
...
129969 Unknown
129970 Unknown
Name: region_2, Length: 129971, dtype: object
Or we could fill each missing value with the first non-null value that appears sometime after the given record in the dataset. This is known as the backfill strategy.
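The tutorial does not show a call for the backfill strategy, so the following is only a sketch of one way to do it, using the Series.bfill() method on invented prices. Each NaN takes the next non-null value below it; a NaN with nothing after it stays missing.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with gaps in the middle and at the end.
price = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Backfill: each gap is filled with the next non-null value that follows it.
backfilled = price.bfill()
print(backfilled)

# The last entry has no later value to copy from, so it remains NaN.
trailing_still_missing = bool(pd.isna(backfilled.iloc[-1]))
print(trailing_still_missing)
```

Backfilling assumes that neighboring rows are related, for example in ordered or time-based data; on an arbitrarily ordered dataset a constant fill is often the safer choice.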
Alternatively, we may have a non-null value that we would like to replace. For example, suppose that since this dataset was published, reviewer Kerin O'Keefe has changed her Twitter handle from @kerinokeefe to @kerino. One way to reflect this in the dataset is using the replace() method:
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")
0 @kerino
1 @vossroger
...
129969 @vossroger
129970 @vossroger
Name: taster_twitter_handle, Length: 129971, dtype: object
The replace() method is worth mentioning here because it's handy for replacing missing data which is given some kind of sentinel value in the dataset: things like "Unknown", "Undisclosed", "Invalid", and so on.
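As a rough sketch of the sentinel case, the example below turns several placeholder strings into real NaN values in one replace() call, so that the missing-data tools described earlier can see them. The Series contents and the list of sentinels are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column where missing data was recorded as placeholder text.
region = pd.Series(
    ["Napa", "Unknown", "Sonoma", "Invalid", "Undisclosed"], name="region"
)

# replace() accepts a list of values to look for and a single replacement.
sentinels = ["Unknown", "Undisclosed", "Invalid"]
cleaned = region.replace(sentinels, np.nan)
print(cleaned)

# After the replacement, the placeholders count as missing values.
n_missing = int(cleaned.isna().sum())
print(n_missing)
```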
Your turn
If you haven't started the exercise, you can get started here.