Flashield's Blog

Just For My Daily Diary

03.course-summary-functions-and-maps [Summary Functions and Maps]

Introduction

In the last tutorial, we learned how to select relevant data out of a DataFrame or Series. Plucking the right data out of our data representation is critical to getting work done, as we demonstrated in the exercises.

However, the data does not always come out of memory in the format we want right off the bat. Sometimes we have to do some more work ourselves to reformat it for the task at hand. This tutorial covers different operations we can apply to our data to get the input "just right".

To start the exercise for this topic, please click here.

We'll use the Wine Magazine data for demonstration.


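import pandas as pd

# Show at most 5 rows whenever a DataFrame or Series is displayed.
pd.set_option('display.max_rows', 5)

# Load the Wine Magazine reviews, using the first CSV column as the index.
reviews = pd.read_csv("../00 datasets/zynicide/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
reviews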
| | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
| 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
| 129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |

129971 rows × 13 columns

Summary functions

Pandas provides many simple "summary functions" (not an official name) which restructure the data in some useful way. For example, consider the describe() method:

reviews.points.describe()
count    129971.000000
mean         88.447138
             ...      
75%          91.000000
max         100.000000
Name: points, Length: 8, dtype: float64

This method generates a high-level summary of the attributes of the given column. It is type-aware, meaning that its output changes based on the data type of the input. The output above only makes sense for numerical data; for string data, here's what we get:

reviews.taster_name.describe()
count         103727
unique            19
top       Roger Voss
freq           25514
Name: taster_name, dtype: object
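The type-aware behaviour can be tried without the full dataset. The following is a minimal sketch on a small made-up DataFrame (the `toy` frame and its values are assumptions for illustration, not rows from the wine data):

```python
import pandas as pd

# Hypothetical toy data standing in for the wine reviews.
toy = pd.DataFrame({
    "points": [87, 87, 90, 92],
    "taster_name": ["Roger Voss", "Roger Voss", "Kerin O'Keefe", None],
})

# Numeric column: count, mean, std, min, quartiles and max.
numeric_summary = toy["points"].describe()

# String column: count (non-missing), unique, top (most frequent) and freq.
text_summary = toy["taster_name"].describe()

print(numeric_summary)
print(text_summary)
```

Note that the missing taster name is not counted, which is also why the count for taster_name above is lower than the number of rows in reviews.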

If you want a particular simple summary statistic about a column in a DataFrame or about a Series, there is usually a pandas method that provides it.

For example, to see the mean of the points allotted (i.e. how well an averagely rated wine scores), we can use the mean() method:

reviews.points.mean()
88.44713820775404

To see a list of unique values, we can use the unique() method:

reviews.taster_name.unique()
array(['Kerin O’Keefe', 'Roger Voss', 'Paul Gregutt',
       'Alexander Peartree', 'Michael Schachner', 'Anna Lee C. Iijima',
       'Virginie Boone', 'Matt Kettmann', nan, 'Sean P. Sullivan',
       'Jim Gordon', 'Joe Czerwinski', 'Anne Krebiehl\xa0MW',
       'Lauren Buzzeo', 'Mike DeSimone', 'Jeff Jenssen',
       'Susan Kostrzewa', 'Carrie Dykes', 'Fiona Adams',
       'Christina Pickard'], dtype=object)

To see a list of unique values and how often they occur in the dataset, we can use the value_counts() method:

reviews.taster_name.value_counts()
taster_name
Roger Voss           25514
Michael Schachner    15134
                     ...  
Fiona Adams             27
Christina Pickard        6
Name: count, Length: 19, dtype: int64
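The three methods above can be compared side by side on a small example. This is a sketch with made-up values (the `tasters` and `points` Series are assumptions for illustration):

```python
import pandas as pd

# Hypothetical values; one taster name is missing on purpose.
tasters = pd.Series(
    ["Roger Voss", "Kerin O'Keefe", "Roger Voss", None, "Roger Voss"],
    name="taster_name",
)
points = pd.Series([87, 90, 88, 91, 89], name="points")

average = points.mean()          # arithmetic mean of the column
distinct = tasters.unique()      # distinct values, missing value included
counts = tasters.value_counts()  # frequencies, most common first, missing dropped

# Pass dropna=False to count the missing entries as well.
counts_with_missing = tasters.value_counts(dropna=False)

print(average)
print(distinct)
print(counts)
```

This difference in how missing values are treated is consistent with the outputs above, where unique() lists nan among the tasters while value_counts() reports 19 entries.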

Maps

A map is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often need to create new representations from existing data, or to transform data from its current format into the format we want later. Maps handle this work, which makes them extremely important for getting your work done!

There are two mapping methods that you will use often.

map() is the first, and slightly simpler, one. For example, suppose that we wanted to re-center the scores the wines received so that their mean is 0. We can do this as follows:

review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)
0        -1.447138
1        -1.447138
            ...   
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64

The function you pass to map() should expect a single value from the Series (a point value, in the above example), and return a transformed version of that value. map() returns a new Series where all the values have been transformed by your function.
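To make the one-value-in, one-value-out contract concrete, here is a minimal sketch on a four-element Series (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical scores; their mean is 89.0.
points = pd.Series([87, 87, 90, 92], name="points")
points_mean = points.mean()

# The lambda receives one score at a time and returns the shifted score.
centered = points.map(lambda p: p - points_mean)

print(centered.tolist())  # [-2.0, -2.0, 1.0, 3.0]
```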

apply() is the equivalent method if we want to transform a whole DataFrame by calling a custom function on each row.

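def remean_points(row):
    # Each row arrives as a Series; shift its points value by the overall mean.
    row.points = row.points - review_points_mean
    return row

# axis='columns' passes one row at a time to the function.
reviews.apply(remean_points, axis='columns')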
| | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
| 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | -1.447138 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | -1.447138 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 1.552862 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
| 129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 1.552862 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |

129971 rows × 13 columns

If we had called reviews.apply() with axis='index', then instead of passing a function to transform each row, we would need to pass a function to transform each column.
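The difference between the two axis values is easiest to see on a small numeric frame. This is a sketch under assumed data (the `scores` frame and the two lambdas are hypothetical):

```python
import pandas as pd

# Hypothetical numeric data with two columns.
scores = pd.DataFrame({"points": [87, 90, 92], "price": [15.0, 32.0, 21.0]})

# axis='columns': the function receives each ROW as a Series
# indexed by column name, so one result is produced per row.
points_per_dollar = scores.apply(
    lambda row: row["points"] / row["price"], axis="columns"
)

# axis='index': the function receives each COLUMN as a Series,
# so one result is produced per column.
range_per_column = scores.apply(lambda col: col.max() - col.min(), axis="index")

print(points_per_dollar)
print(range_per_column)
```

The first result has one entry per row of scores, while the second has one entry per column ("points" and "price").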

Note that map() and apply() return new, transformed Series and DataFrames, respectively. They don't modify the original data they're called on. If we look at the first row of reviews, we can see that it still has its original points value.

reviews.head(1)
| | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
| 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
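Because the result is a new object, it has to be stored explicitly if it is needed later. A minimal sketch, assuming a toy frame and a hypothetical new column name `points_centered`:

```python
import pandas as pd

# Hypothetical data; the mean of points is 90.0.
toy = pd.DataFrame({"points": [87, 90, 93]})
points_mean = toy["points"].mean()

# map() returns a new Series and leaves toy["points"] untouched.
centered = toy["points"].map(lambda p: p - points_mean)

# To keep the transformed values, assign them to a column.
toy["points_centered"] = centered

print(toy)
```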

Pandas provides many common mapping operations as built-ins. For example, here's a faster way of re-centering our points column:

review_points_mean = reviews.points.mean()
reviews.points - review_points_mean
0        -1.447138
1        -1.447138
            ...   
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64

In this code we are performing an operation between many values on the left-hand side (everything in the Series) and a single value on the right-hand side (the mean value). Pandas looks at this expression and figures out that we must mean to subtract that mean value from every value in the dataset.
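As a quick check that the operator form and the map() form agree, the following sketch compares both on a made-up Series (the values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical scores with a mean of 90.0.
points = pd.Series([87, 90, 93], name="points")
points_mean = points.mean()

# The single value on the right is applied to every element on the left.
via_operator = points - points_mean

# Same result, computed one element at a time through map().
via_map = points.map(lambda p: p - points_mean)

print(via_operator.tolist())  # [-3.0, 0.0, 3.0]
print(via_operator.equals(via_map))
```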

Pandas will also understand what to do if we perform these operations between Series of equal length. For example, an easy way of combining country and region information in the dataset would be to do the following:

reviews.country + " - " + reviews.region_1
0            Italy - Etna
1                     NaN
               ...       
129969    France - Alsace
129970    France - Alsace
Length: 129971, dtype: object
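Row 1 in the output above is NaN because that wine has no region_1 value, and a missing value on either side makes the combined result missing. The sketch below reproduces this on a toy frame; filling the gap with the placeholder "Unknown" is an assumption for illustration, not something the original tutorial does:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the second one has no region, like row 1 of reviews.
toy = pd.DataFrame({
    "country": ["Italy", "Portugal", "France"],
    "region_1": ["Etna", np.nan, "Alsace"],
})

# Element-wise concatenation; the missing region propagates to the result.
combined = toy["country"] + " - " + toy["region_1"]

# One possible workaround: replace missing regions before combining.
filled = toy["country"] + " - " + toy["region_1"].fillna("Unknown")

print(combined)
print(filled)
```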

These operators are faster than map() or apply() because they use speedups built into pandas. All of the standard Python operators (>, <, ==, and so on) work in this manner.

However, they are not as flexible as map() or apply(), which can do more advanced things, like applying conditional logic, which cannot be done with addition and subtraction alone.
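As an illustration of conditional logic, here is a sketch that converts scores into a star rating with map(). The `to_stars` function and its thresholds (95 and 85) are hypothetical choices made for this example:

```python
import pandas as pd

# Hypothetical scores.
points = pd.Series([84, 88, 95, 100], name="points")

def to_stars(p):
    # Assumed thresholds: 95+ earns 3 stars, 85 to 94 earns 2, the rest 1.
    if p >= 95:
        return 3
    elif p >= 85:
        return 2
    return 1

# The branching runs once per value, which plain arithmetic cannot express.
stars = points.map(to_stars)

print(stars.tolist())  # [1, 2, 3, 3]
```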

Your turn

If you haven't started the exercise, you can get started here.
