Flashield's Blog

Just For My Daily Diary

04.course-grouping-and-sorting [Grouping and Sorting]

Introduction

Maps let us transform the data in a DataFrame or Series one value at a time across an entire column. Often, however, we want to group the data first and then do something specific to each group.

As you'll learn, we do this with the groupby() operation. We'll also cover some additional topics, such as more complex ways to index your DataFrames and how to sort your data.

To start the exercise for this topic, please click here.

Groupwise analysis

One function we've used heavily so far is value_counts(). We can replicate what value_counts() does with the following:


import pandas as pd
reviews = pd.read_csv("../00 datasets/zynicide/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option("display.max_rows", 5)
reviews.groupby('points').points.count()
points
80     397
81     692
      ... 
99      33
100     19
Name: points, Length: 21, dtype: int64

groupby() created groups of reviews that gave the same point value to their wines. Then, for each of these groups, we grabbed the points column and counted how many times it appeared. value_counts() is just a shortcut for this groupby() operation; the only visible difference is ordering, since value_counts() sorts by count while groupby() sorts by group key.
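
The equivalence is easy to check on a small frame. The sketch below uses a made-up reviews_toy DataFrame (a hypothetical stand-in for the wine reviews, not the real dataset) and compares the two approaches after putting them in the same order:

```python
import pandas as pd

# Hypothetical toy data standing in for the wine reviews.
reviews_toy = pd.DataFrame({"points": [88, 90, 88, 92, 90, 88]})

# Count the rows in each points group.
by_groupby = reviews_toy.groupby("points").points.count()

# value_counts() orders by frequency, so sort by index to line the two results up.
by_value_counts = reviews_toy.points.value_counts().sort_index()

print(by_groupby.to_dict())
print(by_value_counts.to_dict())
```

Both calls report three reviews at 88 points, two at 90, and one at 92.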

We can use any of the summary functions we've used before with the grouped data. For example, to get the cheapest wine in each point value category, we can do the following:

reviews.groupby('points').price.min()
points
80      5.0
81      5.0
       ... 
99     44.0
100    80.0
Name: price, Length: 21, dtype: float64

You can think of each group we generate as a slice of our DataFrame containing only the rows whose values match. The apply() method gives us direct access to each of these DataFrames, and we can then manipulate the data in any way we see fit. For example, here's one way of selecting the name of the first wine reviewed from each winery in the dataset:

# include_groups=False (pandas 2.2 or later) keeps the grouping column out of each group's DataFrame and avoids the DeprecationWarning
reviews.groupby('winery').apply(lambda df: df.title.iloc[0], include_groups=False)

winery
1+1=3                          1+1=3 NV Rosé Sparkling (Cava)
10 Knots                 10 Knots 2010 Viognier (Paso Robles)
                                  ...                        
àMaurice    àMaurice 2013 Fred Estate Syrah (Walla Walla V...
Štoka                         Štoka 2009 Izbrani Teran (Kras)
Length: 16757, dtype: object
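
To see that each group really is a slice of the DataFrame, we can iterate over the groupby object directly. The sketch below uses a hypothetical reviews_toy frame with made-up wineries. It also shows one alternative way to take the first title per winery: selecting the column before calling apply(), so each group arrives as a Series and the grouping column is never passed to the function.

```python
import pandas as pd

# Hypothetical toy data; winery names and titles are invented for illustration.
reviews_toy = pd.DataFrame({
    "winery": ["Alpha", "Beta", "Alpha", "Beta", "Gamma"],
    "title": ["Alpha 2010 Red", "Beta 2012 White", "Alpha 2011 Rosé",
              "Beta 2013 Red", "Gamma NV Brut"],
})

# Iterating a groupby yields (group key, sub-DataFrame) pairs.
group_sizes = {name: len(group) for name, group in reviews_toy.groupby("winery")}

# Selecting the column first hands each group to the lambda as a Series.
first_titles = reviews_toy.groupby("winery")["title"].apply(lambda s: s.iloc[0])

print(group_sizes)
print(first_titles.to_dict())
```

Rows keep their original order within each group, so iloc[0] picks the first review that appears for each winery.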

For even more fine-grained control, you can also group by more than one column. For example, here's how we would pick out the best wine by country and province:

reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])
/tmp/ipykernel_21643/1865732994.py:1: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass include_groups=False to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
  reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
country province
Argentina Mendoza Province Argentina If the color doesn't tell the full story, the ... Nicasia Vineyard 97 120.0 Mendoza Province Mendoza NaN Michael Schachner @wineschach Bodega Catena Zapata 2006 Nicasia Vineyard Mal... Malbec Bodega Catena Zapata
Other Argentina Take note, this could be the best wine Colomé ... Reserva 95 90.0 Other Salta NaN Michael Schachner @wineschach Colomé 2010 Reserva Malbec (Salta) Malbec Colomé
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Uruguay San Jose Uruguay Baked, sweet, heavy aromas turn earthy with ti... El Preciado Gran Reserva 87 50.0 San Jose NaN NaN Michael Schachner @wineschach Castillo Viejo 2005 El Preciado Gran Reserva R... Red Blend Castillo Viejo
Uruguay Uruguay Cherry and berry aromas are ripe, healthy and ... Blend 002 Limited Edition 91 22.0 Uruguay NaN NaN Michael Schachner @wineschach Narbona NV Blend 002 Limited Edition Tannat-Ca... Tannat-Cabernet Franc Narbona

425 rows × 13 columns
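
The same "best row per group" idea can also be written without apply(). The sketch below is an alternative, not the method used above: it asks each group for the row label of its highest score with idxmax(), then looks those labels up in the original frame. The reviews_toy data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data; titles are placeholders.
reviews_toy = pd.DataFrame({
    "country": ["US", "US", "US", "Chile", "Chile"],
    "province": ["California", "California", "Oregon", "Maipo", "Maipo"],
    "points": [90, 95, 88, 85, 91],
    "title": ["A", "B", "C", "D", "E"],
})

# One row label per (country, province) group: the label of the top-scoring wine.
best_labels = reviews_toy.groupby(["country", "province"]).points.idxmax()

# Look the winning rows up in the original DataFrame.
best = reviews_toy.loc[best_labels]

print(best)
```

The result keeps the original row labels and columns. Unlike the apply() version, it is not indexed by country and province.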

Another groupby() method worth mentioning is agg(), which lets you run several different functions on your DataFrame in a single call. For example, we can generate a simple statistical summary of the dataset as follows:

# reviews.groupby(['country']).price.agg([len, min, max])
reviews.groupby(['country']).price.agg([len, 'min', 'max'])
            len   min    max
country
Argentina  3800   4.0  230.0
Armenia       2  14.0   15.0
...         ...   ...    ...
Ukraine      14   6.0   13.0
Uruguay     109  10.0  130.0

43 rows × 3 columns

Used well, groupby() lets you answer many per-group questions about your dataset, such as counts, extremes, and summaries, in a single line.
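
agg() also accepts keyword arguments that name the output columns, which can make a summary easier to read. The sketch below, on a hypothetical reviews_toy frame, additionally contrasts "size" (every row, like len) with "count" (only non-missing values). The column names n_rows, n_priced, cheapest, and priciest are invented for this example.

```python
import pandas as pd

# Hypothetical toy data; one US price is missing on purpose.
reviews_toy = pd.DataFrame({
    "country": ["US", "US", "Chile", "Chile", "Chile"],
    "price": [20.0, None, 10.0, 15.0, 12.0],
})

summary = reviews_toy.groupby("country").price.agg(
    n_rows="size",     # every row in the group, like len
    n_priced="count",  # only rows with a non-missing price
    cheapest="min",
    priciest="max",
)

print(summary)
```

For the US group, n_rows is 2 but n_priced is 1, because one of its prices is missing.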

Multi-indexes

In all of the examples we've seen so far, we've been working with DataFrame or Series objects with a single-label index. groupby() is slightly different in that, depending on the operation we run, it will sometimes produce what is called a multi-index.

A multi-index differs from a regular index in that it has multiple levels. For example:

countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed
                             len
country   province
Argentina Mendoza Province  3264
          Other              536
...                          ...
Uruguay   San Jose             3
          Uruguay             24

425 rows × 1 columns

mi = countries_reviewed.index
type(mi)
pandas.core.indexes.multi.MultiIndex

Multi-indices have several methods for dealing with their tiered structure that single-level indices lack. They also require one label per level (two, in this example) to retrieve a value. Dealing with multi-index output is a common "gotcha" for users new to pandas.

The use cases for a multi-index, along with instructions on using them, are detailed in the MultiIndex / Advanced Selection section of the pandas documentation.
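
As a rough illustration of what "one label per level" means in practice, the sketch below builds a small two-level result from a hypothetical toy frame and retrieves values from it in three ways. It uses "count" rather than len for the aggregation, and the data is invented.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "country": ["Argentina", "Argentina", "Argentina", "Uruguay"],
    "province": ["Mendoza Province", "Mendoza Province", "Other", "San Jose"],
    "description": ["a", "b", "c", "d"],
})
counts = toy.groupby(["country", "province"]).description.agg(["count"])

# A full key is a tuple with one label per index level.
n_other = counts.loc[("Argentina", "Other"), "count"]

# A partial key (outer level only) returns every row under that label.
argentina = counts.loc["Argentina"]

# A single level can be pulled out of the index by name.
provinces = counts.index.get_level_values("province")

print(n_other)
print(argentina)
print(list(provinces))
```

Selecting with only the outer label drops that level, so argentina is indexed by province alone.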

In general, however, the multi-index method you will use most often is the one for converting back to a regular index, reset_index():

countries_reviewed.reset_index()
       country          province   len
0    Argentina  Mendoza Province  3264
1    Argentina             Other   536
...        ...               ...   ...
423    Uruguay          San Jose     3
424    Uruguay           Uruguay    24

425 rows × 3 columns
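
reset_index() does not have to flatten every level. As a small sketch on hypothetical toy data, the level argument moves only the named level into a column and leaves the rest of the index in place:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "country": ["Argentina", "Argentina", "Argentina", "Uruguay"],
    "province": ["Mendoza Province", "Mendoza Province", "Other", "San Jose"],
    "description": ["a", "b", "c", "d"],
})
counts = toy.groupby(["country", "province"]).description.agg(["count"])

# Move both index levels into ordinary columns.
flat = counts.reset_index()

# Move only the province level; country stays as the index.
partly_flat = counts.reset_index(level="province")

print(flat.columns.tolist())
print(partly_flat.index.name, partly_flat.columns.tolist())
```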

Sorting

Looking again at countries_reviewed, we can see that grouping returns data in index order, not in value order. That is, when outputting the result of a groupby(), the order of the rows depends on the values in the index, not on the data.

To get the data in the order we want, we can sort it ourselves. The sort_values() method is handy for this.

countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by='len')
    country               province    len
179  Greece  Muscat of Kefallonian      1
192  Greece          Sterea Ellada      1
...     ...                    ...    ...
415      US             Washington   8639
392      US             California  36247

425 rows × 3 columns

sort_values() defaults to an ascending sort, where the lowest values go first. Most of the time, however, we want a descending sort, where the highest values go first. To get one, pass ascending=False:

countries_reviewed.sort_values(by='len', ascending=False)
    country    province    len
392      US  California  36247
415      US  Washington   8639
...     ...         ...    ...
63    Chile     Coelemu      1
149  Greece      Beotia      1

425 rows × 3 columns
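
Here is a minimal sketch of both sort directions on a hypothetical toy frame (the counts are made up). Note that each row keeps its original index label after sorting, which is why the labels in the output above are out of order.

```python
import pandas as pd

# Hypothetical toy data with invented counts.
toy = pd.DataFrame({
    "province": ["California", "Washington", "Coelemu", "Oregon"],
    "len": [30, 8, 1, 5],
})

# Default: ascending, lowest values first.
ascending = toy.sort_values(by="len")

# Highest values first.
descending = toy.sort_values(by="len", ascending=False)

print(ascending)
print(descending)
```

sort_values() returns a new DataFrame and leaves toy itself untouched.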

To sort by index values, use the companion method sort_index(). It takes most of the same arguments (such as ascending, though not by) and has the same default order:

countries_reviewed.sort_index()
       country          province   len
0    Argentina  Mendoza Province  3264
1    Argentina             Other   536
...        ...               ...   ...
423    Uruguay          San Jose     3
424    Uruguay           Uruguay    24

425 rows × 3 columns
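
sort_index() is mostly useful after the rows have been reordered by something else. The sketch below, on hypothetical toy data, first sorts by value and then restores (or reverses) the index order:

```python
import pandas as pd

# Hypothetical toy data with invented counts.
toy = pd.DataFrame({
    "province": ["California", "Washington", "Coelemu", "Oregon"],
    "len": [30, 8, 1, 5],
})

# Sorting by value scrambles the index labels.
by_value = toy.sort_values(by="len", ascending=False)

# sort_index() puts the rows back in label order.
restored = by_value.sort_index()

# ascending=False works here too.
reverse_labels = by_value.sort_index(ascending=False)

print(by_value.index.tolist())
print(restored.index.tolist())
print(reverse_labels.index.tolist())
```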

Finally, know that you can sort by more than one column at a time:

countries_reviewed.sort_values(by=['country', 'len'], ascending=[False, True])
       country          province   len
423    Uruguay          San Jose     3
418    Uruguay         Atlantida     5
...        ...               ...   ...
1    Argentina             Other   536
0    Argentina  Mendoza Province  3264

425 rows × 3 columns
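
With several sort keys, the first column decides the order and later columns only break ties. The sketch below replays the call above on a four-row toy frame built from rows visible in the output. The ascending list pairs up with the by list one to one.

```python
import pandas as pd

# Four rows taken from the output shown above.
toy = pd.DataFrame({
    "country": ["Uruguay", "Argentina", "Uruguay", "Argentina"],
    "province": ["San Jose", "Other", "Atlantida", "Mendoza Province"],
    "len": [3, 536, 5, 3264],
})

# country descending first, then len ascending within each country.
ordered = toy.sort_values(by=["country", "len"], ascending=[False, True])

print(ordered)
```

Uruguay comes before Argentina because country is sorted in descending order. Within each country the smaller counts come first.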

Your turn

If you haven't started the exercise, you can get started here.
