Flashield's Blog

Just For My Daily Diary

Flashield's Blog

Just For My Daily Diary

04.exercise-grouping-and-sorting【练习:分组与排序】

This notebook is an exercise in the Pandas course. You can reference the tutorial at this link.


Introduction

介绍

In these exercises we'll apply groupwise analysis to our dataset.

在这些练习中,我们将对数据集应用分组分析。

Run the code cell below to load the data before running the exercises.

在运行练习之前,运行下面的代码单元以加载数据。

import pandas as pd

reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option("display.max_rows", 5)

from learntools.core import binder; binder.bind(globals())
from learntools.pandas.grouping_and_sorting import *
print("Setup complete.")
Setup complete.

/opt/conda/lib/python3.10/site-packages/learntools/pandas/grouping_and_sorting.py:29: FutureWarning: The provided callable  is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
  _expected = reviews.groupby('variety').price.agg([min, max])
/opt/conda/lib/python3.10/site-packages/learntools/pandas/grouping_and_sorting.py:29: FutureWarning: The provided callable  is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
  _expected = reviews.groupby('variety').price.agg([min, max])
/opt/conda/lib/python3.10/site-packages/learntools/pandas/grouping_and_sorting.py:35: FutureWarning: The provided callable  is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
  _expected = reviews.groupby('variety').price.agg([min, max]).sort_values(by=['min', 'max'], ascending=False)
/opt/conda/lib/python3.10/site-packages/learntools/pandas/grouping_and_sorting.py:35: FutureWarning: The provided callable  is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
  _expected = reviews.groupby('variety').price.agg([min, max]).sort_values(by=['min', 'max'], ascending=False)

Exercises

练习

1.

Who are the most common wine reviewers in the dataset? Create a Series whose index is the taster_twitter_handle category from the dataset, and whose values count how many reviews each person wrote.

数据集中最常见的葡萄酒评论家是谁? 创建一个Series,其索引是数据集中的taster_twitter_handle类别,其值计算每个人撰写的评论数量。

# Your code here
#reviews_written = ____
reviews_written = reviews.groupby(['taster_twitter_handle'])['taster_twitter_handle'].count()

# Check your answer
q1.check()

reviews_written

Correct:

reviews_written = reviews.groupby('taster_twitter_handle').size()

or

reviews_written = reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()
taster_twitter_handle
@AnneInVino        3685
@JoeCz             5147
                   ... 
@winewchristina       6
@worldwineguys     1005
Name: taster_twitter_handle, Length: 15, dtype: int64
# q1.hint()
q1.solution()

Solution:

reviews_written = reviews.groupby('taster_twitter_handle').size()

or

reviews_written = reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()

2.

What is the best wine I can buy for a given amount of money? Create a Series whose index is wine prices and whose values is the maximum number of points a wine costing that much was given in a review. Sort the values by price, ascending (so that 4.0 dollars is at the top and 3300.0 dollars is at the bottom).

在给定的金额下我能买到的最好的葡萄酒是什么? 创建一个系列,其索引是葡萄酒价格,其值是评论中给出的价格相同的葡萄酒的最大分数。 按价格升序对值进行排序(4.0美元位于顶部,3300.0美元位于底部)。

#best_rating_per_price = ____
best_rating_per_price = reviews.groupby(['price'])['points'].max().sort_index()

# Check your answer
q2.check()

best_rating_per_price

Correct

price
4.0       86
5.0       87
          ..
2500.0    96
3300.0    88
Name: points, Length: 390, dtype: int64
# q2.hint()
q2.solution()

Solution:

best_rating_per_price = reviews.groupby('price')['points'].max().sort_index()

3.

What are the minimum and maximum prices for each variety of wine? Create a DataFrame whose index is the variety category from the dataset and whose values are the min and max values thereof.

每种品种葡萄酒的最低和最高价格是多少? 创建一个DataFrame,其索引是数据集中的variety类别,其值是其minmax值。

#price_extremes = ____
price_extremes = reviews.groupby(['variety'])['price'].agg(['min','max'])
# Check your answer
q3.check()
price_extremes

Correct

min max
variety
Abouriou 15.0 75.0
Agiorgitiko 10.0 66.0
... ... ...
Çalkarası 19.0 19.0
Žilavka 15.0 15.0

707 rows × 2 columns

# q3.hint()
q3.solution()

Solution:

price_extremes = reviews.groupby('variety').price.agg([min, max])

4.

What are the most expensive wine varieties? Create a variable sorted_varieties containing a copy of the dataframe from the previous question where varieties are sorted in descending order based on minimum price, then on maximum price (to break ties).

最昂贵的葡萄酒品种有哪些? 创建一个变量sorted_varieties,其中包含上一个问题的DataFrame的副本,其中品种根据最低价格按降序排序,然后根据最高价格(以打破平局)。

#sorted_varieties = ____
sorted_varieties = price_extremes.sort_values(['min', 'max'], ascending=[False, False])
# Check your answer
q4.check()

sorted_varieties

Correct

min max
variety
Ramisco 495.0 495.0
Terrantez 236.0 236.0
... ... ...
Vital NaN NaN
Zelen NaN NaN

707 rows × 2 columns

# q4.hint()
q4.solution()

Solution:

sorted_varieties = price_extremes.sort_values(by=['min', 'max'], ascending=False)

5.

Create a Series whose index is reviewers and whose values is the average review score given out by that reviewer. Hint: you will need the taster_name and points columns.

创建一个Series,其索引为审阅者,其值为该审阅者给出的平均审阅分数。 提示:您将需要taster_namepoints列。

#reviewer_mean_ratings = ____
reviewer_mean_ratings = reviews.groupby(['taster_name'])['points'].mean()
# Check your answer
q5.check()

reviewer_mean_ratings

Correct

taster_name
Alexander Peartree    85.855422
Anna Lee C. Iijima    88.415629
                        ...    
Susan Kostrzewa       86.609217
Virginie Boone        89.213379
Name: points, Length: 19, dtype: float64
#q5.hint()
q5.solution()

Solution:

reviewer_mean_ratings = reviews.groupby('taster_name').points.mean()

Are there significant differences in the average scores assigned by the various reviewers? Run the cell below to use the describe() method to see a summary of the range of values.

不同评审者给出的平均分数是否存在显着差异? 运行下面的单元格以使用describe()方法来查看值范围的摘要。

reviewer_mean_ratings.describe()
count    19.000000
mean     88.233026
           ...    
75%      88.975256
max      90.562551
Name: points, Length: 8, dtype: float64

6.

What combination of countries and varieties are most common? Create a Series whose index is a MultiIndexof {country, variety} pairs. For example, a pinot noir produced in the US should map to {"US", "Pinot Noir"}. Sort the values in the Series in descending order based on wine count.

哪种国家和品种的组合最常见? 创建一个Series,其索引是{country,varie}对的MultiIndex。 例如,美国生产的黑皮诺应映射到{"US", "Pinot Noir"}。 根据酒数按降序对系列中的值进行排序。

#country_variety_counts = ____
country_variety_counts = reviews.groupby(['country', 'variety'])['country'].count().sort_values(ascending=False)
# Check your answer
q6.check()

country_variety_counts

Correct

country  variety           
US       Pinot Noir            9885
         Cabernet Sauvignon    7315
                               ... 
Mexico   Rosado                   1
Uruguay  White Blend              1
Name: country, Length: 1612, dtype: int64
# q6.hint()
q6.solution()

Solution:

country_variety_counts = reviews.groupby(['country', 'variety']).size().sort_values(ascending=False)

Keep going

继续前进

Move on to the data types and missing data.

继续讨论数据类型和缺失数据

04.exercise-grouping-and-sorting【练习:分组与排序】

Leave a Reply

Your email address will not be published. Required fields are marked *

Scroll to top