This notebook is an exercise in the Pandas course. You can reference the tutorial at this link.
Introduction
介绍
In these exercises we'll apply groupwise analysis to our dataset.
在这些练习中,我们将对数据集应用分组分析。
Run the code cell below to load the data before running the exercises.
在运行练习之前,运行下面的代码单元以加载数据。
import pandas as pd
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option("display.max_rows", 5)
from learntools.core import binder; binder.bind(globals())
from learntools.pandas.grouping_and_sorting import *
print("Setup complete.")
Setup complete.
/opt/conda/lib/python3.10/site-packages/learntools/pandas/grouping_and_sorting.py:29: FutureWarning: The provided callable is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
_expected = reviews.groupby('variety').price.agg([min, max])
/opt/conda/lib/python3.10/site-packages/learntools/pandas/grouping_and_sorting.py:29: FutureWarning: The provided callable is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
_expected = reviews.groupby('variety').price.agg([min, max])
/opt/conda/lib/python3.10/site-packages/learntools/pandas/grouping_and_sorting.py:35: FutureWarning: The provided callable is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
_expected = reviews.groupby('variety').price.agg([min, max]).sort_values(by=['min', 'max'], ascending=False)
/opt/conda/lib/python3.10/site-packages/learntools/pandas/grouping_and_sorting.py:35: FutureWarning: The provided callable is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
_expected = reviews.groupby('variety').price.agg([min, max]).sort_values(by=['min', 'max'], ascending=False)
Exercises
练习
1.
Who are the most common wine reviewers in the dataset? Create a Series
whose index is the taster_twitter_handle
category from the dataset, and whose values count how many reviews each person wrote.
数据集中最常见的葡萄酒评论家是谁? 创建一个Series
,其索引是数据集中的taster_twitter_handle
类别,其值计算每个人撰写的评论数量。
# Your code here
#reviews_written = ____
reviews_written = reviews.groupby(['taster_twitter_handle'])['taster_twitter_handle'].count()
# Check your answer
q1.check()
reviews_written
Correct:
reviews_written = reviews.groupby('taster_twitter_handle').size()
or
reviews_written = reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()
taster_twitter_handle
@AnneInVino 3685
@JoeCz 5147
...
@winewchristina 6
@worldwineguys 1005
Name: taster_twitter_handle, Length: 15, dtype: int64
# q1.hint()
q1.solution()
Solution:
reviews_written = reviews.groupby('taster_twitter_handle').size()
or
reviews_written = reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()
2.
What is the best wine I can buy for a given amount of money? Create a Series
whose index is wine prices and whose values is the maximum number of points a wine costing that much was given in a review. Sort the values by price, ascending (so that 4.0
dollars is at the top and 3300.0
dollars is at the bottom).
在给定的金额下我能买到的最好的葡萄酒是什么? 创建一个系列
,其索引是葡萄酒价格,其值是评论中给出的价格相同的葡萄酒的最大分数。 按价格升序对值进行排序(4.0
美元位于顶部,3300.0
美元位于底部)。
#best_rating_per_price = ____
best_rating_per_price = reviews.groupby(['price'])['points'].max().sort_index()
# Check your answer
q2.check()
best_rating_per_price
Correct
price
4.0 86
5.0 87
..
2500.0 96
3300.0 88
Name: points, Length: 390, dtype: int64
# q2.hint()
q2.solution()
Solution:
best_rating_per_price = reviews.groupby('price')['points'].max().sort_index()
3.
What are the minimum and maximum prices for each variety
of wine? Create a DataFrame
whose index is the variety
category from the dataset and whose values are the min
and max
values thereof.
每种品种
葡萄酒的最低和最高价格是多少? 创建一个DataFrame
,其索引是数据集中的variety
类别,其值是其min
和max
值。
#price_extremes = ____
price_extremes = reviews.groupby(['variety'])['price'].agg(['min','max'])
# Check your answer
q3.check()
price_extremes
Correct
min | max | |
---|---|---|
variety | ||
Abouriou | 15.0 | 75.0 |
Agiorgitiko | 10.0 | 66.0 |
... | ... | ... |
Çalkarası | 19.0 | 19.0 |
Žilavka | 15.0 | 15.0 |
707 rows × 2 columns
# q3.hint()
q3.solution()
Solution:
price_extremes = reviews.groupby('variety').price.agg([min, max])
4.
What are the most expensive wine varieties? Create a variable sorted_varieties
containing a copy of the dataframe from the previous question where varieties are sorted in descending order based on minimum price, then on maximum price (to break ties).
最昂贵的葡萄酒品种有哪些? 创建一个变量sorted_varieties
,其中包含上一个问题的DataFrame的副本,其中品种根据最低价格按降序排序,然后根据最高价格(以打破平局)。
#sorted_varieties = ____
sorted_varieties = price_extremes.sort_values(['min', 'max'], ascending=[False, False])
# Check your answer
q4.check()
sorted_varieties
Correct
min | max | |
---|---|---|
variety | ||
Ramisco | 495.0 | 495.0 |
Terrantez | 236.0 | 236.0 |
... | ... | ... |
Vital | NaN | NaN |
Zelen | NaN | NaN |
707 rows × 2 columns
# q4.hint()
q4.solution()
Solution:
sorted_varieties = price_extremes.sort_values(by=['min', 'max'], ascending=False)
5.
Create a Series
whose index is reviewers and whose values is the average review score given out by that reviewer. Hint: you will need the taster_name
and points
columns.
创建一个Series
,其索引为审阅者,其值为该审阅者给出的平均审阅分数。 提示:您将需要taster_name
和points
列。
#reviewer_mean_ratings = ____
reviewer_mean_ratings = reviews.groupby(['taster_name'])['points'].mean()
# Check your answer
q5.check()
reviewer_mean_ratings
Correct
taster_name
Alexander Peartree 85.855422
Anna Lee C. Iijima 88.415629
...
Susan Kostrzewa 86.609217
Virginie Boone 89.213379
Name: points, Length: 19, dtype: float64
#q5.hint()
q5.solution()
Solution:
reviewer_mean_ratings = reviews.groupby('taster_name').points.mean()
Are there significant differences in the average scores assigned by the various reviewers? Run the cell below to use the describe()
method to see a summary of the range of values.
不同评审者给出的平均分数是否存在显着差异? 运行下面的单元格以使用describe()
方法来查看值范围的摘要。
reviewer_mean_ratings.describe()
count 19.000000
mean 88.233026
...
75% 88.975256
max 90.562551
Name: points, Length: 8, dtype: float64
6.
What combination of countries and varieties are most common? Create a Series
whose index is a MultiIndex
of {country, variety}
pairs. For example, a pinot noir produced in the US should map to {"US", "Pinot Noir"}
. Sort the values in the Series
in descending order based on wine count.
哪种国家和品种的组合最常见? 创建一个Series
,其索引是{country,varie}
对的MultiIndex
。 例如,美国生产的黑皮诺应映射到{"US", "Pinot Noir"}
。 根据酒数按降序对系列
中的值进行排序。
#country_variety_counts = ____
country_variety_counts = reviews.groupby(['country', 'variety'])['country'].count().sort_values(ascending=False)
# Check your answer
q6.check()
country_variety_counts
Correct
country variety
US Pinot Noir 9885
Cabernet Sauvignon 7315
...
Mexico Rosado 1
Uruguay White Blend 1
Name: country, Length: 1612, dtype: int64
# q6.hint()
q6.solution()
Solution:
country_variety_counts = reviews.groupby(['country', 'variety']).size().sort_values(ascending=False)
Keep going
继续前进
Move on to the data types and missing data.
继续讨论数据类型和缺失数据。