This notebook is an exercise in the Pandas course. You can reference the tutorial at this link.
Introduction
Now you are ready to get a deeper understanding of your data.
Run the following cell to load your data and some utility functions (including code to check your answers).
import pandas as pd
pd.set_option("display.max_rows", 5)
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
from learntools.core import binder; binder.bind(globals())
from learntools.pandas.summary_functions_and_maps import *
print("Setup complete.")
reviews.head()
Setup complete.
| | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
Exercises
1.
What is the median of the points column in the reviews DataFrame?
#median_points = ____
median_points = reviews['points'].median()
# Check your answer
q1.check()
Correct
#q1.hint()
q1.solution()
Solution:
median_points = reviews.points.median()
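As a quick sanity check, here is a minimal sketch on a toy frame. The values in `toy_reviews` are made up for illustration and are not taken from the wine dataset. It shows that the median is the middle value of the sorted column and equals the 50% row reported by `describe()`.

```python
import pandas as pd

# Hypothetical toy data standing in for the real reviews DataFrame.
toy_reviews = pd.DataFrame({"points": [87, 90, 85, 100, 88]})

# Sorted values are 85, 87, 88, 90, 100, so the middle value is 88.
toy_median = toy_reviews["points"].median()

# describe() reports the same quantity as the 50% quantile.
toy_summary = toy_reviews["points"].describe()

print(toy_median)
print(toy_summary["50%"])
```

Unlike the mean, the median is not pulled upward by the single score of 100 in this toy column.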
2.
What countries are represented in the dataset? (Your answer should not include any duplicates.)
#countries = ____
countries = reviews['country'].unique()
# Check your answer
q2.check()
Correct
#q2.hint()
q2.solution()
Solution:
countries = reviews.country.unique()
3.
How often does each country appear in the dataset? Create a Series reviews_per_country mapping countries to the count of reviews of wines from that country.
#reviews_per_country = ____
reviews_per_country = reviews['country'].value_counts()
# Check your answer
q3.check()
Correct
#q3.hint()
q3.solution()
Solution:
reviews_per_country = reviews.country.value_counts()
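The two previous answers treat missing values differently, which is easy to overlook. The sketch below uses a hypothetical toy column (not the real dataset) to show that `unique()` keeps a missing value as one of its entries, while `value_counts()` drops missing values unless `dropna=False` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing country.
toy_reviews = pd.DataFrame(
    {"country": ["US", "Italy", "US", np.nan, "France", "US"]}
)

# unique() returns each distinct value once, including the missing one.
toy_countries = toy_reviews["country"].unique()

# value_counts() sorts by frequency and skips missing values by default.
toy_counts = toy_reviews["country"].value_counts()

# Pass dropna=False to count the missing entries as well.
toy_counts_with_nan = toy_reviews["country"].value_counts(dropna=False)

print(len(toy_countries), len(toy_counts), len(toy_counts_with_nan))
```

If the real country column contains missing values, the length of countries and the length of reviews_per_country can differ by one for this reason.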
4.
Create a variable centered_price containing a version of the price column with the mean price subtracted.
(Note: this 'centering' transformation is a common preprocessing step before applying various machine learning algorithms.)
#centered_price = ____
centered_price = reviews['price'] - reviews['price'].mean()
# Check your answer
q4.check()
Correct
#q4.hint()
q4.solution()
Solution:
centered_price = reviews.price - reviews.price.mean()
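The price column contains missing values (the first row of the preview shows NaN), so it helps to see how centering behaves in that case. This is a minimal sketch on hypothetical toy prices: `mean()` skips NaN by default, and the subtraction leaves NaN rows as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices, one of them missing.
toy_reviews = pd.DataFrame({"price": [10.0, 20.0, np.nan, 30.0]})

# mean() ignores NaN by default: (10 + 20 + 30) / 3 = 20.
toy_mean = toy_reviews["price"].mean()

# Subtracting a scalar broadcasts over the column; NaN stays NaN.
toy_centered = toy_reviews["price"] - toy_mean

print(toy_centered.tolist())
print(toy_centered.mean())
```

After centering, the mean of the non-missing values is zero up to floating-point error, which is the property the preprocessing step relies on.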
5.
I'm an economical wine buyer. Which wine is the "best bargain"? Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.
reviews[reviews['points']/reviews['price'] == (reviews['points']/reviews['price']).max()]['title']
64590 Bandit NV Merlot (California)
126096 Cramele Recas 2011 UnWineD Pinot Grigio (Viile...
Name: title, dtype: object
#bargain_wine = ____
# This answer is questionable: two wines tie for the highest ratio, so the question arguably has two answers
bargain_wine = reviews[reviews['points']/reviews['price'] == (reviews['points']/reviews['price']).max()][['title']]
bargain_wine
# The official solution is arguably wrong: idxmax returns only the first of the tied wines
bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']
# Check your answer
q5.check()
Correct
#q5.hint()
q5.solution()
Solution:
bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']
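The difference between the two approaches in this exercise comes down to how ties are handled. The sketch below uses a hypothetical toy frame (titles and numbers are invented) in which two wines share the best ratio. `idxmax()` returns only the first matching label, while a boolean mask keeps every tied row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Wine A and Wine C tie, Wine D has no price.
toy_reviews = pd.DataFrame(
    {
        "title": ["Wine A", "Wine B", "Wine C", "Wine D"],
        "points": [86, 90, 86, 88],
        "price": [4.0, 20.0, 4.0, np.nan],
    }
)

toy_ratio = toy_reviews["points"] / toy_reviews["price"]

# idxmax() skips NaN and returns the label of the first maximum only.
first_best = toy_reviews.loc[toy_ratio.idxmax(), "title"]

# A boolean mask against the maximum keeps all tied rows.
all_best = toy_reviews.loc[toy_ratio == toy_ratio.max(), "title"]

print(first_best)
print(all_best.tolist())
```

Which form is appropriate depends on whether a single title or the full list of tied titles is wanted. The checker in this exercise accepts the single-title form.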
6.
There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series descriptor_counts counting how many times each of these two words appears in the description column in the dataset. (For simplicity, let's ignore the capitalized versions of these words.)
#descriptor_counts = ____
n_tropical = reviews["description"].map(lambda p : "tropical" in p).sum()
n_fruity = reviews["description"].map(lambda p : "fruity" in p).sum()
descriptor_counts = pd.Series([n_tropical, n_fruity], index=['tropical', 'fruity'])
# Check your answer
q6.check()
descriptor_counts
Correct
tropical 3607
fruity 9090
dtype: int64
#q6.hint()
q6.solution()
Solution:
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
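Note that the `in` test counts the number of reviews that contain a word at least once, not the total number of occurrences. The sketch below, on hypothetical toy descriptions, contrasts the two using `Series.str.contains` and `Series.str.count`. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy descriptions, not taken from the wine dataset.
toy_reviews = pd.DataFrame(
    {
        "description": [
            "tropical fruit with a fruity finish",
            "fruity, very fruity, and smooth",
            "Tropical aromas and oak",
            "dry and mineral",
        ]
    }
)
words = ["tropical", "fruity"]

# Reviews containing the word at least once (same result as map + "in").
toy_review_counts = pd.Series(
    {w: toy_reviews["description"].str.contains(w, regex=False).sum() for w in words}
)

# Total occurrences across all reviews (a review can count more than once).
toy_occurrence_counts = pd.Series(
    {w: toy_reviews["description"].str.count(w).sum() for w in words}
)

print(toy_review_counts)
print(toy_occurrence_counts)
```

Both versions are case-sensitive, so the capitalized "Tropical" in the third toy description is ignored, in line with the simplification stated in the exercise.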
7.
We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand, so we'd like to translate the scores into simple star ratings. A score of 95 or higher counts as 3 stars, and a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.
Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.
Create a Series star_ratings with the number of stars corresponding to each review in the dataset.
def rate_star(row):
    #star = 0
    if row['country'] == 'Canada':
        star = 3
    else:
        if row['points'] >= 95:
            star = 3
        elif row['points'] < 85:
            star = 1
        else:
            star = 2
    return star
#star_ratings = ____
star_ratings = reviews.apply(rate_star, axis=1)
# Check your answer
q7.check()
Correct
#q7.hint()
q7.solution()
Solution:
def stars(row):
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1
star_ratings = reviews.apply(stars, axis='columns')
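Row-wise `apply` calls a Python function once per row, which can be slow on a frame with many rows. As one possible alternative (not the course's solution), the same rules can be sketched with `numpy.select`, which evaluates each condition on the whole column and takes the first match per row. The toy data and names below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data covering each branch of the rating rules.
toy_reviews = pd.DataFrame(
    {
        "country": ["Canada", "US", "Italy", "France", "Canada"],
        "points": [82, 95, 85, 84, 97],
    }
)

# Conditions are checked in order; the first True one decides the rating.
conditions = [
    (toy_reviews["country"] == "Canada").to_numpy(),
    (toy_reviews["points"] >= 95).to_numpy(),
    (toy_reviews["points"] >= 85).to_numpy(),
]
choices = [3, 3, 2]

# Rows matching no condition fall back to the default of 1 star.
toy_stars = pd.Series(
    np.select(conditions, choices, default=1), index=toy_reviews.index
)

print(toy_stars.tolist())
```

The order of the conditions mirrors the if/elif chain in the solution, so the Canada rule still takes precedence over the points thresholds.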
Keep going
Continue to grouping and sorting.