Flashield's Blog

Just For My Daily Diary

03.exercise-summary-functions-and-maps (Exercise: Summary Functions and Maps)

This notebook is an exercise in the Pandas course. You can reference the tutorial at this link.


Introduction

Now you are ready to get a deeper understanding of your data.

Run the following cell to load your data and some utility functions (including code to check your answers).

import pandas as pd
pd.set_option("display.max_rows", 5)
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

from learntools.core import binder; binder.bind(globals())
from learntools.pandas.summary_functions_and_maps import *
print("Setup complete.")

reviews.head()
Setup complete.
  | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian
4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks

Exercises

1.

What is the median of the points column in the reviews DataFrame?

#median_points = ____

median_points = reviews['points'].median()

# Check your answer
q1.check()

Correct

#q1.hint()
q1.solution()

Solution:

median_points = reviews.points.median()
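As a quick sanity check of what median() computes, here is a minimal sketch on a made-up toy DataFrame. The values and the name toy are illustrative assumptions, not data from the wine reviews file.

```python
import pandas as pd

# Hypothetical toy data; not taken from the wine reviews dataset
toy = pd.DataFrame({"points": [80, 85, 87, 90, 100]})

# median() returns the middle value of the sorted column
toy_median = toy["points"].median()

# Bracket access and attribute access select the same column,
# so both spellings used in this exercise give the same result
same_result = toy.points.median() == toy_median

print(toy_median)
```

Either spelling works for a column such as points; bracket access is the safer habit when a column name contains spaces or collides with a DataFrame method.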

2.

What countries are represented in the dataset? (Your answer should not include any duplicates.)

#countries = ____

countries = reviews['country'].unique()

# Check your answer
q2.check()

Correct

#q2.hint()
q2.solution()

Solution:

countries = reviews.country.unique()
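One detail worth knowing about unique() is how it treats missing values. The sketch below uses a small invented Series (the name toy and its values are assumptions for illustration) to show that unique() keeps a missing entry as one of the distinct values, while nunique() ignores it by default.

```python
import pandas as pd

# Hypothetical toy column with one missing value
toy = pd.Series(["Italy", "US", None, "US", "Italy"], name="country")

# unique() returns the distinct values in order of first appearance,
# and the missing value counts as one of them
unique_values = toy.unique()

# nunique() skips missing values unless dropna=False is passed
n_distinct = toy.nunique()

# Dropping missing values first gives a clean list of labels
clean = toy.dropna().unique()

print(unique_values, n_distinct, clean)
```

If a real column contains missing entries, calling dropna() before unique() is one way to get only the actual labels.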

3.

How often does each country appear in the dataset? Create a Series reviews_per_country mapping countries to the count of reviews of wines from that country.

#reviews_per_country = ____

reviews_per_country = reviews['country'].value_counts()

# Check your answer
q3.check()

Correct

#q3.hint()
q3.solution()

Solution:

reviews_per_country = reviews.country.value_counts()
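To see the shape of what value_counts() returns, here is a small sketch on invented data (toy and its values are assumptions). It also shows the normalize=True option, which returns shares instead of raw counts.

```python
import pandas as pd

# Hypothetical toy column of country labels
toy = pd.Series(["US", "Italy", "US", "France", "US", "Italy"])

# value_counts() maps each distinct value to its frequency,
# sorted from most to least frequent
counts = toy.value_counts()

# normalize=True returns relative frequencies that sum to 1
shares = toy.value_counts(normalize=True)

print(counts)
print(shares)
```

The result is itself a Series indexed by the distinct values, so a single count can be looked up with counts["US"].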

4.

Create a variable centered_price containing a version of the price column with the mean price subtracted.

(Note: this 'centering' transformation is a common preprocessing step before applying various machine learning algorithms.)

#centered_price = ____

centered_price = reviews['price'] - reviews['price'].mean()

# Check your answer
q4.check()

Correct

#q4.hint()
q4.solution()

Solution:

centered_price = reviews.price - reviews.price.mean()
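The centering step can be checked on a tiny invented example (toy_price and its values are assumptions). The sketch shows two things: mean() skips missing values by default, and the centered column has a mean of zero while the missing entries stay missing.

```python
import pandas as pd

# Hypothetical toy prices with one missing value
toy_price = pd.Series([10.0, 20.0, None, 30.0])

# mean() ignores NaN by default, so the mean here is 20.0
price_mean = toy_price.mean()

# Subtracting a scalar broadcasts over every row
centered = toy_price - price_mean

print(centered)
print(centered.mean())  # mean of the centered values is 0
```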

5.

I'm an economical wine buyer. Which wine is the "best bargain"? Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.

reviews[reviews['points']/reviews['price'] == (reviews['points']/reviews['price']).max()]['title']
64590                         Bandit NV Merlot (California)
126096    Cramele Recas 2011 UnWineD Pinot Grigio (Viile...
Name: title, dtype: object
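#bargain_wine = ____
# This question is ambiguous: two wines tie for the highest ratio, so there should be two answers
bargain_wine = reviews[reviews['points']/reviews['price'] == (reviews['points']/reviews['price']).max()][['title']]
bargain_wine
# The official solution is flawed: idxmax returns only the first of the tied wines
bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']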

# Check your answer
q5.check()

Correct

#q5.hint()
q5.solution()

Solution:

bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']
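The difference between the two approaches above comes down to how ties are handled. The following sketch uses an invented three-row DataFrame (toy, its titles, and its numbers are assumptions) in which two rows share the highest ratio.

```python
import pandas as pd

# Hypothetical toy data in which rows 0 and 2 tie for the best ratio
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "points": [80, 90, 40],
    "price": [4.0, 9.0, 2.0],
})

ratio = toy["points"] / toy["price"]  # 20.0, 10.0, 20.0

# idxmax() returns the label of the FIRST maximum only
first_idx = ratio.idxmax()
first_title = toy.loc[first_idx, "title"]

# A boolean mask keeps every row that reaches the maximum
all_best = toy.loc[ratio == ratio.max(), "title"]

print(first_title)
print(list(all_best))
```

Which form is appropriate depends on whether a single title or the full set of tied titles is wanted; the checker in this exercise accepts the single-title form.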

6.

There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series descriptor_counts counting how many times each of these two words appears in the description column in the dataset. (For simplicity, let's ignore the capitalized versions of these words.)

#descriptor_counts = ____

n_tropical = reviews["description"].map(lambda p : "tropical" in p).sum()
n_fruity = reviews["description"].map(lambda p : "fruity" in p).sum()
descriptor_counts = pd.Series([n_tropical, n_fruity], index=['tropical', 'fruity'])

# Check your answer
q6.check()

descriptor_counts

Correct

tropical    3607
fruity      9090
dtype: int64
#q6.hint()
q6.solution()

Solution:

n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
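Note that the map-based solution counts descriptions that contain the word, not total occurrences of the word. The sketch below, on invented descriptions (toy is an assumption), compares it with two vectorized string methods: str.contains gives the same per-row count, while str.count tallies every occurrence.

```python
import pandas as pd

# Hypothetical toy descriptions; the first mentions "tropical" twice
toy = pd.Series([
    "tropical fruit and more tropical notes",
    "ripe and fruity",
    "Tropical aromas",
    "dry",
])

# Row-wise membership test, as in the exercise solution
rows_map = toy.map(lambda desc: "tropical" in desc).sum()

# Vectorized equivalent; regex=False treats the pattern as plain text
rows_contains = toy.str.contains("tropical", regex=False).sum()

# str.count counts every occurrence, so repeated words add up
occurrences = toy.str.count("tropical").sum()

print(rows_map, rows_contains, occurrences)
```

All three are case-sensitive here, which matches the exercise's instruction to ignore capitalized versions of the words.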

7.

We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand, so we'd like to translate the scores into simple star ratings. A score of 95 or higher counts as 3 stars; a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.

Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.

Create a Series star_ratings with the number of stars corresponding to each review in the dataset.

def rate_star(row):
    #star = 0
    if row['country'] == 'Canada':
        star = 3
    else:
        if row['points'] >= 95:
            star = 3
        elif row['points'] < 85:
            star = 1
        else:
            star = 2
    return star
#star_ratings = ____
star_ratings = reviews.apply(rate_star, axis=1)
# Check your answer
q7.check()

Correct

#q7.hint()
q7.solution()

Solution:

def stars(row):
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

star_ratings = reviews.apply(stars, axis='columns')
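Row-wise apply is easy to read but can be slow on large frames. As a possible alternative, the rating rules can be expressed with numpy.select, which evaluates the conditions column-wise. This is a sketch on invented data (toy, conditions, and stars_vec are assumed names), not the course's solution.

```python
import numpy as np
import pandas as pd

# Hypothetical toy reviews covering each branch of the rules
toy = pd.DataFrame({
    "country": ["Canada", "US", "Italy", "France"],
    "points": [82, 96, 85, 84],
})

# Conditions are checked in order; the first match wins
conditions = [
    (toy["country"] == "Canada") | (toy["points"] >= 95),  # 3 stars
    toy["points"] >= 85,                                    # 2 stars
]

# Rows matching no condition fall back to the default of 1 star
stars_vec = pd.Series(
    np.select(conditions, [3, 2], default=1),
    index=toy.index,
)

print(list(stars_vec))
```

The ordering of the conditions plays the same role as the if/elif chain in the function-based solution.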

Keep going

Continue to grouping and sorting.

