This notebook is an exercise in the Pandas course. You can reference the tutorial at this link.
Introduction
Run the following cell to load your data and some utility functions.
import pandas as pd
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
from learntools.core import binder; binder.bind(globals())
from learntools.pandas.data_types_and_missing_data import *
print("Setup complete.")
Setup complete.
Exercises
1.
What is the data type of the points column in the dataset?
# Your code here
#dtype = ____
dtype = reviews['points'].dtype
# Check your answer
q1.check()
dtype
Correct
dtype('int64')
#q1.hint()
q1.solution()
Solution:
dtype = reviews.points.dtype
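As a minimal sketch of the same idea on data that needs no download, the snippet below builds a small made-up DataFrame (the name toy and its values are assumptions for illustration, not part of the exercise) and reads the dtype of one column as well as the dtypes of all columns at once.

```python
import pandas as pd

# Toy stand-in for the wine reviews data; the values are invented for illustration.
toy = pd.DataFrame({
    "points": [87, 90, 92],
    "price": [15.0, None, 30.0],
})

# .dtype on a single column (a Series) returns that column's data type.
points_dtype = toy["points"].dtype

# .dtypes on the DataFrame returns a Series mapping each column name to its dtype.
all_dtypes = toy.dtypes

print(points_dtype)
print(all_dtypes)
```

Note that the price column comes out as a floating-point type here because it contains a missing value, which pandas stores as NaN.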
2.
Create a Series from entries in the points column, but convert the entries to strings. Hint: strings are str in native Python.
#point_strings = ____
# point_strings = reviews['points'].apply(lambda x: str(x))
point_strings = reviews['points'].astype(str)
# Check your answer
q2.check()
point_strings
Correct
0         87
1         87
2         87
3         87
4         87
          ..
129966    90
129967    90
129968    90
129969    90
129970    90
Name: points, Length: 129971, dtype: object
#q2.hint()
q2.solution()
Solution:
point_strings = reviews.points.astype(str)
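The two approaches tried above (astype(str) and apply with str) can be compared on a small made-up Series. This is only a sketch; the Series points and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up scores standing in for the points column.
points = pd.Series([87, 90, 92], name="points")

# Vectorised conversion: every entry becomes a Python string.
via_astype = points.astype(str)

# Element-wise conversion: calls str() on each entry in turn.
via_apply = points.apply(str)

# Both routes should yield the same string values.
print(via_astype.tolist() == via_apply.tolist())

# Once converted, "+" concatenates the entries instead of adding them.
joined = via_astype.iloc[0] + via_astype.iloc[1]
print(joined)
```

astype(str) is usually the more direct choice, since it states the intent (a type conversion) without a per-element function call.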
3.
Sometimes the price column is null. How many reviews in the dataset are missing a price?
#n_missing_prices = ____
n_missing_prices = reviews['price'].isnull().sum()
# Check your answer
q3.check()
n_missing_prices
Correct
8996
#q3.hint()
q3.solution()
Solution:
missing_price_reviews = reviews[reviews.price.isnull()]
n_missing_prices = len(missing_price_reviews)
# Cute alternative solution: if we sum a boolean series, True is treated as 1 and False as 0
n_missing_prices = reviews.price.isnull().sum()
# or equivalently:
n_missing_prices = pd.isnull(reviews.price).sum()
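To see why summing a boolean Series counts missing values, here is a small sketch on invented data (the DataFrame toy and its prices are assumptions for illustration). It compares the filter-and-len approach, the sum approach, and a third option based on count(), which counts only non-missing entries.

```python
import numpy as np
import pandas as pd

# Made-up prices; two of the five entries are missing.
toy = pd.DataFrame({"price": [15.0, np.nan, 30.0, np.nan, 22.0]})

# Boolean mask: True where the price is missing.
is_missing = toy["price"].isnull()

# Approach 1: keep only the rows with a missing price, then count them.
n_by_filter = len(toy[is_missing])

# Approach 2: sum the mask directly (True counts as 1, False as 0).
n_by_sum = is_missing.sum()

# Approach 3: total rows minus the non-missing count.
n_by_count = len(toy) - toy["price"].count()

print(n_by_filter, n_by_sum, n_by_count)
```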
4.
What are the most common wine-producing regions? Create a Series counting the number of times each value occurs in the region_1 field. This field is often missing data, so replace missing values with Unknown. Sort in descending order. Your output should look something like this:
Unknown 21247
Napa Valley 4480
...
Bardolino Superiore 1
Primitivo del Tarantino 1
Name: region_1, Length: 1230, dtype: int64
#reviews_per_region = ____
# reviews_per_region = reviews[['region_1']].fillna('Unknown').groupby(['region_1']).size().sort_values(ascending=False)
reviews_per_region = reviews.loc[:,'region_1'].fillna('Unknown').value_counts().sort_values(ascending=False)
# Check your answer
q4.check()
Correct
#q4.hint()
q4.solution()
Solution:
reviews_per_region = reviews.region_1.fillna('Unknown').value_counts().sort_values(ascending=False)
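The fillna-then-count pattern can be tried on a tiny made-up column. This is a sketch under assumptions: the DataFrame toy and its region names are chosen for illustration only. With its default arguments, value_counts() already returns counts in descending order, so the extra sort_values(ascending=False) mainly makes the ordering explicit.

```python
import pandas as pd

# Made-up regions; None marks a missing value.
toy = pd.DataFrame({
    "region_1": ["Napa Valley", None, "Etna", "Napa Valley", None, None],
})

# Replace missing regions with a placeholder, then count each distinct value.
counts = toy["region_1"].fillna("Unknown").value_counts()

# Without fillna, value_counts() drops missing values by default.
counts_without_fill = toy["region_1"].value_counts()

print(counts)
print(counts_without_fill)
```

Skipping fillna would silently leave the missing rows out of the result, which is why the exercise asks for the Unknown placeholder first.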
Keep going
Move on to renaming and combining.