Flashield's Blog

Just For My Daily Diary

04.exercise-character-encodings [Exercise: Character Encodings]

This notebook is an exercise in the Data Cleaning course. You can reference the tutorial at this link.


In this exercise, you'll apply what you learned in the Character encodings tutorial.

Setup

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

from learntools.core import binder
binder.bind(globals())
from learntools.data_cleaning.ex4 import *
print("Setup Complete")
Setup Complete

Get our environment set up

The first thing we'll need to do is load in the libraries we'll be using.

# modules we'll use
import pandas as pd
import numpy as np

# helpful character encoding module
import chardet

# set seed for reproducibility
np.random.seed(0)

1) What are encodings?

You're working with a dataset composed of bytes. Run the code cell below to print a sample entry.

sample_entry = b'\xa7A\xa6n'
print(sample_entry)
print('data type:', type(sample_entry))
b'\xa7A\xa6n'
data type: <class 'bytes'>
chardet.detect(sample_entry)
{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}

You notice that it doesn't use the standard UTF-8 encoding.
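To see concretely why these bytes are not UTF-8, the minimal sketch below tries three codecs on the same four bytes. It relies only on Python's built-in codecs; the helper name try_decode is made up for this illustration and is not part of the exercise.

```python
sample_entry = b'\xa7A\xa6n'

def try_decode(raw, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding.

    Hypothetical helper for illustration only.
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

# Compare how the same bytes are interpreted under each codec.
for enc in ("utf-8", "latin-1", "big5-tw"):
    print(enc, "->", repr(try_decode(sample_entry, enc)))
```

Latin-1 (ISO-8859-1) assigns a character to every possible byte, so decoding with it never fails even when the result is not the intended text. That may be part of why a detector can report it with moderate confidence on a very short input like this one.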

Use the next code cell to create a variable new_entry that changes the encoding from "big5-tw" to "utf-8". new_entry should have the bytes datatype.

new_entry = sample_entry.decode("big5-tw").encode("utf-8")

# Check your answer
q1.check()

Correct

# Lines below will give you a hint or solution code
#q1.hint()
q1.solution()

Solution:

before = sample_entry.decode("big5-tw")
new_entry = before.encode()
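As a quick sanity check on the answer and the solution, the sketch below confirms that the re-encoded value is still bytes, that it differs from the original bytes, and that it decodes back to the same text. It also shows why the two forms are equivalent: str.encode() with no argument defaults to UTF-8.

```python
sample_entry = b'\xa7A\xa6n'

text = sample_entry.decode("big5-tw")   # bytes -> str, interpreting the bytes as Big5
new_entry = text.encode("utf-8")        # str -> bytes, this time as UTF-8

print(type(new_entry))                    # still a bytes object
print(new_entry)                          # different bytes, same text
print(new_entry == text.encode())         # encode() defaults to UTF-8
print(new_entry.decode("utf-8") == text)  # round trip preserves the text
```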

2) Reading in files with encoding problems

Use the code cell below to read in the file at path "../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv".

Figure out what the correct encoding should be and read in the file to a DataFrame police_killings.

with open("../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))

result
{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
pd.read_csv("../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv", encoding="Windows-1252").head()
   id  name                date      manner_of_death   armed       age   gender  race  city           state  signs_of_mental_illness  threat_level  flee         body_camera
0  3   Tim Elliot          02/01/15  shot              gun         53.0  M       A     Shelton        WA     True                     attack        Not fleeing  False
1  4   Lewis Lee Lembke    02/01/15  shot              gun         47.0  M       W     Aloha          OR     False                    attack        Not fleeing  False
2  5   John Paul Quintero  03/01/15  shot and Tasered  unarmed     23.0  M       H     Wichita        KS     False                    other         Not fleeing  False
3  8   Matthew Hoffman     04/01/15  shot              toy weapon  32.0  M       W     San Francisco  CA     True                     attack        Not fleeing  False
4  9   Michael Rodriguez   04/01/15  shot              nail gun    39.0  M       H     Evans          CO     False                    attack        Not fleeing  False
# TODO: Load in the DataFrame correctly.
police_killings = pd.read_csv("../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv", encoding="Windows-1252")

# Check your answer
q2.check()

Correct

Feel free to use any additional code cells for supplemental work. To get credit for finishing this question, you'll need to run q2.check() and get a result of Correct.

# (Optional) Use this code cell for any additional work.
# Lines below will give you a hint or solution code
# q2.hint()
q2.solution()

Solution:

police_killings = pd.read_csv("../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv", encoding='Windows-1252')
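The detection step above reported Windows-1252 with a confidence of only 0.73, so it can be worth cross-checking the guess. One simple approach, sketched below with the standard library only, is to try a short list of candidate encodings in order and keep the first one that decodes without error. The function name first_working_encoding, the candidate list, and the demo file are assumptions made for this illustration; the demo does not touch the real dataset.

```python
import os
import tempfile

def first_working_encoding(path, candidates):
    """Return the first candidate encoding that decodes the whole file, else None.

    Hypothetical helper: a successful decode does not prove the encoding is
    correct, only that the bytes are valid in it.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

# Demo on a small throwaway file written as Windows-1252 (cp1252).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="cp1252", newline="") as f:
        f.write("name,city\nJosé,Café Town\n")
    guess = first_working_encoding(demo_path, ["utf-8", "cp1252"])

print(guess)
```

Putting "utf-8" first is deliberate: UTF-8 is strict about which byte sequences are valid, so non-UTF-8 text with accented characters usually fails quickly, whereas single-byte encodings accept almost anything.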

3) Saving your files with UTF-8 encoding

Save a version of the police killings dataset to CSV with UTF-8 encoding. Your answer will be marked correct after saving this file.

Note: When using the to_csv() method, supply only the name of the file (e.g., "my_file.csv"). This saves the file at the filepath "/kaggle/working/my_file.csv".

# TODO: Save the police killings dataset to CSV
police_killings.to_csv("/kaggle/working/my_file.csv", encoding="utf-8")

# Check your answer
q3.check()

Correct

# Lines below will give you a hint or solution code
#q3.hint()
q3.solution()

Solution:


police_killings.to_csv("my_file.csv")
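The solution omits the encoding argument because to_csv() writes UTF-8 by default. If you want to confirm that a saved file really is valid UTF-8, a check along the following lines may help. It is a sketch using only the standard library; is_valid_utf8 and the two demo files are hypothetical names, and on Kaggle you would point the function at your own saved file instead.

```python
import os
import tempfile

def is_valid_utf8(path):
    """Return True if the whole file decodes as UTF-8 without errors.

    Hypothetical helper for illustration only.
    """
    with open(path, "rb") as f:
        raw = f.read()
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Demo: the same text saved once as UTF-8 and once as Windows-1252.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.csv")
    bad = os.path.join(tmp, "bad.csv")
    with open(good, "w", encoding="utf-8", newline="") as f:
        f.write("name\nJosé\n")
    with open(bad, "w", encoding="cp1252", newline="") as f:
        f.write("name\nJosé\n")
    good_ok = is_valid_utf8(good)
    bad_ok = is_valid_utf8(bad)

print(good_ok, bad_ok)
```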

(Optional) More practice

Check out this dataset of files in different character encodings. Can you read in all the files with their original encodings and then save them out as UTF-8 files?
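One possible shape for that conversion is sketched below: read a file as text using its original encoding, then write the same text back out as UTF-8. The function name convert_to_utf8 and the demo file names are made up for this example, and the source encoding is assumed to be known already (for instance from chardet or from the dataset's documentation).

```python
import os
import tempfile

def convert_to_utf8(src_path, dst_path, src_encoding):
    """Re-save a text file as UTF-8, given its original encoding.

    Hypothetical helper; newline="" keeps the original line endings untouched.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)

# Demo: write the Big5 bytes from question 1 to a file, then convert it.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "greeting_big5.txt")
    dst = os.path.join(tmp, "greeting_utf8.txt")
    with open(src, "wb") as f:
        f.write(b'\xa7A\xa6n')
    convert_to_utf8(src, dst, "big5")
    with open(dst, "rb") as f:
        converted = f.read()

print(converted)
```

To handle a whole folder, the same function could be called in a loop over file paths paired with their encodings.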

If you have a file that's in UTF-8 but has just a couple of weird-looking characters in it, you can try out the ftfy module and see if it helps.
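For intuition about where such weird-looking characters come from, the sketch below reproduces one common case by hand: UTF-8 bytes that were mistakenly decoded as Windows-1252, and the reverse step that undoes the damage. This is only an illustration of the kind of problem involved, using built-in codecs; it is not ftfy's implementation, and it assumes you already know which wrong encoding was applied.

```python
original = "Café"

# Simulate the mistake: UTF-8 bytes read as if they were Windows-1252.
mojibake = original.encode("utf-8").decode("cp1252")
print(mojibake)

# Undo it: recover the raw bytes, then decode them as UTF-8.
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repaired)
```

A dedicated tool is still useful in practice, because real text often mixes damaged and undamaged characters and the wrong encoding is rarely known in advance.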

Keep going

In the final lesson, learn how to clean up inconsistent text entries in your dataset.
