04.course-character-encodings [Character Encodings]

In this notebook, we're going to be working with different character encodings.

Let's get started!

Get our environment set up

The first thing we'll need to do is load in the libraries we'll be using. Not our dataset, though: we'll get to it later!

# modules we'll use
import pandas as pd
import numpy as np

# helpful character encoding module
import charset_normalizer

# set seed for reproducibility
np.random.seed(0)

What are encodings?

Character encodings are specific sets of rules for mapping from raw binary byte strings (that look like this: 0110100001101001) to characters that make up human-readable text (like "hi"). There are many different encodings, and if you try to read in text with a different encoding than the one it was originally written in, you end up with scrambled text called "mojibake" (said like mo-gee-bah-kay). Here's an example of mojibake:

æ–‡å—化ã??
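
You can make mojibake yourself by decoding bytes with the wrong encoding. Here's a minimal sketch: the UTF-8 bytes for the euro symbol, read as if they were Windows-1252:

# a sketch: decode UTF-8 bytes as Windows-1252 to produce mojibake
print("€".encode("utf-8").decode("windows-1252"))
â‚¬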

You might also end up with "unknown" characters. These are what gets printed when there's no mapping between a particular byte and a character in the encoding you're using to read your byte string in, and they look like this:

����������
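
These come from replacement during decoding: every byte that has no mapping in the encoding you asked for gets swapped for the Unicode replacement character. A minimal sketch, using the euro symbol's UTF-8 bytes again:

# a sketch: every byte ASCII can't map becomes the replacement character
print(b"\xe2\x82\xac".decode("ascii", errors="replace"))
���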

Character encoding mismatches are less common today than they used to be, but it's definitely still a problem. There are lots of different character encodings, but the main one you need to know is UTF-8.

UTF-8 is the standard text encoding. All Python code is in UTF-8 and, ideally, all your data should be as well. It's when things aren't in UTF-8 that you run into trouble.

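You can ask Python about this directly; a quick sketch:

# Python 3's default string encoding is UTF-8
import sys
print(sys.getdefaultencoding())
utf-8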

It was pretty hard to deal with encodings in Python 2, but thankfully in Python 3 it's a lot simpler. (Kaggle Notebooks only use Python 3.) There are two main data types you'll encounter when working with text in Python 3. One is the string, which is what text is by default.

# start with a string
before = "This is the euro symbol: €"

# check to see what datatype it is
type(before)
str

The other is the bytes data type, which is a sequence of integers. You can convert a string into bytes by specifying which encoding it's in:

# encode the string to UTF-8 bytes, replacing characters that raise errors
after = before.encode("utf-8", errors="replace")

# check the type
type(after)
bytes
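
Since bytes really are a sequence of integers, indexing into one gives you a number rather than a character; a quick sketch:

# indexing a bytes object gives an integer: 84 is the ASCII code for "T"
print(after[0])

# and the euro symbol takes three bytes in UTF-8
print(list("€".encode("utf-8")))
84
[226, 130, 172]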

If you look at a bytes object, you'll see that it has a b in front of it, and then maybe some text after. That's because bytes are printed out as if they were characters encoded in ASCII. (ASCII is an older character encoding that doesn't really work for writing any language other than English.) Here you can see that our euro symbol has been replaced with some mojibake that looks like "\xe2\x82\xac" when it's printed as if it were an ASCII string.

# take a look at what the bytes look like
after
b'This is the euro symbol: \xe2\x82\xac'

When we convert our bytes back to a string with the correct encoding, we can see that our text is all there correctly, which is great! 🙂

# convert it back to utf-8
print(after.decode("utf-8"))
This is the euro symbol: €

However, when we try to use a different encoding to map our bytes into a string, we get an error. This is because the encoding we're trying to use doesn't know what to do with the bytes we're trying to pass it. You need to tell Python the encoding that the byte string is actually supposed to be in.

You can think of different encodings as different ways of recording music. You can record the same music on a CD, cassette tape or 8-track. While the music may sound more-or-less the same, you need to use the right equipment to play the music from each recording format. The correct decoder is like a cassette player or a CD player. If you try to play a cassette in a CD player, it just won't work.

# try to decode our bytes with the ascii encoding
print(after.decode("ascii"))
---------------------------------------------------------------------------

UnicodeDecodeError                        Traceback (most recent call last)

Cell In[6], line 2
      1 # try to decode our bytes with the ascii encoding
----> 2 print(after.decode("ascii"))

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 25: ordinal not in range(128)

We can also run into trouble if we try to use the wrong encoding to map from a string to bytes. Like I said earlier, strings are UTF-8 by default in Python 3, so if we try to treat them like they were in another encoding we'll create problems.

For example, if we try to convert a string to bytes for ASCII using encode(), we can ask for the bytes to be what they would be if the text was in ASCII. Since our text isn't in ASCII, though, there will be some characters it can't handle. We can automatically replace the characters that ASCII can't handle. If we do that, however, any characters not in ASCII will just be replaced with the unknown character, and when we convert the bytes back to a string the replacement sticks. The dangerous part about this is that there's no way to tell which character it should have been. That means we may have just made our data unusable!

# start with a string
before = "This is the euro symbol: €"

# encode it to ASCII, replacing characters that raise errors
after = before.encode("ascii", errors="replace")

# convert it back to a string
print(after.decode("ascii"))

# We've lost the original underlying byte string! It's been 
# replaced with the underlying byte string for the unknown character :(
This is the euro symbol: ?
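
For comparison, the default error handling (errors="strict") refuses to encode and raises an error instead of silently replacing characters; a quick sketch:

# with the default errors="strict", encoding fails loudly instead of replacing
try:
    "This is the euro symbol: €".encode("ascii")
except UnicodeEncodeError as error:
    print(error)
'ascii' codec can't encode character '\u20ac' in position 25: ordinal not in range(128)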

Losing characters like this is bad and we want to avoid doing it! It's far better to convert all our text to UTF-8 as soon as we can and keep it in that encoding. The best time to convert non-UTF-8 input into UTF-8 is when you read in files, which we'll talk about next.

Reading in files with encoding problems

Most files you'll encounter will probably be encoded with UTF-8. This is what Python expects by default, so most of the time you won't run into problems. However, sometimes you'll get an error like this:

# try to read in a file not in UTF-8
kickstarter_2016 = pd.read_csv("../00 datasets/kemical/kickstarter-projects/ks-projects-201612.csv")
---------------------------------------------------------------------------

UnicodeDecodeError                        Traceback (most recent call last)

Cell In[8], line 2
      1 # try to read in a file not in UTF-8
----> 2 kickstarter_2016 = pd.read_csv("../00 datasets/kemical/kickstarter-projects/ks-projects-201612.csv")

File ~/Env/pyenv/jupyter/lib/python3.10/site-packages/pandas/io/parsers/readers.py:1024, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
   1011 kwds_defaults = _refine_defaults_read(
   1012     dialect,
   1013     delimiter,
   (...)
   1020     dtype_backend=dtype_backend,
   1021 )
   1022 kwds.update(kwds_defaults)
-> 1024 return _read(filepath_or_buffer, kwds)

File ~/Env/pyenv/jupyter/lib/python3.10/site-packages/pandas/io/parsers/readers.py:618, in _read(filepath_or_buffer, kwds)
    615 _validate_names(kwds.get("names", None))
    617 # Create the parser.
--> 618 parser = TextFileReader(filepath_or_buffer, **kwds)
    620 if chunksize or iterator:
    621     return parser

File ~/Env/pyenv/jupyter/lib/python3.10/site-packages/pandas/io/parsers/readers.py:1618, in TextFileReader.__init__(self, f, engine, **kwds)
   1615     self.options["has_index_names"] = kwds["has_index_names"]
   1617 self.handles: IOHandles | None = None
-> 1618 self._engine = self._make_engine(f, self.engine)

File ~/Env/pyenv/jupyter/lib/python3.10/site-packages/pandas/io/parsers/readers.py:1896, in TextFileReader._make_engine(self, f, engine)
   1893     raise ValueError(msg)
   1895 try:
-> 1896     return mapping[engine](f, **self.options)
   1897 except Exception:
   1898     if self.handles is not None:

File ~/Env/pyenv/jupyter/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py:93, in CParserWrapper.__init__(self, src, **kwds)
     90 if kwds["dtype_backend"] == "pyarrow":
     91     # Fail here loudly instead of in cython after reading
     92     import_optional_dependency("pyarrow")
---> 93 self._reader = parsers.TextReader(src, **kwds)
     95 self.unnamed_cols = self._reader.unnamed_cols
     97 # error: Cannot determine type of 'names'

File parsers.pyx:574, in pandas._libs.parsers.TextReader.__cinit__()

File parsers.pyx:663, in pandas._libs.parsers.TextReader._get_header()

File parsers.pyx:874, in pandas._libs.parsers.TextReader._tokenize_rows()

File parsers.pyx:891, in pandas._libs.parsers.TextReader._check_tokenize_status()

File parsers.pyx:2053, in pandas._libs.parsers.raise_parser_error()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 7955: invalid start byte

Notice that we get the same UnicodeDecodeError we got when we tried to decode UTF-8 bytes as if they were ASCII! This tells us that this file isn't actually UTF-8. We don't know what encoding it actually is, though. One way to figure it out is to try and test a bunch of different character encodings and see if any of them work. A better way, though, is to use a detection module like charset_normalizer or chardet to try and automatically guess what the right encoding is. It's not 100% guaranteed to be right, but it's usually faster than just trying to guess.
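
If you wanted to brute-force it, a sketch like this would do. (Note that some encodings, like latin-1, accept any byte sequence, so "works" doesn't necessarily mean "correct".)

# a sketch: try a few common encodings on the file that failed above
# (nrows keeps the test fast, though a bad byte later in the file could slip through)
for encoding in ["utf-8", "Windows-1252", "latin-1"]:
    try:
        pd.read_csv("../00 datasets/kemical/kickstarter-projects/ks-projects-201612.csv",
                    encoding=encoding, nrows=1000)
        print(encoding, "works")
    except UnicodeDecodeError:
        print(encoding, "fails")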

I'm going to just look at the first ten thousand bytes of this file. This is usually enough for a good guess about what the encoding is, and it's much faster than trying to look at the whole file. (Especially with a large file, this can be very slow.) Another reason to just look at the first part of the file is that the error message tells us the first problem is at byte position 7955, so the trouble starts well within the first little bit of the file.

# chardet is an older detection library we can compare against charset_normalizer
import chardet

# look at the first ten thousand bytes to guess the character encoding
with open("../00 datasets/kemical/kickstarter-projects/ks-projects-201612.csv", 'rb') as rawdata:
    detect_bytes = rawdata.read(10000)
    # check what charset_normalizer thinks the encoding might be
    # (it no longer guesses this file correctly...)
    result_charset_normalizer = charset_normalizer.detect(detect_bytes)
    print(result_charset_normalizer)
    # chardet's guess for the same bytes
    result_chardet = chardet.detect(detect_bytes)
    print(result_chardet)
{'encoding': 'utf-8', 'language': 'English', 'confidence': 1.0}
{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}

So chardet is 73% confident that the right encoding is "Windows-1252" (charset_normalizer's utf-8 guess is the very encoding that just failed, so we can rule it out). Let's see if chardet is correct:

# read in the file with the encoding detected by charset_normalizer
kickstarter_2016 = pd.read_csv("../00 datasets/kemical/kickstarter-projects/ks-projects-201612.csv", encoding='Windows-1252')

# look at the first few lines
kickstarter_2016.head()
/tmp/ipykernel_5319/17202725.py:2: DtypeWarning: Columns (13,14,15) have mixed types. Specify dtype option on import or set low_memory=False.
  kickstarter_2016 = pd.read_csv("../00 datasets/kemical/kickstarter-projects/ks-projects-201612.csv", encoding='Windows-1252')
ID name category main_category currency deadline goal launched pledged state backers country usd pledged Unnamed: 13 Unnamed: 14 Unnamed: 15 Unnamed: 16
0 1000002330 The Songs of Adelaide & Abullah Poetry Publishing GBP 2015-10-09 11:36:00 1000 2015-08-11 12:12:28 0 failed 0 GB 0 NaN NaN NaN NaN
1 1000004038 Where is Hank? Narrative Film Film & Video USD 2013-02-26 00:20:50 45000 2013-01-12 00:20:50 220 failed 3 US 220 NaN NaN NaN NaN
2 1000007540 ToshiCapital Rekordz Needs Help to Complete Album Music Music USD 2012-04-16 04:24:11 5000 2012-03-17 03:24:11 1 failed 1 US 1 NaN NaN NaN NaN
3 1000011046 Community Film Project: The Art of Neighborhoo... Film & Video Film & Video USD 2015-08-29 01:00:00 19500 2015-07-04 08:35:03 1283 canceled 14 US 1283 NaN NaN NaN NaN
4 1000014025 Monarch Espresso Bar Restaurants Food USD 2016-04-01 13:38:27 50000 2016-02-26 13:38:27 52375 successful 224 US 52375 NaN NaN NaN NaN

Yep, looks like chardet was right! The file reads in with no problem (although we do get a warning about datatypes), and when we look at the first few rows it seems to be fine.

What if the encoding the detector guesses isn't right? Since chardet and charset_normalizer are basically just fancy guessers, sometimes they will guess the wrong encoding. One thing you can try is looking at more or less of the file and seeing if you get a different result, and then trying that.
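
For example, a sketch that hands the detector ten times as many bytes:

# a sketch: re-run detection with a larger sample if the first guess seems off
with open("../00 datasets/kemical/kickstarter-projects/ks-projects-201612.csv", 'rb') as rawdata:
    print(chardet.detect(rawdata.read(100000)))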

Saving your files with UTF-8 encoding

Finally, once you've gone through all the trouble of getting your file into UTF-8, you'll probably want to keep it that way. The easiest way to do that is to save your files with UTF-8 encoding. The good news is, since UTF-8 is the standard encoding in Python, when you save a file it will be saved as UTF-8 by default:

# save our file (will be saved as UTF-8 by default!)
kickstarter_2016.to_csv("ks-projects-201612-utf8.csv")
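
If you ever need to be explicit, or a downstream tool requires some other encoding, to_csv also accepts an encoding parameter:

# being explicit about the encoding (and skipping the row index)
kickstarter_2016.to_csv("ks-projects-201612-utf8.csv", encoding="utf-8", index=False)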

Pretty easy, huh? 🙂

Your turn!

Deepen your understanding with a dataset of fatal police shootings in the US.
