How to Solve a File Encoding Error in Python

Shashank Shanu

a year ago

Solve File Encoding Error In Python
Example error: 'utf-8' codec can't decode byte 0x92 in position 4: invalid start byte
When you load a dataset, you will often run into an encoding error, which can be irritating because the error message is long and hard to interpret.
In this article I will explain why this error occurs and how to fix it in the simplest way in Python.
Let me walk through an example.
Let’s say I am trying to read a dataset and store it in a variable called df1.
import pandas as pd

df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv", sep=";")

df1.head()
But when I execute the above code, I get the error shown below.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte
So the main questions are why we get this error and how we can solve it.
When I dug deeper to understand where this error comes from, I found that the data is indeed not encoded as UTF-8; in this dataset, everything is ASCII except for a single 0x92 byte, as shown below:
b'Korea, Dem. People\x92s Rep.'
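To locate such a byte yourself, one option is to decode the raw bytes and inspect the exception. The following is a minimal sketch, not part of the original workflow: the `raw` sample and the helper name `find_bad_byte` are assumptions made for illustration, and only the standard library is used.

```python
# Hypothetical sample: one field of the file as raw bytes.
raw = b"Korea, Dem. People\x92s Rep.;.."


def find_bad_byte(data: bytes, encoding: str = "utf-8"):
    """Return (position, byte value, nearby bytes) for the first
    undecodable byte, or None if the data decodes cleanly."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the index of the offending byte in `data`.
        start = max(err.start - 10, 0)
        context = data[start:err.start + 10]
        return err.start, data[err.start], context
    return None


result = find_bad_byte(raw)
if result is not None:
    position, bad_byte, context = result
    print(f"byte 0x{bad_byte:02x} at position {position}: {context!r}")
```

The position and byte value reported here are the same pieces of information that appear in the UnicodeDecodeError message, so the surrounding bytes give a quick hint about which character was intended.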
To solve this error and load the dataset, we have to decode it as Windows codepage 1252 (cp1252) instead, where 0x92 is a fancy quote, ’ (the right single quotation mark). This can be done as shown below:
df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv", sep=";", encoding='cp1252')
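If you are not sure which encoding a file uses, a rough approach is to try a short list of candidates in order. This is only a sketch under assumptions: the helper name `detect_encoding` and the candidate list are made up for illustration, and a successful decode does not prove the guess is right, so the decoded text should still be checked by eye.

```python
def detect_encoding(data: bytes, candidates=("utf-8", "cp1252")):
    """Return the first candidate encoding that decodes `data`
    without raising, or None if every candidate fails."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Hypothetical sample containing the problematic 0x92 byte.
sample = b"Korea, Dem. People\x92s Rep."
encoding = detect_encoding(sample)
print(encoding)
print(sample.decode(encoding))
```

The returned name can then be passed as the `encoding` argument of pd.read_csv(). UTF-8 is tried first because it rejects most non-UTF-8 input, whereas cp1252 accepts almost any byte sequence.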
Here is the full example with the fix applied:
import pandas as pd
df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv", sep=";", encoding='cp1252')
df1.head()
Output:
                   2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  \
0     Afghanistan  55.1  55.5  55.9  56.2  56.6  57.0  57.4  57.8  58.2  58.6
1         Albania  74.3  74.7  75.2  75.5  75.8  76.1  76.3  76.5  76.7  76.8
2         Algeria  70.2  70.6  71.0  71.4  71.8  72.2  72.6  72.9  73.2  73.5
3  American Samoa    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..
4         Andorra    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..

   2010  2011  2012  2013  Unnamed: 15  2014  2015
0  59.0  59.3  59.7  60.0          NaN  60.4  60.7
1  77.0  77.2  77.4  77.6          NaN  77.8  78.0
2  73.8  74.1  74.3  74.6          NaN  74.8  75.0
3    ..    ..    ..    ..          NaN    ..    ..
4    ..    ..    ..    ..          NaN    ..    ..
However, I noticed that pandas seems to take the HTTP headers at face value too, and produces mojibake (garbled text) when you load the data from a URL. When I save the data directly to disk and then load it with pd.read_csv(), the data is decoded correctly, but loading from the URL produces re-coded data:
df1[' '][102]

Output:
'Korea, Dem. People’s Rep.'
So we have to re-encode the string as cp1252 and decode the resulting bytes as UTF-8, as shown below:
df1[' '][102].encode('cp1252').decode('utf8')
Output:
'Korea, Dem. People’s Rep.'
We can see that the garbled characters, including the trademark symbol, disappear once the string is re-encoded as cp1252 and decoded as UTF-8. This is a known bug in pandas. You can also work around it by using urllib.request to load the URL and passing the response to pd.read_csv(), as shown below:
import urllib.request
import pandas as pd
with urllib.request.urlopen("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv") as resp:
    df1 = pd.read_csv(resp, sep=";", encoding='cp1252')
df1[' '][102]
Output:
'Korea, Dem. People’s Rep.'
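If some strings have already been loaded in garbled form, the encode/decode round trip shown earlier can be wrapped in a small helper that leaves clean values untouched. This is a minimal sketch: the function name `repair_mojibake` and the sample values are assumptions for illustration, and it only handles the specific case of UTF-8 bytes that were wrongly decoded as cp1252.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as cp1252.
    Strings that cannot make the round trip are returned unchanged."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside cp1252, or the bytes
        # are not valid UTF-8, so it was probably not mojibake.
        return text


# Hypothetical values: one garbled, one plain ASCII, one already correct.
values = [
    "Korea, Dem. People\u00e2\u20ac\u2122s Rep.",
    "Afghanistan",
    "Korea, Dem. People\u2019s Rep.",
]
for value in values:
    print(repair_mojibake(value))
```

With pandas, a helper like this could be applied to a text column using Series.map, though it is usually better to fix the encoding at load time as described above.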
I hope you enjoyed reading this article and now know how to solve file encoding errors in Python.
For more blogs and courses on data science, machine learning, artificial intelligence and emerging technologies, do visit us at InsideAIML.
Thanks for reading…
Happy Programming…
