UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte

json python-2.7 utf-8 ascii python-unicode

In my case (macOS), there was a .DS_Store file in my data folder. It is a hidden, auto-generated file, and it caused the issue. Removing it fixed the problem.
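If the script loops over every file in a data folder, a minimal sketch like the one below skips hidden files such as .DS_Store before anything tries to decode them. The helper name `iter_data_files` is hypothetical and only there for illustration; it is not part of the original code.

```python
import os


def iter_data_files(folder):
    """Yield paths of regular, non-hidden files in `folder` (hypothetical helper)."""
    for name in sorted(os.listdir(folder)):
        # Hidden files such as .DS_Store start with a dot; skip them.
        if name.startswith('.'):
            continue
        path = os.path.join(folder, name)
        # Ignore sub-directories and anything else that is not a regular file.
        if os.path.isfile(path):
            yield path
```

Deleting the file also works, but Finder may recreate it, so filtering in the loop is likely to be the more durable fix.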


It doesn't help that you have sys.setdefaultencoding('utf-8'), which confuses things further. It's a nasty hack and you need to remove it from your code. See https://stackoverflow.com/a/34378962/1554386 for more information.

The error happens because line is a byte string and you're calling encode() on it. encode() only makes sense on a Unicode string, so Python first tries to convert the byte string to Unicode using the default encoding, which in your case is UTF-8 but should be ASCII. Either way, 0x80 is not valid ASCII and not a valid UTF-8 start byte, so the conversion fails.

0x80 is valid in some character sets. In windows-1252/cp1252 it's €.
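As a rough illustration of the point above, the sketch below decodes the same bytes with three codecs. It makes explicit the decode step that Python 2 performs implicitly when encode() is called on a byte string. The sample bytes and variable names are made up for the example.

```python
raw = b'abc\x80'  # sample bytes; 0x80 is the problematic byte

failed = []
for codec in ('ascii', 'utf-8'):
    try:
        raw.decode(codec)
    except UnicodeDecodeError as exc:
        # Both codecs reject 0x80, each for its own reason.
        failed.append(codec)
        print('%s failed: %s' % (codec, exc.reason))

# cp1252 (windows-1252) maps 0x80 to the euro sign, U+20AC.
euro = raw.decode('cp1252')
print(euro == u'abc\u20ac')
```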

The trick here is to understand the encoding of your data all the way through your code. At the moment, you're leaving too much up to chance. Unicode string types are a handy Python feature that lets you decode encoded strings once and then forget about the encoding until you need to write or transmit the data.

Use the io module to open the file in text mode and decode the file as it is read. No more .decode()! You need to make sure the encoding of your incoming data is consistent. You can either re-encode it externally or change the encoding in your script. Here I've set the encoding to windows-1252.

    import io
    import json

    with io.open(file_name, 'r', encoding='windows-1252') as twitter_file:
        for line in twitter_file:
            # line is now a <type 'unicode'>
            tweet = json.loads(line)

The io module also provides universal newlines. This means \r\n is detected as a newline, so you don't have to watch for it.
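To try this out without real Twitter data, here is a self-contained sketch. It writes a small windows-1252 file with \r\n line endings to a temporary location and reads it back with io.open. The sample content, the "tweet" key, and the temporary file are assumptions made for the example.

```python
import io
import json
import os
import tempfile

# Two JSON lines containing a euro sign, with Windows-style line endings.
sample = u'{"tweet": "costs 5\u20ac"}\r\n{"tweet": "hello"}\r\n'
fd, file_name = tempfile.mkstemp(suffix='.json')
with os.fdopen(fd, 'wb') as raw_file:
    # Encoded as windows-1252, the euro sign becomes the single byte 0x80.
    raw_file.write(sample.encode('windows-1252'))

tweets = []
saw_crlf = False
with io.open(file_name, 'r', encoding='windows-1252') as twitter_file:
    for line in twitter_file:
        # line is decoded text; universal newlines turn \r\n into \n.
        saw_crlf = saw_crlf or line.endswith(u'\r\n')
        tweets.append(json.loads(line)['tweet'])

os.remove(file_name)
```

Opening the same file with encoding='utf-8' would raise the UnicodeDecodeError from the question, because 0x80 is not a valid UTF-8 start byte.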


The error occurs when you try to read a tweet containing text like "@Mike http:\www.google.com \A8&^)((&() how are&^%()( you ", which cannot be read as an ordinary string. You are supposed to read it as a raw string, but converting it to a raw string still gives an error, so I suggest reading the JSON file like this:

    import codecs
    import json

    keys = []      # tweet ids
    fulldata = []  # tweet texts
    with codecs.open('tweetfile', 'rU', 'utf-8') as f:
        for line in f:
            data = json.loads(line)
            print data["tweet"]
            keys.append(data["id"])
            fulldata.append(data["tweet"])

This loads the data from the JSON file.

You can also write it to a CSV file using pandas.

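    import pandas as pd

    output = pd.DataFrame(data={"tweet": fulldata, "id": keys})
    # quoting=1 is csv.QUOTE_ALL; on Python 2 to_csv defaults to ASCII,
    # so set the encoding explicitly for non-ASCII tweets
    output.to_csv("tweets.csv", index=False, quoting=1, encoding="utf-8")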

Then read from the CSV file to avoid the encoding and decoding problem.

Hope this helps you solve your problem.

Midhun

CodeHunter
