
pickle.PicklingError: args[0] from __newobj__ args has the wrong class with hadoop python


It's to do with shipping the stopwords module to the executors. As a workaround, import the stopwords library within the function itself; please see the similar issue linked below. I had the same issue and this workaround fixed the problem.

    def stopwords_delete(word_list):
        from nltk.corpus import stopwords
        filtered_words = []
        print(word_list)

Similar Issue

I would recommend from pyspark.ml.feature import StopWordsRemover as a permanent fix.


Probably it's just because you are building stopwords.words('english') every time on the executor. Define it once outside the function and this would work.
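A sketch of that idea: build the stop word set once on the driver and let the function close over it, instead of calling stopwords.words('english') per record. The small fallback set is only there so the example runs even when the nltk corpus is not downloaded.

```python
try:
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words('english'))  # built once, on the driver
except (ImportError, LookupError):
    # Hypothetical fallback set, purely for illustration.
    stop_words = {'the', 'a', 'an', 'is', 'in'}

def stopwords_delete(word_list):
    # stop_words comes from the closure; it is not rebuilt per call,
    # and a plain set pickles cleanly when Spark ships the closure.
    return [w for w in word_list if w not in stop_words]
```

A plain Python set serializes without trouble, unlike nltk's lazy corpus loader, which is what triggers the PicklingError.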


You are using map over an RDD which has only one row, with each word as a column. So the entire row of the RDD is passed to the stopwords_delete function, and the for loop within it tries to match the row itself against the stopwords, which fails. Try like this,

    filtered_words = stopwords_delete(wordlist.flatMap(lambda x: x).collect())
    print(filtered_words)

I got this output as filtered_words,

["shan't", "she'd", 'fuck', 'world', "who's"]

Also, include a return in your function.
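To illustrate why the return matters: without it, Python functions return None, so the caller would get nothing back. A minimal sketch with a hypothetical stop set standing in for nltk's list:

```python
# Hypothetical stop set, just to keep the example self-contained.
STOP = {'the', 'a', 'an'}

def stopwords_delete(word_list):
    filtered_words = [w for w in word_list if w not in STOP]
    return filtered_words  # without this line, the caller gets None

print(stopwords_delete(['the', 'world']))  # → ['world']
```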

Alternatively, you could use a list comprehension to replace the stopwords_delete function:

    filtered_words = wordlist.flatMap(lambda x: [i for i in x if i not in stopwords.words('english')]).collect()