Python: Some Common Bugs (1)

1. pandas read_csv: ParserError: Error tokenizing data. C error: Expected 1 fields in line 29, saw 2

By default, CSV files are separated by commas. However, commas also appear frequently in Chinese text, so fields can easily be split in the wrong place when you work with crawled Chinese data. One workaround is to set the parameter sep='\t' when writing the CSV file with pandas, that is, to use TAB as the separator, since TAB rarely occurs in Chinese text. When you read the CSV file back, you must then pass the matching delimiter parameter:

df = pd.read_csv('path', delimiter="\t")

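Otherwise, read_csv falls back to the default comma separator. The first lines then parse as a single field each, and the first line that contains a comma triggers an error such as: ParserError: Error tokenizing data. C error: Expected 1 fields in line 29, saw 2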
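The round trip described above can be sketched as follows. This is only a minimal illustration: the sample rows and the in-memory buffer (used instead of a real file path) are assumptions made for the example, not part of the original data.

```python
import io

import pandas as pd

# Hypothetical sample data: the second row contains an ASCII comma.
df = pd.DataFrame({"id": [1, 2], "text": ["hello", "你好, 世界"]})

# Write with TAB as the separator (an in-memory buffer stands in for a file).
buffer = io.StringIO()
df.to_csv(buffer, sep="\t", index=False)

# Read back with the matching delimiter.
buffer.seek(0)
restored = pd.read_csv(buffer, delimiter="\t")
print(restored)

# Reading the same data with the default comma separator is expected to fail,
# because the row that contains a comma has more fields than the rows before it.
buffer.seek(0)
try:
    pd.read_csv(buffer)
except pd.errors.ParserError as exc:
    print("ParserError:", exc)
else:
    print("No ParserError was raised, but the columns are still wrong.")
```

Whichever separator you choose, the value passed when writing and the value passed when reading have to match.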


2. TypeError: 'list' object is not callable

What happens if you give a variable the name list, the same name as the built-in list function?

list = ['Bob', 'Tom', 'Lily', 'Jack']

tup_1 = (1, 2, 3, 4, 5)
tupToList = list(tup_1)

print(tupToList)

You’ll get an error like this:

Traceback (most recent call last):
  File "D:/python_workshop/python6/lesson3_list.py", line 6, in <module>
    tupToList = list(tup_1)
TypeError: 'list' object is not callable

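callable() is a built-in Python function that checks whether an object can be called. In the code above, the variable list has the same name as the built-in list function and therefore shadows it. When list(tup_1) runs, Python finds the variable, which is an ordinary list object rather than a function. A list object is not callable, so a TypeError is raised. The fix is to give the variable a different name, for example names.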
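A minimal sketch of the problem and the fix is shown below. The shadowing is confined to a helper function so that the rest of the script keeps the built-in; the names convert_with_shadowed_name, names and numbers are hypothetical and chosen only for this illustration.

```python
numbers = (1, 2, 3, 4, 5)


def convert_with_shadowed_name(values):
    # Inside this function, `list` refers to the local variable, not the built-in.
    list = ['Bob', 'Tom', 'Lily', 'Jack']
    try:
        return list(values)
    except TypeError as exc:
        return str(exc)


error_message = convert_with_shadowed_name(numbers)
print(error_message)

# Fix: choose a name that does not collide with a built-in.
names = ['Bob', 'Tom', 'Lily', 'Jack']
converted = list(numbers)
print(converted)

# callable() shows the difference between the two objects.
print(callable(names), callable(list))
```

The same problem appears with other built-in names such as str, dict, sum or max, so it is safest to avoid using them as variable names.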


3. UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in position 4813: illegal multibyte sequence

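When you work with files in GBK format, the above error is sometimes reported. It is raised while encoding, that is, when text is converted to GBK and contains a character that GBK cannot represent, such as the non-breaking space \xa0.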

Here, we need to understand the relationship between GB2312, GBK and GB18030:
GB2312: 6763 Chinese characters
GBK: 21003 Chinese characters
GB18030-2000: 27533 Chinese characters
GB18030-2005: 70244 Chinese characters
Therefore, GBK is a superset of GB2312, and GB18030 is a superset of GBK.

To solve this bug, try using GB18030 instead of GBK as the encoding:

encoding="gb18030"
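The difference between the two codecs can be checked with a short sketch like the one below. The sample string is an assumption made for the illustration; it contains the non-breaking space \xa0 from the error message.

```python
# Hypothetical sample text containing U+00A0 (non-breaking space).
text = "价格:\xa0100"

# Encoding as GBK fails because GBK has no mapping for \xa0.
try:
    text.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError as exc:
    gbk_ok = False
    print("gbk failed:", exc)

# GB18030 can represent the character, so the round trip succeeds.
encoded = text.encode("gb18030")
round_trip = encoded.decode("gb18030")
print(round_trip == text)

# For characters that GBK does support, both codecs produce the same bytes.
same_bytes = "中文".encode("gbk") == "中文".encode("gb18030")
print(same_bytes)
```

The same encoding name can be passed to open() or codecs.open() when reading or writing files.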

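If you want to convert UTF-8 files to GBK-compatible files, try the following code. Note that it writes the output as GB18030 and silently drops any character that still cannot be encoded: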

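import codecs

def ReadFile(filePath, encoding="utf-8"):
    # Read the whole file and decode it into a Unicode string.
    with codecs.open(filePath, "r", encoding) as f:
        return f.read()

def WriteFile(filePath, u, encoding="gbk"):
    # Open in binary mode and encode manually; unencodable characters are dropped.
    with codecs.open(filePath, "wb") as f:
        f.write(u.encode(encoding, errors="ignore"))

def UTF8_2_GBK(src, dst):
    # GB18030 is a superset of GBK, so characters such as \xa0 are preserved.
    content = ReadFile(src, encoding="utf-8")
    WriteFile(dst, content, encoding="gb18030")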