How to handle errors and missing data in Python
Specifying data types
Use the dtype keyword argument of pd.read_csv() to specify column data types.
dtype takes a dictionary that maps column names to data types; columns not listed are inferred as usual.
import pandas as pd
# Create dict specifying data types for agi_stub and zipcode
data_types = {'agi_stub': 'category',
              'zipcode': 'str'}
# Load csv using dtype to set correct data types
data = pd.read_csv("vt_tax_data_2016.csv", dtype=data_types)
# Print data types of resulting frame
print(data.dtypes.head())
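To see why the dtype for zipcode matters, here is a minimal sketch that does not need the tax file. The in-memory CSV text and its values are made up for illustration; only the column names zipcode and agi_stub come from the example above. Without dtype, a ZIP code such as 05001 is read as the integer 5001 and loses its leading zero.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for the real tax file
csv_text = "zipcode,agi_stub,n1\n05001,1,100\n05001,2,250\n05602,1,75\n"

# Default parsing: pandas infers integers, so the leading zero is dropped
default = pd.read_csv(io.StringIO(csv_text))

# Explicit dtypes: zipcode stays text, agi_stub becomes a category
typed = pd.read_csv(io.StringIO(csv_text),
                    dtype={"agi_stub": "category", "zipcode": "str"})

print(default["zipcode"].iloc[0])  # integer value, leading zero lost
print(typed["zipcode"].iloc[0])    # string value, leading zero kept
print(typed.dtypes)
```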
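Customising missing data values
Use the na_values keyword argument to set custom missing values.
na_values accepts a single value, a list of values, or a dictionary mapping column names to values. By default these are treated as missing in addition to the markers pandas already recognises.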
# Create dict specifying that 0s in zipcode are NA values
null_values = {'zipcode': 0}
# Load csv using na_values keyword argument
data = pd.read_csv("vt_tax_data_2016.csv",
                   na_values=null_values)
# View rows with NA ZIP codes
print(data[data.zipcode.isna()])
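The dictionary form limits the custom missing value to the named column. The sketch below uses a small made-up CSV (the values are assumptions, not taken from the tax file) to show that a 0 in zipcode becomes NA while a 0 in another column is left alone.

```python
import io
import pandas as pd

# Hypothetical sample data: 0 appears in both zipcode and n1
csv_text = "zipcode,n1\n0,100\n5001,250\n0,75\n5602,0\n"

# Treat 0 as missing only in the zipcode column
data = pd.read_csv(io.StringIO(csv_text), na_values={"zipcode": 0})

# Count missing values per column
print(data.isna().sum())

# The 0 in n1 is still an ordinary number
print(data[data.zipcode.isna()])
```

Passing na_values=0 instead would apply the rule to every column, which is rarely what you want when 0 is a valid count elsewhere.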
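Lines with errors
Set error_bad_lines=False to skip unparseable records.
Set warn_bad_lines=True to see messages when records are skipped.
Both arguments were deprecated in pandas 1.3 and removed in pandas 2.0, where on_bad_lines ('error', 'warn', or 'skip') replaces them.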
# Note: error_bad_lines and warn_bad_lines only exist in pandas < 2.0
try:
    # Set warn_bad_lines to issue warnings about bad records
    data = pd.read_csv("vt_tax_data_2016_corrupt.csv",
                       error_bad_lines=False,
                       warn_bad_lines=True)
    # View first 5 records
    print(data.head())
except pd.errors.ParserError:
    print("Your data contained rows that could not be parsed.")
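On pandas 1.3 or later, the same idea can be sketched with the on_bad_lines argument. The CSV text below is made up for illustration: its second data row has one field too many, which is the kind of record the default settings refuse to parse.

```python
import io
import pandas as pd

# Hypothetical corrupt data: the second data row has an extra field
csv_text = "zipcode,n1\n05001,100\n05002,250,999\n05602,75\n"

# Default behaviour (on_bad_lines="error"): the bad row raises ParserError
try:
    pd.read_csv(io.StringIO(csv_text))
    raised = False
except pd.errors.ParserError:
    raised = True
    print("Your data contained rows that could not be parsed.")

# on_bad_lines="skip" drops the bad row silently and keeps the rest
skipped = pd.read_csv(io.StringIO(csv_text), on_bad_lines="skip")
print(skipped)
```

Using on_bad_lines="warn" instead of "skip" also drops the row but reports each skipped line, which corresponds to the old warn_bad_lines=True setting.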