How to Handle Errors and Missing Data in Python

Specifying Data Types

  • Use the dtype keyword argument of pd.read_csv to specify column data types.

  • dtype takes a dictionary of column names and data types.

import pandas as pd

# Create dict specifying data types for agi_stub and zipcode
data_types = {'agi_stub':'category',
              'zipcode':'str'}

# Load csv using dtype to set correct data types
data = pd.read_csv("vt_tax_data_2016.csv", dtype=data_types)

# Print data types of resulting frame
print(data.dtypes.head())
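
To see why setting dtype matters, here is a minimal sketch that reads a small in-memory CSV instead of the tax file. The sample rows and the N1 column are made up for illustration; only the dtype dictionary mirrors the example above.

```python
import io

import pandas as pd

# Hypothetical sample data; zipcode and agi_stub mirror the example above.
csv_text = "zipcode,agi_stub,N1\n05001,1,100\n05001,2,80\n00000,3,50\n"

# Without dtype, pandas infers integers and the leading zeros are lost.
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype, zipcode stays text and agi_stub becomes a categorical.
typed = pd.read_csv(io.StringIO(csv_text),
                    dtype={"agi_stub": "category", "zipcode": "str"})

print(inferred["zipcode"].tolist())
print(typed["zipcode"].tolist())
print(typed.dtypes)
```

Reading ZIP codes as strings keeps values such as 05001 intact, which an integer column cannot do.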

Customising Missing Data Values

  • Use the na_values keyword argument to set custom missing values.

  • You can pass a single value, list, or dictionary of column names and values.

# Create dict specifying that 0s in zipcode are NA values
null_values = {'zipcode':0}

# Load csv using na_values keyword argument
data = pd.read_csv("vt_tax_data_2016.csv", 
                   na_values=null_values)

# View rows with NA ZIP codes
print(data[data.zipcode.isna()])
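
The three accepted forms of na_values can be compared side by side. This is a minimal sketch on made-up in-memory data (the N1 and state columns and the "unknown" marker are assumptions for illustration), not the tax file itself.

```python
import io

import pandas as pd

# Hypothetical sample data for illustration.
csv_text = "zipcode,N1,state\n0,100,VT\n5001,0,unknown\n5602,250,VT\n"

# Single value: 0 is treated as missing in every column.
single = pd.read_csv(io.StringIO(csv_text), na_values=0)

# List: 0 and the string "unknown" are treated as missing in every column.
listed = pd.read_csv(io.StringIO(csv_text), na_values=[0, "unknown"])

# Dict: 0 is treated as missing only in the zipcode column.
per_column = pd.read_csv(io.StringIO(csv_text), na_values={"zipcode": 0})

# Count missing values per column for each variant.
print(single.isna().sum())
print(listed.isna().sum())
print(per_column.isna().sum())
```

The dictionary form is usually the safest choice when a value such as 0 is a placeholder in one column but a legitimate number in another.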

Lines with Errors

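  • Set error_bad_lines=False to skip unparseable records instead of raising an error.

  • Set warn_bad_lines=True to see messages when records are skipped.

  • In pandas 1.3 and later, both arguments are replaced by on_bad_lines, which accepts 'error', 'warn', or 'skip'. The old arguments were removed in pandas 2.0.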

try:
    # on_bad_lines="warn" skips bad records and issues a warning for each one
    # (pandas 1.3+; replaces error_bad_lines=False, warn_bad_lines=True)
    data = pd.read_csv("vt_tax_data_2016_corrupt.csv",
                       on_bad_lines="warn")

    # View first 5 records
    print(data.head())

except pd.errors.ParserError:
    print("Your data contained rows that could not be parsed.")
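
The effect of skipping bad records can be checked without the corrupt tax file. The sketch below uses made-up in-memory data whose third line has an extra field, and assumes pandas 1.3 or later, where on_bad_lines is available.

```python
import io

import pandas as pd

# Hypothetical corrupt data: the third line has one field too many.
csv_text = "zipcode,N1\n05001,100\n05602,250,999\n05401,75\n"

try:
    # With default settings the malformed row raises a ParserError.
    pd.read_csv(io.StringIO(csv_text))
    strict_failed = False
except pd.errors.ParserError:
    strict_failed = True

# on_bad_lines="skip" drops the malformed row without a message.
data = pd.read_csv(io.StringIO(csv_text),
                   on_bad_lines="skip",
                   dtype={"zipcode": "str"})

print(strict_failed)
print(data)
```

Skipping rows silently can hide data-quality problems, so "warn" is often the better choice while exploring a new file.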