How to import flat files

How to import a CSV file in Python

# Import pandas as pd
import pandas as pd

# Read the CSV and assign it to the variable data
data = pd.read_csv('vt_tax_data_2016.csv')

# View the first few lines of data
print(data.head())

How to import a TSV file in Python

# Import pandas with the alias pd, and pyplot (needed for plt.show()) as plt
import pandas as pd
import matplotlib.pyplot as plt

# Load TSV using the sep keyword argument to set delimiter
data = pd.read_csv('vt_tax_data_2016.tsv', sep='\t')

# Plot the total number of tax returns by income group
counts = data.groupby("agi_stub").N1.sum()
counts.plot.bar()
plt.show()
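
To see why the sep argument matters, here is a minimal sketch that reads the same tab-separated text with and without it. The sample text and its values are made up for illustration and loaded from memory with io.StringIO, so no file is needed.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample standing in for a real .tsv file
tsv_text = "zipcode\tagi_stub\tN1\n5001\t1\t1090\n5001\t2\t710\n"

# With the default comma delimiter, each whole line lands in a single column
wrong = pd.read_csv(io.StringIO(tsv_text))

# With sep='\t', pandas splits on tabs and recovers the three columns
right = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(wrong.shape)
print(right.shape)
```

A dataframe with one very wide column is usually a sign that the delimiter was not set correctly.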

Modifying flat file imports in Python

Limiting Columns

  • You can limit the columns you import using the usecols keyword argument.

  • usecols accepts a list of column names or a list of column numbers.

# Create list of columns to use
cols = ['zipcode', 'agi_stub', 'mars1', 'MARS2', 'NUMDEP']

# Create dataframe from csv using only selected columns
data = pd.read_csv("vt_tax_data_2016.csv", usecols=cols)

# View counts of dependents and tax returns by income level
print(data.groupby("agi_stub").sum())
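
The example above selects columns by name. As a small sketch of the other option, the snippet below selects the same columns by position. The sample data is hypothetical and read from memory, and positions are counted from 0.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the tax file
csv_text = "zipcode,agi_stub,N1,mars1\n5001,1,1090,700\n5001,2,710,400\n"

# Select columns by name
by_name = pd.read_csv(io.StringIO(csv_text), usecols=["zipcode", "N1"])

# Select the same columns by position (0 is the first column)
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[0, 2])

# Both approaches load the same data
print(by_name.equals(by_position))
```

Names are usually easier to read and survive a change in column order, while positions can help when a file has no header row.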

Limiting Rows

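  • Limit the number of rows you load using the nrows keyword argument.

# Create dataframe of the first 500 rows
vt_data_first500 = pd.read_csv("vt_tax_data_2016.csv", nrows=500)

# Create dataframe of the next 500 rows, without column labels.
# skiprows counts lines in the file, so skip the header line
# plus the 500 rows already loaded
vt_data_next500 = pd.read_csv("vt_tax_data_2016.csv",
                              nrows=500,
                              skiprows=501,
                              header=None)

# View the Vermont dataframes to confirm they're different
print(vt_data_first500.head())
print(vt_data_next500.head())

  • Use nrows and skiprows together to process a file in chunks.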

  • skiprows accepts a list of row numbers, a number of rows or a function to filter rows.

  • Set header=None when the skipped lines include the header row, so pandas does not treat the first row it reads as column names.
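
The three forms of skiprows listed above can be sketched as follows. The sample data is hypothetical and read from memory. Line numbers are 0-indexed, and line 0 is the header line.

```python
import io

import pandas as pd

# Hypothetical sample: one header line followed by six data rows
csv_text = "id,value\n0,a\n1,b\n2,c\n3,d\n4,e\n5,f\n"

# An integer skips that many lines from the top of the file.
# The header line is among them, so header=None is needed.
skip_count = pd.read_csv(io.StringIO(csv_text), skiprows=3, header=None)

# A list skips specific line numbers; here the first two data rows
skip_list = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 2])

# A function receives each line number and returns True to skip it;
# here the header is kept and every even-numbered line is dropped
skip_func = pd.read_csv(io.StringIO(csv_text),
                        skiprows=lambda i: i > 0 and i % 2 == 0)

print(skip_count)
print(skip_list)
print(skip_func)
```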

Assigning column names

  • Supply a list of column names to the names keyword argument.

  • The list must contain one name for every column in the data.

  • To rename only some columns, do it after the import.

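# Load the first 500 rows; their column names are reused below
vt_data_first500 = pd.read_csv("vt_tax_data_2016.csv", nrows=500)

# Create dataframe of next 500 rows with labeled columns
# (skip the header line plus the 500 rows already loaded)
vt_data_next500 = pd.read_csv("vt_tax_data_2016.csv",
                              nrows=500,
                              skiprows=501,
                              header=None,
                              names=list(vt_data_first500))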

# View the Vermont dataframes to confirm they're different
print(vt_data_first500.head())
print(vt_data_next500.head())
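
Renaming only some columns after the import, as suggested in the notes above, might look like the following sketch. It uses DataFrame.rename with a mapping from old names to new ones. The sample data and the new names income_group and returns are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the tax file
csv_text = "zipcode,agi_stub,N1\n5001,1,1090\n5001,2,710\n"
data = pd.read_csv(io.StringIO(csv_text))

# Rename two columns; columns not in the mapping keep their names
renamed = data.rename(columns={"agi_stub": "income_group", "N1": "returns"})

print(list(renamed.columns))
```

rename returns a new dataframe and leaves the original unchanged, so assign the result to a variable to keep it.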