How to import a CSV file in Python
# Import pandas as pd
import pandas as pd
# Read the CSV and assign it to the variable data
data = pd.read_csv('vt_tax_data_2016.csv')
# View the first few lines of data
print(data.head())
How to import a TSV file in Python
# Import pandas with the alias pd and matplotlib's pyplot with the alias plt
import pandas as pd
import matplotlib.pyplot as plt
# Load TSV using the sep keyword argument to set delimiter
data = pd.read_csv('vt_tax_data_2016.tsv', sep='\t')
# Plot the total number of tax returns by income group
counts = data.groupby("agi_stub").N1.sum()
counts.plot.bar()
plt.show()
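To try the sep keyword without the tax data file, here is a minimal sketch that reads a small TSV from memory. The sample values and the names tsv_text, tsv_data and tsv_counts are made up for illustration; only the column names agi_stub and N1 come from the example above.

```python
import io
import pandas as pd

# Hypothetical in-memory TSV standing in for a file on disk
tsv_text = "zipcode\tagi_stub\tN1\n5001\t1\t100\n5001\t2\t50\n5002\t1\t30\n"

# sep='\t' tells read_csv that fields are separated by tabs
tsv_data = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Total N1 per income group, the same aggregation as in the example above
tsv_counts = tsv_data.groupby("agi_stub").N1.sum()
print(tsv_counts)
```

Without sep='\t', read_csv would assume commas and load each line as a single column.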
Modifying flat file imports in Python
Limiting Columns
You can limit the columns you import with the usecols keyword argument.
usecols accepts a list of column names, a list of column numbers, or a function that filters column names.
# Create list of columns to use
cols = ['zipcode', 'agi_stub', 'mars1', 'MARS2', 'NUMDEP']
# Create dataframe from csv using only selected columns
data = pd.read_csv("vt_tax_data_2016.csv", usecols=cols)
# View counts of dependents and tax returns by income level
print(data.groupby("agi_stub").sum())
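The example above selects columns by name. As a rough sketch of the other options, the snippet below selects by column number and by a filter function. The in-memory CSV and the names csv_text, by_position and by_function are assumptions for illustration, not part of the tax data file.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk
csv_text = "zipcode,agi_stub,mars1,NUMDEP\n5001,1,10,4\n5001,2,20,6\n"

# Select columns by position (0-based): the first and fourth columns
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[0, 3])

# Select columns with a function that is called on each column name;
# here only all-lowercase names are kept
by_function = pd.read_csv(io.StringIO(csv_text),
                          usecols=lambda name: name.islower())

print(list(by_position.columns))
print(list(by_function.columns))
```

Selecting by position is handy when a file has no header, while a function is useful when the wanted columns follow a naming pattern.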
Limiting Rows
Limit the number of rows you load with the nrows keyword argument.
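# Create dataframe of the first 500 rows
vt_data_first500 = pd.read_csv("vt_tax_data_2016.csv", nrows=500)
# Create dataframe of the next 500 rows (no column labels yet)
# skiprows=501 skips the header line plus the 500 rows already loaded
vt_data_next500 = pd.read_csv("vt_tax_data_2016.csv",
                              nrows=500,
                              skiprows=501,
                              header=None)
# View the Vermont dataframes to confirm they're different
print(vt_data_first500.head())
print(vt_data_next500.head())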
Use nrows and skiprows together to process a file in chunks.
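skiprows accepts a number of rows to skip from the top of the file (the header line counts as one), a list of row numbers, or a function to filter rows.
Set header=None when the skipped rows include the header line, so pandas knows the remaining data has no column names.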
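The three forms of skiprows can be sketched as follows. The in-memory CSV and the names csv_text, skip_count, skip_list and skip_func are hypothetical and used only for illustration.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV: one header line and five data rows
csv_text = "id,value\n1,a\n2,b\n3,c\n4,d\n5,e\n"

# Integer: skip the first 3 lines of the file (header line plus two data rows),
# so header=None is needed because the column names are gone
skip_count = pd.read_csv(io.StringIO(csv_text), skiprows=3, header=None)

# List: skip specific line numbers (0-based; line 0 is the header and is kept)
skip_list = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 3])

# Function: skip every line number for which the function returns True
skip_func = pd.read_csv(io.StringIO(csv_text),
                        skiprows=lambda i: i > 0 and i % 2 == 0)

print(skip_count)
print(skip_list)
print(skip_func)
```

Note that the list and function forms count line 0 as the header, so they can keep the column names while dropping data rows.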
Assigning Column Names
Supply the list of column names to the names keyword argument.
The list must contain exactly one name for every column in the data.
If you only want to rename some columns, it should be done after the import.
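Renaming a few columns after the import could look like the sketch below, which uses DataFrame.rename with a mapping. The sample data and the new names income_group and num_returns are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk
csv_text = "zipcode,agi_stub,N1\n5001,1,100\n5002,2,50\n"
sample = pd.read_csv(io.StringIO(csv_text))

# Rename only the columns listed in the mapping; the others keep their names
renamed = sample.rename(columns={"agi_stub": "income_group",
                                 "N1": "num_returns"})
print(list(renamed.columns))
```

rename returns a new dataframe and leaves the original unchanged, so the result has to be assigned to a variable.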
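# Create dataframe of next 500 rows with labeled columns
# vt_data_first500 is the dataframe of the first 500 rows (loaded with nrows=500)
# skiprows=501 skips the header line plus those first 500 rows
vt_data_next500 = pd.read_csv("vt_tax_data_2016.csv",
                              nrows=500,
                              skiprows=501,
                              header=None,
                              names=list(vt_data_first500))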
# View the Vermont dataframes to confirm they're different
print(vt_data_first500.head())
print(vt_data_next500.head())
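Combining nrows, skiprows, header=None and names generalizes to a loop that walks through a whole file in chunks. The following is only a sketch under assumptions: the in-memory CSV, the chunk size of 3 and the names csv_text, chunk_size, col_names, chunks and rows_read are hypothetical.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV with 7 data rows, standing in for a file on disk
csv_text = "id,value\n" + "".join(f"{i},{i * 10}\n" for i in range(1, 8))
chunk_size = 3

# Read only the header line so every chunk can reuse the column names
col_names = list(pd.read_csv(io.StringIO(csv_text), nrows=0))

chunks = []
rows_read = 0
while True:
    try:
        chunk = pd.read_csv(
            io.StringIO(csv_text),
            nrows=chunk_size,
            skiprows=1 + rows_read,  # header line plus data rows already read
            header=None,
            names=col_names,
        )
    except pd.errors.EmptyDataError:
        break  # nothing left to parse
    if chunk.empty:
        break
    chunks.append(chunk)
    rows_read += len(chunk)
    if len(chunk) < chunk_size:
        break  # a short chunk means the end of the data was reached

print([len(c) for c in chunks])
```

Each pass re-opens the source and skips everything already read, so the chunks do not overlap and together cover every data row once.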