Getting Started with csvkit

Getting Started with csvkit

What is csvkit?

  • is a suit of command line tools

  • is developed by python by wireservice

  • offers data processing and cleaning capabilities on CSV files

  • has data compatibility that rivals Python, R and SQL.

csvkit installation

pip install csvkit

if it already exist and you would like to upgrade;

pip install --upgrade csvkit

Converting files to CSV using in2csv

in2csv is used to convert flat files like txt, xlsx, etc. to csv format. To get the documentation of in2csv:

in2csv --help
in2csv h

Syntax

in2csv spotifyData.xlsx > spotifyData.csv

This will convert the xlsx format to csv format. It is important to add the redirect function(>) to the command as in2csv spotifyData.xlsx will only convert the file and not parse it to a filename.

To convert a specify worksheet in a workbook in excel, you need to specify the sheet name. Use in2csv -n spotifyData.xlsx to print the sheet names in the workbook.

in2csv -n spotifyData.xlsx

Then, we use --sheet followed by the sheet name to be converted.

in2csv spotifyData.xlsx --sheet "Worksheetname" > spotifyData.csv

As this will only convert Worksheetname and redirect the conversion to spotifyData.csv. Please note that the result will not be printed in the console. To check if the file has been created: run ls

Preview data on the command line

csvlook: renders a CSV to the command line in a Markdown-compatible, fixed-width format.

Documentation

csvlook -h

Syntax

csvlook spotifyData.csv

prints the file on the command line.

csvstat: descriptive stats on csv files

csvstat: prints the descriptive summary statistics on all columns in CSV(e.g mean, median, unique value counts) which is similar to the describe method in pandas.

Documentation

csvstat -h

Recap

# Check to confirm name and location of the Excel data file
ls

# Convert sheet "Worksheet2_MusicAttributes" to CSV
in2csv SpotifyData.xlsx --sheet "Worksheet2_MusicAttributes" > Spotify_MusicAttributes.csv

# Check to confirm name and location of the new CSV file
ls

# Print preview of Spotify_MusicAttributes
csvlook Spotify_MusicAttributes.csv

Filtering data using csvkit

We can create a subset of the original dataset by:

  • filter the data by column

  • filter the data by row

csvcut: filters data using the column or position

Documentation

csvcut -h

Use --name or -n to list out all the column names and their positions.

csvcut -n Spotify_MusicAttribute.csv

prints all the column names and their position the file Spotify_MusicAttribute.csv

csvcut -c 1 Spotify_MusicAttribute.csv

Returns the first column in the data, by position

csvcut -c "track_id" Spotify_MusicAttribute.csv

Returns the first column in the data, by column name.

csvcut -c 2,3 Spotify_MusicAttribute.csv

Returns the second and third column in the data, by position. Note that there is no space between the numbers.

or we could grab the columns by names using:

csvcut -c "seconcolumn","third_column" Spotify_MusicAttribute.csv

csvgrep filter data by row value through exact match, pattern matching or even regex.

Filter data by row value

  • filters by row using exact match or regex fuzzy matching

  • must be paired with one of these options:

-m: followed by the exact row value to filter -r: followed by regex pattern -f: followed by the path to a file

Documentation

csvgrep -h
csvgrep -c "track_id" -m rowvalue Spotify_MusicAttribute.csv

or, we can use the column position

csvgrep -c 1 -m rowvalue Spotify_MusicAttribute.csv

Stacking data and chaining commands with csvkit

csvstack: stacks up the rows from two or more CSV files.

Documentation

csvstack -h

In order to stack rows, you need to ensure that the files have the same number of column , are arranged the same and they have the same data type.

Syntax

csvstack file1.csv file2.csv > file.csv

To keep record of what row belong to each dataset .

csvstack -g "Rank1","Rank2" file1.csv file2.csv > file.csv

Screenshot 2022-08-07 at 14.26.30.png

We could rename the column from group by passing -n "source"

Screenshot 2022-08-07 at 14.28.20.png

Chaining command-line commands

; links commands together and runs sequentially

csvlook SpotDataAll.csv; csvstat SpotDataAll.csv

&& links commands together, but only runs the second command if the first succeeds

csvlook SpotDataAll.csv && csvstat SpotDataAll.csv

> redirects the output from the first command to the second command

csvstack file1.csv file2.csv > file.csv

| used the output of the first command as an input in the second command

csvcut -c "track_id","danceability" Spotify_Popularity.csv | csvlook

prints the result of the first command well formatted.

Recap

# Convert the Spotify201809 tab into its own csv file 
in2csv Spotify_201809_201810.xlsx --sheet "Spotify201809" > Spotify201809.csv

# Check to confirm name and location of data file
ls

# Preview file preview using a csvkit function
csvlook Spotify201809.csv

# Create a new csv with 2 columns: track_id and popularity
csvcut -c "track_id","popularity" Spotify201809.csv > Spotify201809_subset.csv

# While stacking the 2 files, create a data source column
csvstack -g "Sep2018","Oct2018" Spotify201809_subset.csv Spotify201810_subset.csv > Spotify_all_rankings.csv