What is csvkit?
is a suit of command line tools
is developed by python by wireservice
offers data processing and cleaning capabilities on CSV files
has data compatibility that rivals Python, R and SQL.
csvkit installation
pip install csvkit
if it already exist and you would like to upgrade;
pip install --upgrade csvkit
Converting files to CSV using in2csv
in2csv is used to convert flat files like txt, xlsx, etc. to csv format. To get the documentation of in2csv:
in2csv --help
in2csv h
Syntax
in2csv spotifyData.xlsx > spotifyData.csv
This will convert the xlsx format to csv format. It is important to add the redirect function(>) to the command as in2csv spotifyData.xlsx will only convert the file and not parse it to a filename.
To convert a specify worksheet in a workbook in excel, you need to specify the sheet name. Use in2csv -n spotifyData.xlsx to print the sheet names in the workbook.
in2csv -n spotifyData.xlsx
Then, we use --sheet followed by the sheet name to be converted.
in2csv spotifyData.xlsx --sheet "Worksheetname" > spotifyData.csv
As this will only convert Worksheetname and redirect the conversion to spotifyData.csv. Please note that the result will not be printed in the console. To check if the file has been created: run ls
Preview data on the command line
csvlook: renders a CSV to the command line in a Markdown-compatible, fixed-width format.
Documentation
csvlook -h
Syntax
csvlook spotifyData.csv
prints the file on the command line.
csvstat: descriptive stats on csv files
csvstat: prints the descriptive summary statistics on all columns in CSV(e.g mean, median, unique value counts) which is similar to the describe method in pandas.
Documentation
csvstat -h
Recap
# Check to confirm name and location of the Excel data file
ls
# Convert sheet "Worksheet2_MusicAttributes" to CSV
in2csv SpotifyData.xlsx --sheet "Worksheet2_MusicAttributes" > Spotify_MusicAttributes.csv
# Check to confirm name and location of the new CSV file
ls
# Print preview of Spotify_MusicAttributes
csvlook Spotify_MusicAttributes.csv
Filtering data using csvkit
We can create a subset of the original dataset by:
filter the data by column
filter the data by row
csvcut: filters data using the column or position
Documentation
csvcut -h
Use --name or -n to list out all the column names and their positions.
csvcut -n Spotify_MusicAttribute.csv
prints all the column names and their position the file Spotify_MusicAttribute.csv
csvcut -c 1 Spotify_MusicAttribute.csv
Returns the first column in the data, by position
csvcut -c "track_id" Spotify_MusicAttribute.csv
Returns the first column in the data, by column name.
csvcut -c 2,3 Spotify_MusicAttribute.csv
Returns the second and third column in the data, by position. Note that there is no space between the numbers.
or we could grab the columns by names using:
csvcut -c "seconcolumn","third_column" Spotify_MusicAttribute.csv
csvgrep filter data by row value through exact match, pattern matching or even regex.
Filter data by row value
filters by row using exact match or regex fuzzy matching
must be paired with one of these options:
-m: followed by the exact row value to filter -r: followed by regex pattern -f: followed by the path to a file
Documentation
csvgrep -h
csvgrep -c "track_id" -m rowvalue Spotify_MusicAttribute.csv
or, we can use the column position
csvgrep -c 1 -m rowvalue Spotify_MusicAttribute.csv
Stacking data and chaining commands with csvkit
csvstack: stacks up the rows from two or more CSV files.
Documentation
csvstack -h
In order to stack rows, you need to ensure that the files have the same number of column , are arranged the same and they have the same data type.
Syntax
csvstack file1.csv file2.csv > file.csv
To keep record of what row belong to each dataset .
csvstack -g "Rank1","Rank2" file1.csv file2.csv > file.csv
We could rename the column from group by passing -n "source"
Chaining command-line commands
; links commands together and runs sequentially
csvlook SpotDataAll.csv; csvstat SpotDataAll.csv
&& links commands together, but only runs the second command if the first succeeds
csvlook SpotDataAll.csv && csvstat SpotDataAll.csv
> redirects the output from the first command to the second command
csvstack file1.csv file2.csv > file.csv
| used the output of the first command as an input in the second command
csvcut -c "track_id","danceability" Spotify_Popularity.csv | csvlook
prints the result of the first command well formatted.
Recap
# Convert the Spotify201809 tab into its own csv file
in2csv Spotify_201809_201810.xlsx --sheet "Spotify201809" > Spotify201809.csv
# Check to confirm name and location of data file
ls
# Preview file preview using a csvkit function
csvlook Spotify201809.csv
# Create a new csv with 2 columns: track_id and popularity
csvcut -c "track_id","popularity" Spotify201809.csv > Spotify201809_subset.csv
# While stacking the 2 files, create a data source column
csvstack -g "Sep2018","Oct2018" Spotify201809_subset.csv Spotify201810_subset.csv > Spotify_all_rankings.csv