Skip to content

csvkit sql on csv files

If you work in Technology, then like me you have all been there, large excel files with lots of data that need vlookups, cleansing, data removal etc. But if you are reading this blog then you are interesting in Linux, and in the command line. Although Excel is a great tool, its not my cup of tea. SQL is a much more powerful and faster tool to manage data. But do we want the trouble of adding our excel to a DB? No, we don’t want the hassle. But there are tools out there that can do SQL or SQL like commands against excel files. In this series I will take a look at some of these tools. Today we start with csvkit sql on csv files.

What is csvkit?

csvkit is a suite of command-line tools for converting to and working with CSV, the king of tabular file formats. It operates under the MIT licence, so is free to use and distribute under the same licence. It allows for converting excel files, splitting, joining, auditing and analysis’s of csv files. It also can dump a csv into a sqlite, postgres or mysql database. And the main topic, it allows you run sql command against csv files. So lets take a look at some of its commands and options by using some examples.

Set-Up

For this post I will be using a standard kali Linux based on Ubuntu 22.04. But any modern Linux distribution with Python3 and Python3-pip installed.

Install csvkit

To install csvkit we can use pip.

pip3 install csvkit

You needs some csv files to work with so download here or at my git hub page where I have converted them to excel.

csvkit Utilities

The csvkit package includes many utilities, csvlook, csvstat and csvsql. So lets explore some of these.

Lets convert our excel sheets to csv files.

in2csv ./100-Sales-Records.xlsx > 100-Sales-Records.csv  
in2csv ./country-id.xlsx >country-id.csv

csvlook

We can take a quick look at the data with csvlook. However when displaying data from sql it tends to wrap around and makes it hard to read. We can disable that wrap around just to display the data but be aware it truncates the data.

setterm -linewrap off
#To turn it off run
reset

Lets take a look.

csvlook 100-Sales-Records.csv
Truncated output!

csvcut

To list the column row names we use

csvcut -n ./100-Sales-Records.csv
Header Rows

csvcut can also display certain columns by there name using the “-c” option.

csvstat

We can also analysis the file details of each column.

csvstat ../100-Sales-Records.csv

SQL on CSV files

So now the part you have been waiting for. SQL on csv files. With csvkit there are two options. You can dump the entire csv file into a sqlite database and run your sql on on that. Or you can run the sql directly against the file.

First lets take a quick look at dumping it to a sqlite3 database file.

 csvsql --db sqlite:///database.db --insert joined.csv

This creates a file in the local directory called database.db which can by loaded by the sqlite3 client. \

sqlite3 ./database.db
Schema of csv file in sqlite

Using this method is best for large csv files. But you can also use sql directly on files if you only need a quick answer.

csvsql --query 'select Region, Country, Country_ID as ID  from joined;' ./joined.csv | csvlook 
sql on csv files

But remember that for large csv files you will be loading them directly into memory and could have a performance impact on your system.

Wrap-Up

Using csvkit is a great end to end solution for managing csv files and giving you the power of sql to use against them. It can do the full packet, convert excel to csv, join multiple files, etc. Highly recommend.

Any comments or suggestions please leave them below.

Published inLinuxSQL

Be First to Comment

Leave a Reply

Your email address will not be published. Required fields are marked *