If you work in technology then, like me, you have been there: large Excel files full of data that need VLOOKUPs, cleansing, row removal, and so on. But if you are reading this blog, you are interested in Linux and in the command line. Excel is a great tool, but it's not my cup of tea; SQL is a much more powerful and faster way to manage data. Do we want the trouble of loading our Excel data into a database, though? No, we don't want the hassle. Fortunately, there are tools out there that can run SQL (or SQL-like) commands against Excel and CSV files. In this series I will take a look at some of these tools, starting today with csvkit and SQL on CSV files.
What is csvkit?
csvkit is a suite of command-line tools for converting to and working with CSV, the king of tabular file formats. It is released under the MIT licence, so it is free to use and distribute. It can convert Excel files, and split, join, audit, and analyze CSV files. It can also dump a CSV into a SQLite, PostgreSQL, or MySQL database. And, the main topic of this post, it lets you run SQL commands against CSV files. So let's take a look at some of its commands and options through some examples.
Set-Up
For this post I will be using a standard Kali Linux install, but any modern Linux distribution with Python 3 and python3-pip installed will work.
Install csvkit
To install csvkit we can use pip.
pip3 install csvkit
You'll need some CSV files to work with, so download them here or from my GitHub page, where I have converted them to Excel.
csvkit Utilities
The csvkit package includes many utilities, including csvlook, csvstat, and csvsql, so let's explore some of these.
First, let's convert our Excel sheets to CSV files.
in2csv ./100-Sales-Records.xlsx > 100-Sales-Records.csv
in2csv ./country-id.xlsx > country-id.csv
csvlook
We can take a quick look at the data with csvlook. However, wide output tends to wrap around the terminal and becomes hard to read. We can disable line wrapping while displaying the data, but be aware that this truncates anything wider than the screen.
setterm -linewrap off
# To turn line wrapping back on afterwards, run
reset
Let's take a look.
csvlook 100-Sales-Records.csv
csvcut
To list the column names, we use:
csvcut -n ./100-Sales-Records.csv
csvcut can also display only certain columns, selected by name with the -c option:
csvcut -c Region,Country ./100-Sales-Records.csv | csvlook
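Under the hood, selecting columns by name is simple header-based filtering. As a rough illustration of the idea (not csvkit's actual code, and using a made-up two-row sample in place of 100-Sales-Records.csv), here it is in stdlib Python:

```python
import csv
import io

# Hypothetical sample standing in for 100-Sales-Records.csv.
raw = """Region,Country,Item Type,Units Sold
Europe,France,Clothes,1779
Asia,Japan,Cereal,4551
"""

def cut_columns(text, names):
    """Roughly what `csvcut -c` does: keep only the named columns, in order."""
    reader = csv.DictReader(io.StringIO(text))
    return [[row[name] for name in names] for row in reader]

print(cut_columns(raw, ["Country", "Units Sold"]))
# [['France', '1779'], ['Japan', '4551']]
```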
csvstat
We can also analyze the file and get details about each column.
csvstat ./100-Sales-Records.csv
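csvstat reports per-column summaries such as the count, minimum, maximum, and mean. A minimal sketch of that idea with Python's stdlib (using a hypothetical three-row sample in place of the real file; the real tool reports far more, such as types, nulls, and frequent values):

```python
import csv
import io
import statistics

# Hypothetical sample standing in for 100-Sales-Records.csv.
raw = """Country,Units Sold
France,1779
Japan,4551
Chad,3069
"""

def column_stats(text, column):
    """A rough sketch of a few numbers csvstat reports for a numeric column."""
    values = [float(row[column]) for row in csv.DictReader(io.StringIO(text))]
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
    }

print(column_stats(raw, "Units Sold"))
# {'count': 3, 'min': 1779.0, 'max': 4551.0, 'mean': 3133.0}
```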
SQL on CSV files
So now, the part you have been waiting for: SQL on CSV files. With csvkit there are two options. You can dump the entire CSV file into a SQLite database and run your SQL on that, or you can run the SQL directly against the file.
First, let's take a quick look at dumping it to a SQLite database file.
csvsql --db sqlite:///database.db --insert joined.csv
This creates a file in the local directory called database.db, which can be loaded by the sqlite3 client.
sqlite3 ./database.db
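Conceptually, csvsql --insert reads the CSV header, creates a matching table, and inserts each row. Here is a simplified stdlib Python sketch of that process, under some stated assumptions: it uses an in-memory database rather than database.db, a made-up sample in place of joined.csv, and stores everything as TEXT (the real csvsql infers column types):

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for joined.csv.
raw = """Region,Country,Country_ID
Europe,France,1
Asia,Japan,2
"""

rows = list(csv.DictReader(io.StringIO(raw)))
headers = list(rows[0].keys())

# Use "database.db" instead of ":memory:" to get a file on disk.
conn = sqlite3.connect(":memory:")
# Create a table whose columns match the CSV header (all TEXT for brevity).
conn.execute(f"CREATE TABLE joined ({', '.join(h + ' TEXT' for h in headers)})")
# Insert every CSV row with placeholder parameters.
conn.executemany(
    f"INSERT INTO joined VALUES ({', '.join('?' for _ in headers)})",
    [tuple(r.values()) for r in rows],
)

print(conn.execute("SELECT Country FROM joined WHERE Region = 'Asia'").fetchall())
# [('Japan',)]
```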
This method is best for large CSV files. But you can also run SQL directly against a file if you only need a quick answer.
csvsql --query 'select Region, Country, Country_ID as ID from joined;' ./joined.csv | csvlook
But remember that this loads the entire CSV file into memory, which for large files can have a performance impact on your system.
Wrap-Up
Using csvkit is a great end-to-end solution for managing CSV files, giving you the power of SQL to use against them. It covers the whole package: converting Excel to CSV, joining multiple files, and more. Highly recommended.
If you have any comments or suggestions, please leave them below.