All IT professionals and incident managers have to deal with log files from various sources almost daily. Learn how to work faster and more efficiently to get the most out of CSV files with csvkit on Linux.
CSV files are often imported into Excel or LibreOffice software before being used and analyzed. It is very practical and comfortable, as long as the files are not too big. But some log files can contain billions of rows, making it impossible to import them into spreadsheets. Or you may need to scan files remotely on headless servers without any ability to use a graphical user interface.
Fortunately, a simple solution is available on all Linux operating systems: the csvkit software.
How to install csvkit
Since the tool is available in standard repositories, it is extremely easy to install. In this article, we will be using an operating system based on Ubuntu.
Let’s run the installation in a command line shell by running:
sudo apt install csvkit
That’s it. The system now installs the tool and all necessary dependencies.
How to work on a CSV file
To illustrate our point, we are going to work on a SimpleMaps.com CSV file containing a list of cities and information about them: country, longitude, latitude, population, etc.
The first line of the CSV file shows the different column names, as is often the case with CSV files. This can be seen with the “head” command, which displays by default the first 10 lines of a file (Figure A).
How to understand the columns of the file
Now let’s start using csvcut from the command line, one of the tools built into the csvkit. Running the following command will automatically display named columns and indices (Number B):
We could then use the indexes or column names to remedy this.
How to display the selected columns
One of the most common operations when processing CSV files is to select a few columns or rearrange the columns.
To display only a few columns, let’s again use the csvcut command with the -c option. Both command lines work, to show how to use both the subscripts or the column name. In our example, we’ll use the head command again with a pipe, just to display the first few rows of the results (Figure C).
csvcut -c 1,5,10
csvcut -c city,country,population
If we want to add row numbers to the output, the -l option comes to the rescue and adds a new column named line_number to our output (Number D).
The output can of course be redirected to a new file. To do this, we redirect the output to a file using the> character. From our previous example:
csvcut -l -c city,country,population worldcities.csv > newfile.csv
How to change the order of the columns
Using csvcut we can also create output that rearranges the columns. All we need is to specify the columns, and the tool will display it accordingly (Figure E).
How to sort data with csvsort
You can sort the data using the csvsort command. Similar to csvcut, csvsort allows the use of the -n option to list columns and -c to use either the column index or the column name.
By default, csvsort works in ascending mode, but it is possible to use the -r option to sort in descending mode.
Let’s sort our file by country name, in descending order (Figure F):
csvsort -r -c country worldwities.csv
You can sort multiple columns: just use them with the -c option (Number G). The following line will sort our data in descending mode by country and by population:
csvsort -r -c country,population worldcities.csv
How to combine csvcut and csvsort
Csvsort is powerful but it still generates all the columns. By combining csvcut and csvsort, we can perform any type of output or sort.
For example, let’s extract only the city name, country name, latitude, longitude and sort these columns by latitude (Number H).
csvcut -c city,country,lat worldcities.csv | csvsort -c lat
How to get a more enjoyable ride
If you want nicer output, the csvlook command allows you to render the CSV output in a Markdown-compatible fixed-width format.
From our previous example, we just run the csvlook command at the end of our line (Number I):
csvcut -c city,country,lat worldcities.csv | csvsort -c lat | csvlook
How to get statistics with csvstat
The csvstat command can be used to obtain various statistics on the CSV file.
Executed with no arguments except the filename, it provides detailed statistics for each column. It is also possible to use the -c option to display the selected columns (Figure J).
csvstat -c country
It is possible to adjust the output of the command using different options.
To extract the unique values from the country column, we can use the –unique (Number K).
For a list of all csvstat options, please enter the following command:
Csvkit contains several different command line tools that make it easy for IT professionals and people who need to work with large CSV files to do so from the command line. The ability to combine these tools including csvcut and csvsort makes it very powerful and should meet all professional needs.
Moreover, it is also possible to use csvkit to convert XLS and JSON files to CSV before parsing them or using them with command line tools.
Disclosure: I work for Trend Micro, but the opinions expressed in this article are my own.