Example blog

How to Work More Efficiently with Log Files Using Linux csvkit

All IT professionals and incident managers have to deal with log files from various sources almost daily. Learn how to work faster and more efficiently to get the most out of CSV files with csvkit on Linux.

Image: BEST-BACKGROUNDS / Shutterstock

Must-read developer content

CSV files are often imported into Excel or LibreOffice software before being used and analyzed. It is very practical and comfortable, as long as the files are not too big. But some log files can contain billions of rows, making it impossible to import them into spreadsheets. Or you may need to scan files remotely on headless servers without any ability to use a graphical user interface.

SEE: Hiring Kit: JavaScript Developer (TechRepublic Premium)

Fortunately, a simple solution is available on all Linux operating systems: the csvkit software.

How to install csvkit

Since the tool is available in standard repositories, it is extremely easy to install. In this article, we will be using an operating system based on Ubuntu.

Let’s run the installation in a command line shell by running:

sudo apt install csvkit

That’s it. The system now installs the tool and all necessary dependencies.

How to work on a CSV file

To illustrate our point, we are going to work on a SimpleMaps.com CSV file containing a list of cities and information about them: country, longitude, latitude, population, etc.

The first line of the CSV file shows the different column names, as is often the case with CSV files. This can be seen with the “head” command, which displays by default the first 10 lines of a file (Figure A).

Figure A

figa.jpg

The header of the CSV file.

How to understand the columns of the file

Now let’s start using csvcut from the command line, one of the tools built into the csvkit. Running the following command will automatically display named columns and indices (Number B):

csvcut -n 

Number B

figb.jpg

Using csvcut to list the columns in the file.

We could then use the indexes or column names to remedy this.

How to display the selected columns

One of the most common operations when processing CSV files is to select a few columns or rearrange the columns.

To display only a few columns, let’s again use the csvcut command with the -c option. Both command lines work, to show how to use both the subscripts or the column name. In our example, we’ll use the head command again with a pipe, just to display the first few rows of the results (Figure C).

csvcut -c 1,5,10 
csvcut -c city,country,population 

Figure C

figc.jpg

An output with a few columns selected.

If we want to add row numbers to the output, the -l option comes to the rescue and adds a new column named line_number to our output (Number D).

Number D

figd.jpg

Adding a row number to the output results.

The output can of course be redirected to a new file. To do this, we redirect the output to a file using the> character. From our previous example:

csvcut -l -c city,country,population worldcities.csv > newfile.csv

How to change the order of the columns

Using csvcut we can also create output that rearranges the columns. All we need is to specify the columns, and the tool will display it accordingly (Figure E).

Figure E

freeze.jpg

How to sort data with csvsort

You can sort the data using the csvsort command. Similar to csvcut, csvsort allows the use of the -n option to list columns and -c to use either the column index or the column name.

By default, csvsort works in ascending mode, but it is possible to use the -r option to sort in descending mode.

Let’s sort our file by country name, in descending order (Figure F):

csvsort -r -c country worldwities.csv

Figure F

figf.jpg

Results sorted by country name in descending order.

You can sort multiple columns: just use them with the -c option (Number G). The following line will sort our data in descending mode by country and by population:

csvsort -r -c country,population worldcities.csv

Number G

figg.jpg

Results sorted with multiple columns.

How to combine csvcut and csvsort

Csvsort is powerful but it still generates all the columns. By combining csvcut and csvsort, we can perform any type of output or sort.

For example, let’s extract only the city name, country name, latitude, longitude and sort these columns by latitude (Number H).

csvcut -c city,country,lat worldcities.csv | csvsort -c lat

Number H

combat.jpg

Combine csvcut and csvsort.

How to get a more enjoyable ride

If you want nicer output, the csvlook command allows you to render the CSV output in a Markdown-compatible fixed-width format.

From our previous example, we just run the csvlook command at the end of our line (Number I):

csvcut -c city,country,lat worldcities.csv | csvsort -c lat | csvlook

Number I

figi.jpg

Results of the csvlook command.

How to get statistics with csvstat

The csvstat command can be used to obtain various statistics on the CSV file.

Executed with no arguments except the filename, it provides detailed statistics for each column. It is also possible to use the -c option to display the selected columns (Figure J).

csvstat -c country 

Figure J

figj.jpg

Statistics on the “country” column.

It is possible to adjust the output of the command using different options.

To extract the unique values ​​from the country column, we can use the –unique (Number K).

Number K

figk.jpg

The number of unique countries using csvstat.

For a list of all csvstat options, please enter the following command:

csvstat -h

Csvkit contains several different command line tools that make it easy for IT professionals and people who need to work with large CSV files to do so from the command line. The ability to combine these tools including csvcut and csvsort makes it very powerful and should meet all professional needs.

Moreover, it is also possible to use csvkit to convert XLS and JSON files to CSV before parsing them or using them with command line tools.

Disclosure: I work for Trend Micro, but the opinions expressed in this article are my own.

Also look


Source link