Data Manipulation and More with the Command Line

by Ku Wee Kiat, Research Computing, NUS Information Technology

Ever needed to have a directory of files renamed to a certain format? Extract lines with certain keywords from log files? Even create csv files from semi-structured logs?

There is no need to write custom Python or R scripts or install extra software for this: many simple tasks can be done faster with the standard tools available from a Bash command line.

The Basics

Some Keywords and Symbols

● stdout
- ○ Standard Output
- ○ default file descriptor where a process can write output.
● stdin
- ○ Standard Input
- ○ default file descriptor where a process can read input.
● stderr
- ○ Standard Error
- ○ default file descriptor where a process can write error output
● |
- ○ Pipes
- ○ a pipe is a special file that connects the output of one process to the input of another process
- ○ e.g.: cat file1.txt | grep keyword
  - ◼ Find lines with “keyword” in file1.txt
● >
- ○ redirect, write
- ○ e.g.: cat file1.txt > file2.txt
  - ◼ Overwrite the contents of file2 with file1
● >>
- ○ redirect, append
- ○ e.g.: cat file1.txt >> file2.txt
  - ◼ Append the contents of file1 to file2
● $variableName
- ○ Access bash variable by adding ‘$’ in front of variable name
- ○ Assign variable in bash:
  - ◼ variableName="some_string" (straight quotes, and no spaces around the = sign)
● man tool_name
- ○ Manual of a bash tool
- ○ e.g.: man ls
  - ◼ Prints Directory listing tool (ls) manual to terminal
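The symbols above can be tried together in one short session. This is a minimal sketch, not part of the original article: the file names, the variable names and the log lines are made up for illustration, and everything is written to a temporary directory.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# Assign a variable (no spaces around '=') and read it back with '$'.
keyword="error"

# '>' creates the file, or overwrites it if it already exists.
printf 'error: disk full\nall good\n' > "$workdir/file1.txt"

# '>>' appends to the end of the file instead of overwriting it.
printf 'error: timeout\n' >> "$workdir/file1.txt"

# '|' connects the stdout of cat to the stdin of grep.
matches=$(cat "$workdir/file1.txt" | grep "$keyword")
echo "$matches"

# stderr is a separate stream: '2>' captures the error message in a file,
# while stdout (empty here, because the file does not exist) is untouched.
listing=$(ls "$workdir/no_such_file" 2> "$workdir/errors.txt") || true
```

After running this, errors.txt holds the complaint from ls, while the listing variable stays empty because nothing was written to stdout.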

Commands & Tools

ls – list directory contents

cat – concatenate files and print on the standard output

For concatenating the contents of one file to another:

cat file1.txt >> file2.txt

For overwriting the contents of one file with another:

cat file1.txt > file2.txt

For copying the contents of multiple files (train1.txt, train2.txt, train3.txt) to the end of a single file (mega_train.txt):

cat train*.txt >> mega_train.txt

head – output the first part of files

Use -n 5 to print just the first 5 lines. Replace 5 with any other number.

Use this to peek at large files you do not want to open as a whole.

tail – output the last part of files

Use -n 5 to print just the last 5 lines. Replace 5 with any other number.
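As a quick illustration of head and tail, the sketch below builds a 100-line file with seq and peeks at both ends. The file name numbers.txt and the variable names are assumptions made for this example only.

```shell
# Build a throwaway 100-line file (one number per line).
workdir=$(mktemp -d)
seq 1 100 > "$workdir/numbers.txt"

# First three lines of the file.
first_three=$(head -n 3 "$workdir/numbers.txt")
echo "$first_three"

# Last two lines of the file.
last_two=$(tail -n 2 "$workdir/numbers.txt")
echo "$last_two"
```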

grep, egrep, fgrep, rgrep – print lines matching a pattern

 grep [OPTIONS] PATTERN [FILE...]

 grep [OPTIONS] -e PATTERN ... [FILE...]

 grep [OPTIONS] -f FILE ... [FILE...]

grep can also be used with regular expressions:

grep -E '[0-9]'
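A few common grep invocations, sketched on a made-up log file. The file app.log and its contents are hypothetical; -c (count matching lines) and -v (invert the match) are standard grep options.

```shell
# Hypothetical log file used only for this example.
workdir=$(mktemp -d)
cat > "$workdir/app.log" <<'EOF'
INFO start
ERROR code 42
WARN low memory
ERROR no code
EOF

# Lines that contain at least one digit (extended regular expression).
with_digits=$(grep -E '[0-9]' "$workdir/app.log")
echo "$with_digits"

# Count the lines that contain ERROR.
error_count=$(grep -c 'ERROR' "$workdir/app.log")
echo "$error_count"

# Invert the match: every line that does NOT contain ERROR.
not_error=$(grep -v 'ERROR' "$workdir/app.log")
echo "$not_error"
```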

diff – compare files line by line

# Remove the first line from our dataset

wk@6620G:~/proj2$ sed 1d train.csv > train1.csv

# find the difference between our dataset and the new dataset

wk@6620G:~/proj2$ diff train.csv train1.csv

1d0

< PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked

find - search for files in a directory hierarchy

# Find all directories named src

find . -name src -type d

# Find all python files that have a folder named test in their path

find . -path '**/test/**/*.py' -type f

# Find all files modified in the last day

find . -mtime -1

# Find all .tar.gz archives with size in range 500k to 10M

find . -size +500k -size -10M -name '*.tar.gz'

Chaining Commands with Pipes

sed – stream editor for filtering and transforming text

sed can, for example, replace every space with an underscore using a regular expression substitution such as s/ /_/g.
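A small sketch of that substitution. The input strings are made up; the trailing g makes sed replace every match on the line rather than only the first, and -E switches to extended regular expressions so that + means "one or more".

```shell
# Replace each space with an underscore.
renamed=$(echo "my holiday photo.jpg" | sed 's/ /_/g')
echo "$renamed"

# With -E, ' +' matches a run of one or more spaces,
# so each run collapses into a single underscore.
collapsed=$(echo "a   b  c" | sed -E 's/ +/_/g')
echo "$collapsed"
```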

wc – print newline, word, and byte counts for each file

Use the -l option to count lines.
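For instance, on a three-line file (three.txt is a hypothetical name used only here), reading the file through stdin makes wc print just the number, without the file name:

```shell
# Create a three-line file in a temporary directory.
workdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$workdir/three.txt"

# 'wc -l < file' prints only the count; 'wc -l file' would also print the name.
line_count=$(wc -l < "$workdir/three.txt")
echo "$line_count"
```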

We can combine a few commands to accomplish a task.

For example, we can count the rows in the train.csv dataset where the passenger is female.
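One way to do this is to pipe cat into grep and then into wc. The sketch below runs on a tiny stand-in file, sample.csv, whose rows are invented for illustration but follow the train.csv column layout; on the real dataset, substitute train.csv.

```shell
# Illustrative sample only: made-up rows in the same column layout as train.csv.
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,A1,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,38,1,0,B2,71.28,C85,C
3,1,3,"Poe, Miss. Ann",female,26,0,0,C3,7.92,,S
4,0,3,"Moe, Mr. Sam",male,,0,0,D4,8.05,,Q
EOF

# Keep only the rows whose Sex field is exactly "female", then count them.
female_count=$(cat "$workdir/sample.csv" | grep ',female,' | wc -l)
echo "$female_count"
```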

And for males:
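A caveat worth showing: the pattern male also matches the word female, so a plain grep male counts both. Including the surrounding commas restricts the match to the whole field. The sketch uses the same invented sample.csv stand-in for train.csv.

```shell
# Illustrative sample only: made-up rows in the same column layout as train.csv.
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,A1,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,38,1,0,B2,71.28,C85,C
3,1,3,"Poe, Miss. Ann",female,26,0,0,C3,7.92,,S
4,0,3,"Moe, Mr. Sam",male,,0,0,D4,8.05,,Q
EOF

# Naive pattern: "male" is a substring of "female", so every data row matches.
naive_count=$(cat "$workdir/sample.csv" | grep 'male' | wc -l)
echo "$naive_count"

# Anchoring on the field separators matches only rows whose Sex is "male".
male_count=$(cat "$workdir/sample.csv" | grep ',male,' | wc -l)
echo "$male_count"
```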

The features available in the dataset are as shown:
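The column names can be read straight from the header line. As a sketch, the snippet below writes the header shown in the diff output earlier into a hypothetical sample.csv and numbers each column, which is handy later when picking fields by position.

```shell
# Stand-in file containing the train.csv header plus one made-up row.
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,A1,7.25,,S
EOF

# Take the first line only, then print each comma-separated field
# together with its position (NF is the number of fields on the line).
columns=$(head -n 1 "$workdir/sample.csv" | awk -F ',' '{for (i = 1; i <= NF; i++) print i": "$i}')
echo "$columns"
```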

Let’s say we just need the PassengerId and Name of each passenger. We can use the awk tool.

gawk/awk – pattern scanning and processing language
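A minimal sketch of selecting the PassengerId (field 1) and Name (field 4) with awk, again on an invented sample.csv in the train.csv layout. It also shows why the commas inside the quoted names need handling first: awk with -F ',' splits on every comma, including those inside quotes.

```shell
# Illustrative sample only: made-up rows in the same column layout as train.csv.
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,A1,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,38,1,0,B2,71.28,C85,C
EOF

# Without preprocessing, field 4 of the first data row (NR==2) is cut
# at the comma inside the quoted name.
broken=$(awk -F ',' 'NR==2 {print $4}' "$workdir/sample.csv")
echo "$broken"

# Replace ", " with "_" first so each name stays in one field,
# then print fields 1 and 4 separated by a comma.
id_and_name=$(sed 's/, /_/g' "$workdir/sample.csv" | awk -F ',' '{print $1 "," $4}')
echo "$id_and_name"
```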

If you need to save the output to a file, you can use “>” (overwrite) or “>>” (append).

For example:

cat train.csv | sed -u "s/, /_/g"  | awk -F "," '{print $1"," $4}' >> output.txt

In the above example, sed replaces the comma and space inside each name with an underscore. Because we told awk to split fields on “,”, any comma left inside a name would split that name across two fields.

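If we want to get rid of the first line, which holds the column names, and keep just the data, we can delete it with sed:

Use: sed 'Nd', where N is the number of the line to delete (sed '1d' deletes the first line)

Other usage:

Delete a range of lines: sed 'N,Md'

Multiple deletes: sed '1d;3d;N,Md' (quote the expression so the shell does not treat ; as a command separator)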
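The three forms of deletion can be checked on a small numbered file. This is a sketch with a hypothetical file lines.txt containing the numbers 1 to 6, so the surviving line numbers are easy to read off.

```shell
# Six lines, each holding its own line number.
workdir=$(mktemp -d)
seq 1 6 > "$workdir/lines.txt"

# Delete a single line (here the first).
no_first=$(sed '1d' "$workdir/lines.txt")

# Delete a range of lines (2 through 4).
no_range=$(sed '2,4d' "$workdir/lines.txt")

# Several deletes in one expression, separated by ';'.
multi=$(sed '1d;3d;5,6d' "$workdir/lines.txt")

echo "$no_first"
echo "$no_range"
echo "$multi"
```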

sort – sort lines of text files

sort can also order lines by the values in a particular column.

Sorting by name, which is the 2nd column, with columns separated by “,”:
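A sketch of that sort on an invented sample.csv in the train.csv layout: the pipeline first reduces each row to PassengerId and Name, drops the header, and then sorts on the second comma-separated field. The option -t sets the field separator and -k2,2 restricts the sort key to field 2.

```shell
# Illustrative sample only: made-up rows in the same column layout as train.csv.
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,A1,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,38,1,0,B2,71.28,C85,C
3,1,3,"Poe, Miss. Ann",female,26,0,0,C3,7.92,,S
4,0,3,"Moe, Mr. Sam",male,,0,0,D4,8.05,,Q
EOF

# Keep PassengerId and Name, drop the header line, sort by the name field.
sorted=$(sed 's/, /_/g' "$workdir/sample.csv" \
  | awk -F ',' '{print $1 "," $4}' \
  | sed '1d' \
  | sort -t ',' -k2,2)
echo "$sorted"
```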

Sort by age, displaying only the name and age columns:

cat train.csv | sed -u "s/, /_/g"  | awk -F "," '{print $4"," $6}' | sed 1d | grep -E '[0-9]' | sort --field-separator="," -k2 -g

uniq – report or omit repeated lines

If we want to get a list of all possible ages, we can use uniq:

cat train.csv | sed -u "s/, /_/g"  | awk -F "," '{print  $6}' | sed 1d | grep -E '[0-9]' | sort -g | uniq
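uniq only collapses duplicate lines that are adjacent, which is why the values are sorted before being passed to it. A short sketch on a made-up list of ages:

```shell
# Made-up ages with non-adjacent duplicates.
ages=$(printf '22\n38\n22\n26\n38\n')

# Without sorting, no two equal lines are neighbours, so nothing is removed.
unsorted_result=$(echo "$ages" | uniq)

# Sorting numerically first puts equal values next to each other.
distinct=$(echo "$ages" | sort -g | uniq)

echo "$unsorted_result"
echo "$distinct"
```

Adding -c to uniq prefixes each distinct value with the number of times it occurred.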

cut – remove sections from each line of files

We can also use cut instead of awk to extract one or more columns.

In the below example, we use cut to extract the name column, but before that we must substitute the commas in the name column with underscores as cut depends on commas as the delimiter. After substitution, we pipe the output into the cut command to retrieve only the name column and then reverse the substitution we did earlier using sed.
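The steps described above can be sketched as follows, using an invented sample.csv in the train.csv layout. The option -d sets the delimiter and -f picks the field number. Note that the final sed would also turn any underscore originally present in a name into ", ", so this round trip assumes names contain no underscores.

```shell
# Illustrative sample only: made-up rows in the same column layout as train.csv.
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,A1,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,38,1,0,B2,71.28,C85,C
EOF

# 1. Hide the commas inside names, 2. cut out field 4, 3. restore the commas.
names=$(sed 's/, /_/g' "$workdir/sample.csv" | cut -d ',' -f 4 | sed 's/_/, /g')
echo "$names"
```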

If you want to ensure there are no duplicates, you can pipe the output of the above command into sort and then uniq to get unique values.

Look out for subsequent installments for more tips and tricks.

Tagged 2020-05 hpc_at_nus_newsletter