1.8: Data Manipulation

Introduction

In this chapter, let us discuss ways to manipulate data from files: operations such as sort, slice, join, reformat, aggregate, etc. I had originally named this chapter File Manipulation Commands; however, none of the commands discussed here actually changes the input file. These commands only manipulate the contents and write the output to stdout, which can be redirected into a file.

This subset of commands is among the most frequently used, and gaining expertise with them will be really beneficial to developers, testers and anyone who works with files. Along with the commands used for viewing files and IO redirection, these data manipulation commands let us craft command pipelines to extract, reformat and store data in files without writing complex programs.

Sample Files

In order to explore the commands, we will use the following two files: a fixed-format file and a delimited file with the same data. Feel free to use these or your own samples with the commands and options specified. A Python script to create these files is available at the end of the chapter. In addition to these two files, we will also use a few smaller files, whose contents will be displayed before they are used in the demo.

student.csv

Id,LastName,FirstName,DOB,subject,grade,score,En
S00001,Carlson,Crystal,15-May-2013,SOCS,03,98,Z
S00002,Brennan,Angel,21-Feb-2013,MATH,03,74,Z
...
...
S00004,Brooks,Jillian,10-Dec-2010,MATH,06,62,Z

student.txt

S00001 Carlson    Crystal    15-May-2013 SOCS 03  98 Z
S00002 Brennan    Angel      21-Feb-2013 MATH 03  74 Z
...
...
S00004 Brooks     Jillian    10-Dec-2010 MATH 06  62 Z

Command List: Data Manipulation

# Name Description
1 cut create slices from each record
2 paste merge multiple lines from a file into a single line, or merge multiple files line by line
3 split split a file into smaller chunks. By default, each split file contains 1000 lines; we can use options to split by a different line count or by byte size
4 join merge already sorted files by key. By default it performs an inner (equality) join, displaying only records with matching keys
5 sort sort files, a rich set of options are available
6 uniq create unique records from an already sorted file
7 tr simple translation and deletion of characters
8 column display file in tabular format
9 colrm remove columns from a file
10 cmp compare two files byte by byte and report the byte/line at which the first difference occurs. We can limit the number of bytes to be compared
11 diff compare two sorted files and display the records that are different
12 comm compare two sorted files and display the records unique to file 1, unique to file 2 and common to both in 3 columns. This command is useful for comparing files with shorter records
13 cksum create a checksum and byte count of a given file
14 md5sum create a checksum using MD5 digest algorithm
15 shasum create a checksum using SHA aka Secure Hash Algorithm
16 shuf shuffle records from file

cut : create slices from each record

The cut command is used to extract portions of each record to create a subset of a given file. The subset can be created by providing column ranges, e.g. columns 1-10,25-30,50-end, or by field numbers if we have a delimited file.

Another use of the cut command is to convert a delimited file from one delimiter to another, for example CSV to a pipe-delimited file.

Option Description
-c column positions to extract
-d CH delimiter to be used; CH should be a single character. used along with -f. default is TAB
-f fields to extract; works with the -d option
-s skip lines that do not contain the delimiter; works with the -d option
--complement select columns/fields that are not mentioned in the -c or -f options
--output-delimiter use a different delimiter on output records. By default, the input delimiter is used

Note: While using -f or -c, we need to provide individual column positions or field numbers, a range, or a combination of both, using the following conventions

  • m-n : column or field from m to n
  • m,n : column or field m and n
  • m- : from m till end of the record

Examples

  • -c1-10,25 : columns 1 to 10 and column 25 (11 bytes)
  • -d, -f1-5,10 : fields 1 to 5 and field 10 in a comma-delimited record
  • -c1-10,25,30,40-45,50-: columns 1 to 10, 25, 30, 40 to 45 and 50 till end of record
  • -d, -f1,5,11-15,30-: fields 1, 5, 11 to 15 and from field 30 to the last field in a comma-delimited record

extract last and first names from student.txt

The data is in fixed format; each field starts at the same column position. Fields are separated by a single space, and each student has 3 records, one per subject: MATH, SCIE and SOCS.

  • last name: column 8 to 17, 10 bytes long
  • first name: column 19 to 28, 10 bytes long
# sample data
$ head -3 student.txt
S00001 Kim        Ivan       19-Nov-2012 MATH 04  99 Z
S00001 Kim        Ivan       19-Nov-2012 SCIE 04  99 Z
S00002 Pena       Bryan      07-Jun-2013 MATH 03  57 Z

$ cut -c8-17,19-28 student.txt
Kim       Ivan
Kim       Ivan
Pena      Bryan
...
...
Rosario   Emily

extract last and first names from student.csv

# sample data
$ head -3 student.csv
Id,LastName,FirstName,DOB,subject,grade,score,En
S00001,Kim,Ivan,19-Nov-2012,MATH,04,99,Z
S00001,Kim,Ivan,19-Nov-2012,SCIE,04,99,Z

# delimiter is comma, get fields 2 and 3
$ cut -d, -f2,3 student.csv
LastName,FirstName
Kim,Ivan
Kim,Ivan
...
...
Rosario,Emily

extract fields other than first and last name

# --complement: get fields other than those listed in '-c' or '-f'
$ cut -d, -f2,3 --complement student.csv 
Id,DOB,subject,grade,score,En
S00001,19-Nov-2012,MATH,04,99,Z
S00001,19-Nov-2012,SCIE,04,99,Z
...
...
S00003,21-Apr-2010,SOCS,06,57,Z

convert CSV to pipe-delimited file

# -f1-: select all fields
# --output-delimiter='|' : set a different output delimiter 
$ cut -d, -f1- --output-delimiter='|' student.csv
Id|LastName|FirstName|DOB|subject|grade|score|En
S00001|Kim|Ivan|19-Nov-2012|MATH|04|99|Z
S00001|Kim|Ivan|19-Nov-2012|SCIE|04|99|Z
S00001|Kim|Ivan|19-Nov-2012|SOCS|04|75|Z
...
...
S00030|Dewey|Jack|21-Apr-2010|SOCS|06|57|Z

convert CSV to TAB delimited file

The TAB as output delimiter works slightly differently than other single-character delimiters; we need to use --output-delimiter=$'\t' instead of just \t.

# let us skip header 
$ tail -n +2 student.csv | cut -d, -f1-  --output-delimiter=$'\t'  
S00001  Kim     Ivan    19-Nov-2012     MATH    04      99      Z
...
...
S00003  Rosario Emily   21-Apr-2010     SOCS    06      57      Z

skip records that do not have delimiters

Some delimited files have blank lines and a record-count summary at the end; files exported from database tables, for example. We can skip the blank records and the summary record using the -s option.

$ cat student.csv
Id,LastName,FirstName,DOB,subject,grade,score,En
S00001,Kim,Ivan,19-Nov-2012,MATH,04,99,Z
...
...
S00100,Newton,Charles,26-Jun-2012,SOCS,04,92,Z

300 record(s) exported from STUDENT table
# last 3 records discarded, remove `-s` and try this command
$ cut -d, -f1- -s student.csv
Id,LastName,FirstName,DOB,subject,grade,score,En
S00001,Kim,Ivan,19-Nov-2012,MATH,04,99,Z
S00001,Kim,Ivan,19-Nov-2012,SCIE,04,99,Z
...
...
S00100,Newton,Charles,26-Jun-2012,SOCS,04,92,Z

Note:

There might be situations where we need to slice and rearrange columns or fields. The cut command is suitable only for slicing and does not support rearranging columns or fields, as shown below. We will look at rearranging using another command called awk in the future.
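
A quick sketch of this limitation: the order of the numbers in the -f list does not matter; cut always emits the fields in their original order.

# -f3,2 and -f2,3 produce the same output
$ head -2 student.csv | cut -d, -f3,2
LastName,FirstName
Kim,Ivan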

paste : merge multiple files

The paste command reads one record from each file passed as argument and merges the records into a single record, using TAB as the default delimiter. If one of the files has fewer records than the others, an empty string is substituted in its place.

Option Description
-d CH output delimiter, default is TAB
-s paste records from one file at a time as a single record, like a transpose

Examples

We have 2 files: cmd.txt, which contains the command list, and help.txt, which contains the command descriptions. help.txt has one record more than cmd.txt. Let us merge the contents of these two files.

Snippet from sample files

$ cat cmd.txt
echo
cat
head
tail

$ cat help.txt
display arguments on screen
concatenate files
display first 10 records
display last 10 records

last record has contents from help.txt only

$ paste cmd.txt help.txt
echo    display arguments on screen
cat     concatenate files
head    display first 10 records
tail    display last 10 records
paste   merge records from files
        sort files

# using comma as the delimiter
$ paste -d, cmd.txt help.txt
echo,display arguments on screen
cat,concatenate files
head,display first 10 records
tail,display last 10 records
paste,merge records from files
,sort files

Let us say we have an address file with 3 lines per address and we want to merge the 3 lines into one record per address. We can use paste to merge every 3 records from the address file into a single record.

To merge multiple consecutive records from a single file, we pass the data through stdin or a pipe and give paste one - (stdin) argument per record to be merged.

$ head -6 address.txt
Brooks, Sarah
52848 Cynthia Drives
West Jason, MS 82831
Byrd, Kelly
88505 Castro Heights
South Vanessa, MA 96024

# single line address as TAB delimited record
$ head -6 address.txt | paste - - -
Brooks, Sarah   52848 Cynthia Drives    West Jason, MS 82831
Byrd, Kelly     88505 Castro Heights    South Vanessa, MA 96024

# same as above, as comma-delimited record
$ head -6 address.txt | paste -d, - - -
Brooks, Sarah,52848 Cynthia Drives,West Jason, MS 82831
Byrd, Kelly,88505 Castro Heights,South Vanessa, MA 96024
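
The -s option from the table can be sketched with the cmd.txt snippet shown earlier; it merges all the records of one file into a single record, like a transpose.

# -s: one output record per input file
$ paste -s -d, cmd.txt
echo,cat,head,tail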

split: split a file into pieces by size or record count

The split command creates smaller chunks from a large file; the chunks can be concatenated together to recreate the original file if needed. In the past, due to storage size limitations, we could not always copy a large file onto a single removable storage device such as a floppy or CD; the split command made such copies possible.

We can create smaller file chunks by specifying the number of records per file, a maximum file size, or the number of chunks we want. By default, the command creates chunks of 1000 records each, with the last file holding the leftover records.

In addition to the various split options, we can provide a file prefix and suffix to be used to name the chunks. If we do not provide a prefix or suffix, the command uses x as the prefix and aa, ab, ac,.. as the suffix, creating the file names xaa, xab, xac,..

syntax

$ split [option(s)] FILE [prefix]

common options

Option Description
-b N create chunks of N bytes each
-l N create chunks with N records per file
-n N create N files of equal size; may create empty files
-e do not create empty files; used along with -n
-a N suffix length; default is 2: aa, ab, ac, …
-d use numeric suffix increments instead of aa, ab,..
-x use hexadecimal suffix increments instead of aa, ab, ..

other options

Option Description
--additional-suffix= provide static suffix
--numeric-suffixes= same as -d, allows starting value
--hex-suffixes= same as -x, allows starting value

examples

Let us create a very large file using the names_v1.txt file and the shuf command with the -n and -r options. The following command creates a 1.4GB file and may take some time to run. We will also create another file with 7550 records to demo splitting by record count.

$ shuf -r -n 123456789  names_v1.txt -o large_file_split_demo.txt

$ zip large_file_split_demo large_file_split_demo.txt

# file size: 1.4G and 73M respectively
$ stat --printf="%n: %s\n" large_file_split_demo.*               
large_file_split_demo.txt: 1481481468
large_file_split_demo.zip: 76052943

$ head -7550  large_file_split_demo.txt > split_demo.txt
$ wc -l split_demo.txt
7550 split_demo.txt

default split

By default, the split command creates file chunks with 1000 records per file.

$ ls
large_file_split_demo.txt  
large_file_split_demo.zip  
names_v1.txt  
split_demo.txt

# created files with 'x' prefix and 2 byte alpha suffix
$ split split_demo.txt

$ ls
large_file_split_demo.txt  names_v1.txt    xaa  xac  xae  xag
large_file_split_demo.zip  split_demo.txt  xab  xad  xaf  xah

$ wc -l split_demo.txt xa*
  7550 split_demo.txt
  1000 xaa
  1000 xab
  1000 xac
  1000 xad
  1000 xae
  1000 xaf
  1000 xag
   550 xah
 15100 total

max records per file

$ rm x*
# split -l 3000: max records per file is 3000 
$ split -l 3000 split_demo.txt

$ ls
large_file_split_demo.txt  names_v1.txt    xaa  xac
large_file_split_demo.zip  split_demo.txt  xab

$ wc -l xa*
 3000 xaa
 3000 xab
 1550 xac
 7550 total 

using prefix / suffix for output files

# pass prefix as the second argument: 'split_file_' 
$ split -l 3000 split_demo.txt split_file_

$ wc -l split_file_*
 3000 split_file_aa
 3000 split_file_ab
 1550 split_file_ac
 7550 total
 
# add suffix
$ split -l 3000 split_demo.txt split_file_ --additional-suffix="_chunk.txt"

$ wc -l *chunk*
 3000 split_file_aa_chunk.txt
 3000 split_file_ab_chunk.txt
 1550 split_file_ac_chunk.txt
 7550 total

split into N files

We can use the -n NUM option to specify how many chunks we want to create.

$ split -n 4  split_demo.txt split_file_ --additional-suffix="_chunk.txt"
$ wc -l *chunk*
 1887 split_file_aa_chunk.txt
 1888 split_file_ab_chunk.txt
 1887 split_file_ac_chunk.txt
 1888 split_file_ad_chunk.txt
 7550 total
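
The -e option from the table avoids empty chunks when -n asks for more pieces than the data can fill. A minimal sketch, assuming GNU split and a hypothetical 5-byte file tiny.txt:

$ printf 'abcde' > tiny.txt

# without -e: 8 files are created, 3 of them empty
$ split -n 8 tiny.txt part_
$ ls part_* | wc -l
8

# with -e: only the non-empty chunks are created
$ rm part_*; split -n 8 -e tiny.txt part_
$ ls part_* | wc -l
5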

split by size

Let us use the -b N option to create files of at most N bytes each. We can suffix K, M, G, T, etc. to specify the size in KB, MB, and so on. For this demo, we will use the 73 MB zip file.

We will use -d to get numeric suffixes instead of alphabetic ones and -a N to set the suffix length instead of the default 2. The numeric suffix starts at 0 and is incremented by 1 when we use the -d option. If we need the suffix to start with some other number, we can use the --numeric-suffixes=NNN option.

check the zip file details

$ ls -lh large_file_split_demo.zip
-rw-r--r-- 1 mktutes mktutes 73M Aug 11 05:52 large_file_split_demo.zip

# zipinfo: provides info about the zipped files
# like the ls -l command
$ zipinfo large_file_split_demo.zip
Archive:  large_file_split_demo.zip
Zip file size: 76052943 bytes, number of entries: 1
-rw-r--r--  3.0 unx 1481481468 tx defN 21-Aug-11 05:49 large_file_split_demo.txt
1 file, 1481481468 bytes uncompressed, 76052743 bytes compressed:  94.9%

split the file into 10MB chunks

# use numeric suffix of 5 bytes long
# output file prefix is "num_"
$ split -d  -a 5 -b 10M large_file_split_demo.zip num_

$ ls -lh num_*
-rw-r--r-- 1 mktutes mktutes  10M Aug 11 07:56 num_00000
-rw-r--r-- 1 mktutes mktutes  10M Aug 11 07:56 num_00001
-rw-r--r-- 1 mktutes mktutes  10M Aug 11 07:56 num_00002
-rw-r--r-- 1 mktutes mktutes  10M Aug 11 07:56 num_00003
-rw-r--r-- 1 mktutes mktutes  10M Aug 11 07:56 num_00004
-rw-r--r-- 1 mktutes mktutes  10M Aug 11 07:56 num_00005
-rw-r--r-- 1 mktutes mktutes  10M Aug 11 07:56 num_00006
-rw-r--r-- 1 mktutes mktutes 2.6M Aug 11 07:56 num_00007

concat file chunks and check the integrity of the zip file

$ cat num_0000* > temp.zip

$ ls -lh temp.zip
-rw-r--r-- 1 mktutes mktutes 73M Aug 11 07:59 temp.zip

$ md5sum *.zip
a6d0061a44673d9493f238850eb20c5a  large_file_split_demo.zip
a6d0061a44673d9493f238850eb20c5a  temp.zip

$ zipinfo temp.zip
Archive:  temp.zip
Zip file size: 76052943 bytes, number of entries: 1
-rw-r--r--  3.0 unx 1481481468 tx defN 21-Aug-11 05:49 large_file_split_demo.txt
1 file, 1481481468 bytes uncompressed, 76052743 bytes compressed:  94.9%

join: merge two sorted files using a key

The join command merges two lexically sorted files on a common field. By default, it prints the matched records from both files. Using additional options, we can also write the unmatched records from one or both files. By default, the output has the join key followed by the remaining fields from FILE 1 and then from FILE 2.

The command uses space as the default field delimiter and considers the first field the comparison key. We can override the delimiter and specify different field numbers as the comparison key.

The join command works like the JOIN operation in SQL queries; however, it only uses a single field for matching.

Option Description
-a 1 print unmatched records from FILE 1 along with matches; left outer join
-a 2 print unmatched records from FILE 2 along with matches; right outer join
-a 1 -a 2 print unmatched records from both files; full outer join
-v 1 print only the unmatched records from FILE 1, suppressing matches
-v 2 print only the unmatched records from FILE 2, suppressing matches
-v 1 -v 2 print only the unmatched records from both files
-t CH use CH as input field delimiter instead of space
-i ignore case while comparing fields; sort the inputs using the -f option
-j N use field N from FILE 1 and FILE 2 for comparison instead of the first field
-1 N use field N from FILE 1 for comparison
-2 N use field N from FILE 2 for comparison
-e STR print STR for missing values;
we need to use -o auto along with -a and -e
-o FMT override the default merge order;
-o 0 displays only the keys; -o 1.2 2.2 displays individual fields;
use FILENUM.FIELDNUM,... to rearrange fields
--header treat first line as header

sample files

For this demo, let us create two files with numbers in ascending order as the first field. The second field is the Roman numeral in the first file and the number spelled out in the second. These files are named roman.txt and english.txt.

We will use the --header option to treat the first record of both files as a header.

$ cat roman.txt
NBR Roman
1 I
2 II
3 III
4 IV
5 V

$ cat english.txt
NUM Text
0 Zero
1 One
2 Two
3 Three
4 Four

inner join: also known as natural join (default)

The join command without any options merges the two files using the first space-delimited field and outputs the matched records from both files. The unmatched records from both files are skipped. In this example, the last record (5 V) from roman.txt and the first record (0 Zero) from english.txt are skipped.

$ join --header roman.txt english.txt
NBR Roman Text
1 I One
2 II Two
3 III Three
4 IV Four

left outer join: print matched records and unmatched from FILE 1

The -a 1 option prints the unmatched records from FILE 1 along with the matched records. By default, the unmatched records will not have any data in the FILE 2 columns. If we want to print default values, we need to use the -e VAL and -o auto options.

$ join --header -a 1 roman.txt english.txt
NBR Roman Text
1 I One
2 II Two
3 III Three
4 IV Four
5 V

# print NA for missing values; see record: `5 V NA`
$ join --header -a 1 -e NA -o auto roman.txt english.txt
NBR Roman Text
1 I One
2 II Two
3 III Three
4 IV Four
5 V NA

right outer join: print matched records and unmatched from FILE 2

The -a 2 option prints the matched records plus the unmatched records from FILE 2. Missing FILE 1 values for the unmatched records display NA; look at record 0 NA Zero.

$ join --header -a 2 -e NA -o auto roman.txt english.txt
NBR Roman Text
0 NA Zero
1 I One
2 II Two
3 III Three
4 IV Four

full outer join: print matched and unmatched records from both files

$ join --header -a 1 -a 2 -e NA -o auto roman.txt english.txt
NBR Roman Text
0 NA Zero
1 I One
2 II Two
3 III Three
4 IV Four
5 V NA

extract unmatched records only

Sometimes we might be interested only in the missing records; for example, if we run the same job twice, we expect to get the same number of records with matching keys. We can verify this expectation by running the join command with no options: if the record counts of both files are the same and the output record count matches the input count, our expectation is met. However, if we see differences in record counts between the input files, or between input and output, we need to know which keys did not match between the two files.

We can use the -v N option to suppress the matched records and -a N to display the unmatched records from FILE N (N=1 or 2). We can get the unmatched records from one of the files or from both. Furthermore, if we are interested in only the keys, we can display the keys alone using the -o option.

# join:
#   --header: treat first record as header and use it for display
#   -v 1 : suppress matching records
#   -a 1 : display unmatched records from FILE 1
#   -o 0 : display only the keys
$ join --header -v 1 -a 1 -o 0 roman.txt english.txt
NBR
5

# unmatched records from FILE 2: -v 2 -a 2
$ join --header -v 2  -a 2 -o 0 roman.txt english.txt
NBR
0

# unmatched records from both files
$ join --header -v 2  -a 2 -a 1 -o 0 roman.txt english.txt
NBR
0
5

override input field delimiter and specify matching field numbers

If the files we want to join have a different delimiter, we can use -t CHAR to override the default delimiter (space). Both input files should use the same delimiter.

Let us create roman.csv as a comma-delimited file with fields 1 and 2 swapped. Let us also create another file called num_string.csv with the numbers spelled out in English and German as fields 1 and 2, and the matching field (Arabic numerals) as field 3.

$ head -2 roman.csv
Roman,NBR
I,1

$ head -2 num_string.csv
English,German,NUM
Zero,Null,0

# join -t: input delim, -1 N and -2 N : matching field numbers
$ join --header -t, -1 2 -2 3 roman.csv num_string.csv
NBR,Roman,English,German
1,I,One,Eins
2,II,Two,Zwei
3,III,Three,Drei
4,IV,Four,Vier

reorder output fields

In the above example, the output of join is formatted as follows: key, fields from file 1, then fields from file 2. This can be overridden using the -o option, where 0 represents the key field and FILE_NBR.FIELD_NBR specifies an individual field. For example, -o "0 1.2 2.1" prints the key followed by the 2nd field from file 1 and the 1st field from file 2.

$ join --header -t, -1 2 -2 3 -o "0 2.1 1.1 2.2"  roman.csv num_string.csv
NBR,English,Roman,German
1,One,I,Eins
2,Two,II,Zwei
3,Three,III,Drei
4,Four,IV,Vier

case-insensitive comparison

When join compares the matching fields, it does a case-sensitive comparison that treats uppercase and lowercase letters as distinct. Sometimes we may not care about the case of the key. We can use the -i option to enable case-insensitive comparison.

If we use -i, the files have to be sorted keeping uppercase and lowercase values together. See the -f option of the sort command in the next section for more info.

The matching key in the output is taken from the first file passed as argument to the join command.

For the case-insensitive comparison, let us replace the numbers with letters, uppercase in one file and lowercase in the other, and create a new set of files.

$ head -3 alpha_roman.txt
NBR Roman
b II
c III

$ head -3 alpha_num.txt
NUM Text
A One
B Two

# inner join: no matches due to case difference
$ join --header alpha_roman.txt alpha_num.txt
NBR Roman Text

# key will be taken from file 1
$ join --header -i alpha_roman.txt alpha_num.txt
NBR Roman Text
b II Two
c III Three
d IV Four

$ join --header -i alpha_num.txt alpha_roman.txt
NUM Text Roman
B Two II
C Three III
D Four IV
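
The -j N option listed in the table is simply shorthand for -1 N -2 N; with the original roman.txt and english.txt it reproduces the inner join shown earlier.

# -j 1 is the same as -1 1 -2 1 (the default)
$ join --header -j 1 roman.txt english.txt
NBR Roman Text
1 I One
2 II Two
3 III Three
4 IV Four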

sort: sort a file

Some of the commands in this chapter, such as join and uniq, as well as the file comparison commands we will discuss later, expect their input file(s) to be sorted. Understanding how the sort command works and gaining expertise in using it is very important for anyone who wants to master the CLI.

The sort command, by default, sorts records by comparing them column by column, from the first column all the way to the last. It uses the ASCII values of the characters in a given column to decide the order of the records. For example, uppercase A sorts before lowercase a because of their ASCII values, 65 and 97 respectively. Records are sorted in ascending order by default.

The sort command comes with an extensive collection of options to control how the records within a file are sorted. We will look at the commonly used options; once we become familiar with these, we can explore the others.

Here are the sorting options

Option Description
-d dictionary order; consider blanks and alphanumeric characters only
-n numeric sort; instead of column-by-column comparison, treat a run of digits as a single number for sorting
-h human-readable sorting on byte size; 10K, 2M, 1G etc..
-M month sort; JAN, FEB, MAR, etc…
-k specify field numbers / column position as sort keys
-f case-insensitive sort
some versions of sort do case-insensitive sort by default
-r sort in descending order; reverse sort
-R random sort; shuffle the records but keep sort keys of same value together
-m merge already sorted files together
-u unique sort; keep only the first occurrence of record based on the sort key

Here are some options to override the default configuration settings.

Option Description
-t CH override default field separator (space) with CH
-T DIR override default temp directory used (/tmp)
-o FILE write sorted output to FILE
-c check whether a file is already sorted or not;
prints an error message if not sorted
-C same as -c but suppresses the error message;
use echo $? to check: 0=sorted, 1=not sorted

sort a names file: case-sensitive and case-insensitive

Sorting of letters is usually case-sensitive; uppercase letters are sorted before lowercase ones because uppercase letters have smaller ASCII values. We can use the -f option to perform a case-insensitive sort.

Depending on the sort version and locale settings, the comparison may be case-insensitive by default: the sort command in Google Cloud Shell does a case-insensitive sort, whereas sort on a Macbook does a case-sensitive sort.

$ cat names.txt
Macias
gerald
Huang
Gordon
diana
Daugherty
Cook
# mac os: case-sensitive sort by default
$ sort names.txt
Cook
Daugherty
Gordon
Huang
Macias
diana
gerald

# google cloud shell: case-insensitive sort by default
$ sort names.txt
Cook
Daugherty
diana
gerald
Gordon
Huang
Macias

If we want an explicitly case-sensitive sort, we can set the locale environment variable LC_ALL=C.

# case-sensitive sort: setting LC_ALL=C env variable
# LC_ALL: Language LOCALE settings
# uppercase before lowercase
$ LC_ALL=C sort names.txt
Cook
Daugherty
Gordon
Huang
Macias
diana
gerald

# case-insensitive sort
$ LC_ALL=C sort -f names.txt
Cook
Daugherty
diana
gerald
Gordon
Huang
Macias

reverse sort using -r

By default, the records are sorted in ascending order and we can use -r to reverse sort in descending order.

# reverse sort along with case-fold / case-insensitive sort
$ sort -f -r names.txt
Macias
Huang
Gordon
gerald
diana
Daugherty
Cook
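
The -c and -C options from the table can be sketched with the same file. Assuming GNU sort, -c reports the first out-of-order record (which record is reported depends on the locale), while -C stays silent and only sets the return code:

# -c: report the first record that is out of order (C locale)
$ LC_ALL=C sort -c names.txt
sort: names.txt:3: disorder: Huang

# -C: no message; inspect the return code instead
$ LC_ALL=C sort -C names.txt; echo $?
1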

numeric sort

By default, sort performs an ASCII sort, comparing the ASCII values of the characters in each column to arrange the records in ascending order. If two records have the same value in a given column, sort moves to the next column and compares again.

The -n option overrides this behaviour and forces sort to treat a run of digits as a single number rather than a sequence of characters. Let us sort a file of random numbers and see how -n changes the result.

$ cat rand_nums.txt
22675
81
91
...
..
3
9
8634

default sort: column-wise comparison groups all records starting with 1 together, then sorts on the second column, and so on

$ sort rand_nums.txt
116
151400
199
22057
22675
3
3149
81
8634
88
9
91

numeric sort: set of characters treated as numbers

$ sort -n rand_nums.txt
3
9
81
88
91
116
199
3149
8634
22057
22675
151400

sorting human-readable file size: KB, MB, GB, TB, …

sort provides the -h option to sort by file sizes displayed in human-readable form. When listing files with ls, we can use ls -lSh or ls -lShr to let ls do the sorting itself. Sometimes, however, we get the details as a report or a file; in that case we use sort -h, along with commands like cut, to extract the file name and size and sort by size.

$ sort rand_file_size.txt
1G
1T
20K
2G
2K
2M
3.4K
987

$ sort -h rand_file_size.txt
987
2K
3.4K
20K
2M
1G
2G
1T

sort using portion of a file

The sort command has the -k or --key=KEYDEF option to sort the file using parts of each record. For a fixed-format file, we can use start and end column numbers. For delimited files, we specify the delimiter using -t and list the field numbers by which we want the file sorted. We can specify more than one key.

The -k option accepts arguments of the form FIELD_NBR.COL_POS with an optional sort type like h or n that is local to that key alone. For example, -k1.9,1.12n -k1.1,1.5 means: perform a numeric sort using columns 9-12, then sort by columns 1-5. For fixed-format files, FIELD_NBR is always 1 (the whole record is field 1).

For delimited files, we can use the field numbers and, optionally, column numbers within the field itself. For example, -k2 -k5n -k1 means sort by the second field, then the fifth field numerically, then the first field. -k2.6,2.10 means sort by the second field but consider only columns 6-10 of field 2.

Here is another student file, similar to the one used in the cut demo. Let us sort it with the following criteria

  1. grade: field 6
  2. score: field 7; numeric sort, descending
  3. date of birth: year, month, date; using key positions and month sort
  4. last name: field 2
  5. first name: field 3
$ head -7 student_math.csv
S00001,Huang,Amanda,30-NOV-2010,MATH,06,58,Z
S00002,Sherman,Cody,23-MAY-2009,MATH,07,89,Z
S00003,Gordon,Heidi,09-NOV-2009,MATH,07,87,Z
S00004,Gomez,Kelly,29-APR-2010,MATH,06,92,Z
S00005,Cook,Lisa,28-SEP-2011,MATH,05,83,Z
S00006,Daugherty,Joshua,05-APR-2011,MATH,05,94,Z
S00007,Macias,Emma,21-SEP-2009,MATH,07,79,Z

Here is the command with options to perform different types of sorting supported by the sort command; numeric, reverse, month, keys etc.. You may create your own file with duplicate values for any of these fields and try it out for better understanding.

# -t,        : comma-delimited file
# -k6,6      : #6 is grade
# -k7,7rn    : #7 is score; numeric and descending
# -k4.8,4.11 : #4 is DOB, sort by year @pos 8-11 of field 4
# -k4.4,4.6M : sort by Month in 'MMM' using the 'M' sort type
# -k4.1,4.2  : sort by date
# -k2,2 -k3,3: last name and first name 
$ sort -t, -k6,6 -k7,7rn -k4.8,4.11 -k4.4,4.6M -k4.1,4.2 -k2,2 -k3,3 student_math.csv 

S00082,Ross,Scott,05-FEB-2010,MATH,06,85,Z
S00017,Stone,James,30-MAR-2010,MATH,06,84,Z
S00041,Rush,Kristina,14-AUG-2010,MATH,06,84,Z
S00078,Golden,Valerie,10-JAN-2010,MATH,06,83,Z
S00058,Ayala,Keith,29-FEB-2010,MATH,06,83,Z
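
The -u option from the table can be sketched here too: combined with a key it keeps one record per key value. Which of the tied records is kept is not guaranteed, so treat the exact records below as illustrative:

# -u with -k6,6: one record per grade
$ sort -t, -u -k6,6 student_math.csv
S00005,Cook,Lisa,28-SEP-2011,MATH,05,83,Z
S00001,Huang,Amanda,30-NOV-2010,MATH,06,58,Z
S00002,Sherman,Cody,23-MAY-2009,MATH,07,89,Z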

uniq: get unique values, counts from “sorted” data

The uniq command, at its simplest, displays unique values by removing duplicate entries. For the command to work, the input should be in sorted order. Using additional options, the uniq command can

  • return one occurrence of each entry that has duplicates
  • return all the duplicates, not just one entry
  • count the number of occurrences of each entry
Option Description
-u return entries that do not have any duplicates
-d return one occurrence per entry that has duplicates
-D return all occurrences of entries with duplicates
-c count the number of occurrences of each entry

less commonly used options:

These options can come in handy if we need to check uniqueness on only part of each record. We can skip the first N characters or fields, but not ones in the middle. I have documented these options for awareness; a small sketch follows the table, and if interested, please explore further on your own.

Option Description
-i ignore case while comparing
-f N skip N fields (separated by spaces or TAB)
-s N skip N characters from the start for comparison
-w N consider only N characters for comparison
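
As an example, here is a minimal sketch of -w that compares only the first 3 characters of each record:

# abc123 and abc999 are treated as duplicates
$ printf 'abc123\nabc999\nxyz111\n' | uniq -w3
abc123
xyz111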

get uniq subject list from student.txt

The subject code is a 4-byte string starting at position 42. Here are the steps

  1. extract the subject code using the cut command
  2. sort the data extracted by cut
  3. pass it to uniq
$ head -3 student.txt
S00001 Kim        Ivan       19-Nov-2012 MATH 04  99 Z
S00001 Kim        Ivan       19-Nov-2012 SCIE 04  99 Z
S00001 Kim        Ivan       19-Nov-2012 SOCS 04  88 Z

# get subject list
$ cut -c42-45 student.txt | sort | uniq
MATH
SCIE
SOCS

get number of records by grade from student.txt

The grade is a 2-byte string starting at position 47. Here are the steps

  1. extract the grade using cut command
  2. sort the grades
  3. pass it to uniq with -c option
  4. the results will be displayed as COUNTS GRADE
# grade 03 has 51 records, ...
$ cut -c47-48 student.txt | sort | uniq -c
     51 03
     60 04
     60 05
     72 06
     57 07

find students with same last name

The student.txt file has multiple records per student, one per subject, so we need an extra stage in the command pipeline. Here are the steps

  1. extract student id, last name and first name using cut
  2. sort the results and get uniq combination of these 3 fields
  3. extract last name using cut
  4. sort the last names
  5. pass the sorted last names to uniq with -d option
$ cut -c1-28 student.txt | sort | uniq | cut -c8-17 | sort | uniq -d
Evans
Hernandez
Jones
May
Miller
Rivera
Ward

# the file has 2 students with the same last name: Ward
S00065 Ward       Daniel
S00093 Ward       Marcus

find students with unique and duplicate last names

We need to do the same steps from the previous example except for the last step. We will add two more commands to cut -c1-28 student.txt | sort | uniq | cut -c8-17 | sort

  1. for students with unique last names, we will use uniq -u
  2. for students with shared last names, we will use uniq -D
  3. since these two may produce a long list, we will pipe the results to wc -l to get just the counts
  4. we can verify the counts: the number of unique students in the original file should equal the sum of the unique and shared last name counts
# get count of students with unique last name
$ cut -c1-28 student.txt | sort | uniq | cut -c8-17 | sort | uniq -u | wc -l
84

# get counts of students with shared last names
$ cut -c1-28 student.txt | sort | uniq | cut -c8-17 | sort | uniq -D | wc -l
16

# get record counts of student.txt
$ cut -c1-28 student.txt | sort | uniq | wc -l
100

# 100 == 84 + 16 ---> MATCHES !!!

tr: substitute, delete or squeeze characters

The tr command, short for translate, is used to perform data manipulation at the individual character level. We can do the following using tr

  1. substitute one character with another
  2. delete a set of characters
  3. delete everything other than a given set of characters
  4. squeeze, i.e. remove duplicate characters that are next to each other

The tr command accepts input from stdin or pipe only

Option Description
-d SET delete from input all the characters from SET
-c complement, when used with -d as -dc SET, it deletes all the characters that are not in SET
-s SET remove duplicates of all characters in SET that are next to each other
-t SET1 SET2 while substituting characters from SET1 by SET2, truncate SET1 to match the length of SET2

convert uppercase to lowercase

$ echo "HELLO World" | tr 'A-Z' 'a-z'
hello world

$ echo "HELLO World" | tr [[:upper:]] [[:lower:]]
hello world
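
Since tr does not take a file argument (see the note above), we redirect a file into stdin when the input is not coming from another command:

# lowercase a whole file via input redirection
$ tr 'A-Z' 'a-z' < names.txt
macias
gerald
huang
gordon
diana
daugherty
cook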

substitution cipher: ROT13

The tr command can be used to create simple ciphers by substituting one character for another. This is also called a Caesar cipher, as it was reportedly used by Julius Caesar himself. ROT13 is a special case of the Caesar cipher that shifts each character by 13 positions: a by n, b by o, and so on. We can use tr to create ciphers like this.

Note: The substituted text reverts to the original if we process it again with the same tr command that created it. Such a cipher can also be cracked with repeated attempts using the tr command itself.

$ echo "Hello World" | tr 'A-Za-z' 'N-ZA-Mn-za-m'
Uryyb Jbeyq

# translate to lowercase first and create cipher
$ echo "Hello World" | tr 'A-Z' 'a-z' | tr 'a-z' 'n-za-m'
uryyb jbeyq

# get the original text back
$ echo "uryyb jbeyq" | tr 'a-z' 'n-za-m'
hello world

$ echo "hello world" | tr 'a-z' 'n-za-m' | tr 'a-z' 'n-za-m'
hello world

The character sets used, a-z and n-za-m, should be of the same length; a-z means the 26 lowercase letters from a to z, and n-za-m contains the same 26 letters, starting at n and wrapping around.

What happens if SET1 has fewer characters than SET2? The extra characters in SET2 are ignored and have no effect. On the other hand, if SET1 has more characters than SET2, the last character of SET2 is repeated to match the length of SET1.

For example, tr 'aeiou' 'AE' is treated as tr 'aeiou' 'AEEEE'. If we do not want this default behaviour, we can use the -t flag, which truncates SET1 to the length of SET2: tr -t 'aeiou' 'AE' is treated as tr 'ae' 'AE'.

$ echo "education" | tr 'aeiou' 'AE'
EdEcAtEEn

$  echo "education" | tr -t 'aeiou' 'AE'
EducAtion

# use case: replace all numbers with *
$ echo 'Current Salary is "65000.00 USD"' | tr '0-9' '*'
Current Salary is "*****.** USD"

delete characters

The -d SET option removes all the characters in SET, whereas -d -c SET (abbreviated -dc SET) removes all the characters that are not in SET.

$ echo "education" | tr -d 'aeiou'
dctn

$ echo "education" | tr -dc 'aeiou'
euaio

squeeze characters

The -s SET option removes duplicates of the characters in SET when they appear next to each other. We can use it to remove extra spaces in sentences, for example.

$ echo "hello     world"
hello     world
$ echo "hello     world" | tr -s ' '
hello world

column: format data in tabular form

The column command is very useful when we want our output formatted as columns or a table. This comes in handy when viewing delimited files: since each field may have a different length, viewing and understanding the raw data can be difficult.

Option Description
-s SET set of delimiters used to split each record into fields
-t determine the number of columns in the input and format the output as a table

pretty print CSV file in tabular format

$ head -5 student.csv | cut -d, -f1-7
Id,LastName,FirstName,DOB,subject,grade,score
S00001,Kim,Ivan,19-Nov-2012,MATH,04,99
S00001,Kim,Ivan,19-Nov-2012,SCIE,04,99
S00001,Kim,Ivan,19-Nov-2012,SOCS,04,88
S00002,Pena,Bryan,07-Jun-2013,MATH,03,57

$ head -5 student.csv | cut -d, -f1-7 | column -s, -t
Id      LastName  FirstName  DOB          subject  grade  score
S00001  Kim       Ivan       19-Nov-2012  MATH     04     99
S00001  Kim       Ivan       19-Nov-2012  SCIE     04     99
S00001  Kim       Ivan       19-Nov-2012  SOCS     04     88
S00002  Pena      Bryan      07-Jun-2013  MATH     03     57
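
Without -s, fields are split on runs of whitespace, which also works for the fixed-format file; a sketch (exact spacing may vary with the column version):

$ head -3 student.txt | column -t
S00001  Kim  Ivan  19-Nov-2012  MATH  04  99  Z
S00001  Kim  Ivan  19-Nov-2012  SCIE  04  99  Z
S00001  Kim  Ivan  19-Nov-2012  SOCS  04  88  Z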

colrm: remove columns from file and print

This is a simpler version of the cut command that removes columns based on the start and stop indices provided as arguments. If only one number is provided, it is treated as the start index and the command removes everything from that column to the end, displaying columns 1 to start-1. If both start and stop indices are provided, the command removes the columns from start to stop. The command reads its input from stdin.

The index starts at 1, not 0.

We can remove only one contiguous range of columns.

using fixed format text file

# sample file
$ head -5 student.txt
S00001 Huang      Amanda     30-Nov-2010 MATH 06  58 Z
S00002 Sherman    Cody       23-May-2009 MATH 07  89 Z
S00003 Gordon     Heidi      09-Nov-2009 MATH 07  87 Z
S00004 Gomez      Kelly      29-Apr-2010 MATH 06  92 Z
S00005 Cook       Lisa       28-Sep-2011 MATH 05  83 Z
# get student id, last and first name
# extract col 1-29
$ head -5 student.txt | colrm 30
S00001 Huang      Amanda
S00002 Sherman    Cody
S00003 Gordon     Heidi
S00004 Gomez      Kelly
S00005 Cook       Lisa

# discard date of birth and display the other columns
$ head -5 student.txt | colrm 30 41
S00001 Huang      Amanda     MATH 06  58 Z
S00002 Sherman    Cody       MATH 07  89 Z
S00003 Gordon     Heidi      MATH 07  87 Z
S00004 Gomez      Kelly      MATH 06  92 Z
S00005 Cook       Lisa       MATH 05  83 Z

using STDIN

# get column 1-4 alone
$ echo 'abcdefghijklm' | colrm 5
abcd

# remove column 5-8: efgh
$ echo 'abcdefghijklm' | colrm 5 8
abcdijklm

# same as above; using cut command
$ echo 'abcdefghijklm' | cut -c5-8  --complement
abcdijklm

cmp: compare two files byte by byte

cmp is one of the 3 file comparison commands we will discuss in this chapter. It is the simplest of the 3: it compares two files byte by byte. If both files are the same, it displays nothing; otherwise it prints a message saying at which byte and line the files differ.

The cmp command displays the line number and the byte position where the first difference occurs.

The command sets return code 0 if both files are the same, 1 if there is any difference, and 2 if there are any issues with the arguments or options. Use the echo $? command after running cmp to display the return code.

Option Description
-i N skip N bytes from both files before comparison
-i M:N skip M bytes from FILE 1 and
N bytes from FILE 2 before comparison
-n N consider only N bytes for comparison
-s suppress normal output;
error messages will still be displayed.
use echo $? to find the status of the comparison
-l verbose; display each byte position where the files differ
along with the differing byte values (in octal) from both files

cmp is one of the rare commands that DOES NOT use -v for turning on verbosity; it uses -l instead. -v displays the version of the cmp command.

Examples:

sample files

$ cat cmp_demo_01.txt
NumBERS: 0123456789X

$ cat cmp_demo_02.txt
Numbers: 0123 5 7  X

$ cat cmp_demo_03.txt
NUMS: 0123456789X

basic comparison

# compare as it is
# difference at byte 4 of line 1; 'B' in file 1 and 'b' in file 2
$ cmp cmp_demo_01.txt cmp_demo_02.txt
cmp_demo_01.txt cmp_demo_02.txt differ: byte 4, line 1

# -i 8: skip first 8 bytes from both files
# 14th byte has '4' in file 1 and ' ' in file 2
$ cmp -i 8 cmp_demo_01.txt cmp_demo_02.txt 
cmp_demo_01.txt cmp_demo_02.txt differ: byte 6, line 1

# same command but in silent mode
# use 'echo $?' to display return code
$ cmp -i 8 -s cmp_demo_01.txt cmp_demo_02.txt 

# rc: 0=files are same, 1=files are different
$ echo $?
1

# -n 5: compare only 5 bytes after skipping first 8 bytes
# 5 bytes match 
$ cmp -i 8 -n 5 cmp_demo_01.txt cmp_demo_02.txt 
$ echo $?
0

more comparison

$ cat cmp_demo_01.txt cmp_demo_03.txt
NumBERS: 0123456789X
NUMS: 0123456789X

# -i M:N : skip M bytes from file 1 and N bytes from file 2
# skip "NumBER" from file 1 and "NUM" from file 2
$ cmp -i 6:3 cmp_demo_01.txt cmp_demo_03.txt

# matched
$ echo $?
0

The -l option lists each byte position where the files differ along with the values from both files. The example below starts with 6 64 40: at (relative) byte 6, file 1 has 64 and file 2 has 40. These values are printed in octal, not decimal: octal 64 is the character '4' and octal 40 is a space.

# -l: verbose
# skip the first 8 bytes and compare the rest
# the command compares " 0123456789X" and " 0123 5 7  X"
$ cmp -i8 -l cmp_demo_01.txt cmp_demo_02.txt
 6  64  40
 8  66  40
10  70  40
11  71  40
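
We can confirm the octal values with printf, which accepts octal escapes in its format string:

# octal 64 is the character '4'; octal 40 is a space
$ printf '\064\040\n'
4 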

diff: compare files line by line

The diff command is much more sophisticated than cmp. It compares two files line by line and displays the lines that are different. The comparison works best when both files are in the same sort order. There is also a diff3 command, which is like diff but compares 3 files instead of 2.

The command sets return code 0 if both files are identical and 1 if there are differences. We can use the echo $? command to display the return code after running diff.

Note: The output of the diff command can be fed to another command called patch to apply the differences to the original file and bring it up to date. Discussion of patch is beyond the scope of this chapter.

Option Description
-s display a message when both files are identical; prints nothing by default
-q display a message when both files differ instead of printing the actual differences
-i ignore case of alphabets during comparison
-N treat file not found as empty file
-b ignore changes in the amount of whitespace
-y print the differences side-by-side; we can use sdiff instead of diff -y

Other option(s)

--suppress-common-lines can be used with -y to suppress matching records

Examples:

Both files have the same contents, except that a few records are all uppercase. File 2 has one extra record.

$ cat student_file_v1.txt
S00001 HUANG      AMANDA     30-NOV-2010 MATH 06  58 Z
S00002 Sherman    Cody       23-May-2009 MATH 07  89 Z
S00003 Gordon     Heidi      09-Nov-2009 MATH 07  87 Z
S00004 GOMEZ      KELLY      29-APR-2010 MATH 06  92 Z
S00005 Cook       Lisa       28-Sep-2011 MATH 05  83 Z
S00006 Daugherty  Joshua     05-Apr-2011 MATH 05  94 Z
S00007 MACIAS     EMMA       21-SEP-2009 MATH 07  79 Z

$ cat student_file_v2.txt
S00001 Huang      Amanda     30-Nov-2010 MATH 06  58 Z
S00002 Sherman    Cody       23-May-2009 MATH 07  89 Z
S00003 Gordon     Heidi      09-Nov-2009 MATH 07  87 Z
S00004 Gomez      Kelly      29-Apr-2010 MATH 06  92 Z
S00005 Cook       Lisa       28-Sep-2011 MATH 05  83 Z
S00006 Daugherty  Joshua     05-Apr-2011 MATH 05  94 Z
S00007 Macias     Emma       21-Sep-2009 MATH 07  79 Z
S00008 Molina     William    21-Sep-2011 MATH 05  57 Z

$ cp -v student_file_v2.txt student_file_v2.txt.bk
'student_file_v2.txt' -> 'student_file_v2.txt.bk'

diff without any options

By default, diff performs a case-sensitive comparison and displays records that are different. In this case records 1, 4 and 7 are different.

Records from FILE 1 are prefixed with < and records from FILE 2 with >. The --- separates the FILE 1 lines from the FILE 2 lines within each change block.

$ diff student_file_v1.txt student_file_v2.txt
1c1
< S00001 HUANG      AMANDA     30-NOV-2010 MATH 06  58 Z
---
> S00001 Huang      Amanda     30-Nov-2010 MATH 06  58 Z
4c4
< S00004 GOMEZ      KELLY      29-APR-2010 MATH 06  92 Z
---
> S00004 Gomez      Kelly      29-Apr-2010 MATH 06  92 Z
7c7,8
< S00007 MACIAS     EMMA       21-SEP-2009 MATH 07  79 Z
---
> S00007 Macias     Emma       21-Sep-2009 MATH 07  79 Z
> S00008 Molina     William    21-Sep-2011 MATH 05  57 Z

diff -q: display a message if both files are different

By default diff prints the differences between two files. If we are not concerned with the actual differences, we can use the -q option to just print a message when the files differ.

$ diff -q student_file_v1.txt student_file_v2.txt
Files student_file_v1.txt and student_file_v2.txt differ

$ echo $?
1

diff -s: display a message if both files are identical

# no output if files are identical: default
$ diff student_file_v2.txt student_file_v2.txt.bk

# -s: display a message if both files are identical
$ diff -s student_file_v2.txt student_file_v2.txt.bk
Files student_file_v2.txt and student_file_v2.txt.bk are identical

Discard the output and inspect $?

The -s and -q options display the status as a message when the files are identical or different, respectively. However, we will not know in advance whether the files are going to match. If we do not need the actual differences, we can redirect the output of the diff command to /dev/null, a special device file that silently discards whatever is written to it; it is often called the bit bucket or black hole.

# discard differences 
$ diff student_file_v1.txt student_file_v2.txt > /dev/null

# echo $?: 0=identical, 1=differences
$ echo $?
1

# compare original and backup; no messages if files are identical
$ diff student_file_v2.txt student_file_v2.txt.bk

# echo $?: 0=identical, 1=differences
$ echo $?
0

diff -i: case-insensitive comparison

As we saw in the first example, the diff command without any options performs a case-sensitive comparison. At times, the only difference between two files might be the letter case: version 1 of a file might have names in title case while version 2 has them in uppercase. If we want to compare such files and find the real differences, such as extra or missing records or changed fields, we can use the -i option to perform a case-insensitive comparison.

# records 1, 4 and 7 matched with -i option
$ diff -i student_file_v1.txt student_file_v2.txt
7a8
> S00008 Molina     William    21-Sep-2011 MATH 05  57 Z

diff -N: Dealing with File Not Found issues

By default, if one of the files passed as arguments does not exist, the diff command displays an error message and sets return code 2. The -N option instructs diff to treat a non-existent file as an empty file.

$ diff student_file_v4.txt student_file_v1.txt
diff: student_file_v4.txt: No such file or directory

$ diff -N student_file_v4.txt student_file_v1.txt
0a1,7
> S00001 HUANG      AMANDA     30-NOV-2010 MATH 06  58 Z
> S00002 Sherman    Cody       23-May-2009 MATH 07  89 Z

diff -b: Dealing with leading and trailing blanks

Let us extract the last names from the student_file_v*.txt files and compare them. We will cut 11 bytes from one file and 10 bytes from the other, then run diff with and without the -b option.

# extract last names from the v1 file (a blank line was added at the top)
$ cut -c8-18 student_file_v1.txt  > names_v1.txt

# extract last names from the v2 file (a blank line was added at the end)
$ cut -c8-17 student_file_v2.txt > names_v2.txt

# display one record from each file and show newline as $
$ head -1 names_v* | cat -E
==> names_v1.txt <==$
HUANG      $
==> names_v2.txt <==$
Huang     $

# compare case-insensitive and ignore leading/trailing blanks
# one extra record "Molina" in file 2 and 
# one blank rec each
$ diff -bi  names_v1.txt names_v2.txt
1d0
<
8a8,10
> Molina
>

diff -y: side by side comparison

We can use diff -y or sdiff to compare two files and display the matched and unmatched records side by side. Unlike the default diff, the -y option displays both matched and unmatched records. If we do not want to see the matched records, we can use the --suppress-common-lines option along with -y.

$ diff -y names_v1.txt names_v2.txt
              | Huang
HUANG         | Sherman
Sherman       | Gordon
Gordon        | Gomez
GOMEZ         | Cook
Cook          | Daugherty
Daugherty     | Macias
MACIAS        |
              > Molina
              > 

# suppress trailing blanks and case-insensitive compare
# discard matched records
# one blank record in each file and one extra record
$ diff -b -y -i --suppress-common-lines   names_v1.txt names_v2.txt
             <
             >
             > Molina

comm: compare 2 sorted files side-by-side

The comm command is a good choice for comparing files with short records, like name lists. It compares the files and displays a 3-column output: column 1 contains the records unique to FILE 1, column 2 the records unique to FILE 2, and column 3 the records common to both files.

Option Description
-1 suppress unique records from FILE 1
-2 suppress unique records from FILE 2
-3 suppress common records from FILE 1 and FILE 2
--total print summary: count of records unique to file 1, unique to file 2, and common to both
--output-delimiter=CHAR override the output delimiter; TAB is the default

Examples:

Let us tweak the names_v1.txt and names_v2.txt files a bit, keeping 4 common records and 2 unique records each, in sorted order.

$ cat names_v1.txt
Cook
Daugherty
Gomez
Gordon
Huang
Macias

$ cat names_v2.txt
Gomez
Gordon
Huang
Macias
Molina
Sherman

comm: basic comparison

$ comm names_v1.txt names_v2.txt
Cook
Daugherty
                Gomez
                Gordon
                Huang
                Macias
        Molina
        Sherman

comm: suppress columns

# show only unique records from FILE 1
$ comm -2 -3 names_v1.txt names_v2.txt
Cook
Daugherty

# show only unique records from FILE 2
$ comm -1 -3 names_v1.txt names_v2.txt
Molina
Sherman

# show common records between FILE 1 and FILE 2
$ comm -1 -2 names_v1.txt names_v2.txt
Gomez
Gordon
Huang
Macias

comm --total: display comparison summary

Even if we suppress columns, the --total summary line always shows the counts of records unique to file 1, unique to file 2, and common to both.

$ comm --total names_v1.txt names_v2.txt
Cook
Daugherty
                Gomez
                Gordon
                Huang
                Macias
        Molina
        Sherman
2       2       4       total

# show only unique records from FILE 1 and summary
$ comm --total -2 -3 names_v1.txt names_v2.txt
Cook
Daugherty
2       2       4       total

comm: display only summary

$ comm --total -1 -2 -3 names_v1.txt names_v2.txt
2       2       4       total

Override Output delimiter

# a delimiter is printed only before the columns being output;
# records unique to FILE 1 have no leading delimiter
$ comm --output-delimiter=, names_v1.txt names_v2.txt
Cook
Daugherty
,,Gomez
,,Gordon
,,Huang
,,Macias
,Molina
,Sherman

Get unique records from both files

The comm command creates a 3-column output with TAB as the column delimiter. If we pipe the output of comm into tr to delete the TAB characters, all 3 columns collapse into one. To get the unique rows, we can further sort and uniq the output.

To demonstrate this, I have added some duplicate rows to both files.

# common and unique rows merged as one column (with duplicates)
$ comm names_v1.txt names_v2.txt | tr -d '\t'
Cook
Daugherty
Gomez
Gordon
Gordon
Gordon
Huang
Macias
Macias
Macias
Molina
Sherman

# sort the output and remove duplicates
# instead of 'uniq' we can use 'sort -u'
$ comm names_v1.txt names_v2.txt | tr -d '\t' | sort | uniq
Cook
Daugherty
Gomez
Gordon
Huang
Macias
Molina
Sherman

Comparison using file checksum

We have discussed the diff, comm and cmp commands that can be used for comparing files. At times, we are just interested in whether the files are identical, without worrying about the actual differences. This can be accomplished by checking the return code with echo $? right after running the comparison command: return code 0 means identical and 1 means differences were found.

Another way is to compute a checksum, also known as a digest or hash value, for each file. If the checksums match, the files are identical for all practical purposes. The commands cksum, md5sum and shasum can be used to get checksum values; each uses a different hashing algorithm, with cksum being the oldest and shasum the latest.

cksum displays the checksum, file size and file name; md5sum and shasum display just the checksum and file name.

Let us compare the names_v[12].txt files. We will make a copy of v2 as v3 for an identical-files comparison.

Examples:

$ cp -v names_v2.txt names_v3.txt
'names_v2.txt' -> 'names_v3.txt'

# using cksum: v2 and v3 have the same checksum 
$ cksum names_v1.txt names_v2.txt names_v3.txt
1628559629 108 names_v1.txt
3965335266 96 names_v2.txt
3965335266 96 names_v3.txt

# same as cksum; v2 and v3 have the same checksum
# md5sum is far less prone to collisions than cksum
$ md5sum names_v*
9dbff2393937f0724cd532f83756955b  names_v1.txt
9cbf9656ccd27ca37b5a3be7b5b4de71  names_v2.txt
9cbf9656ccd27ca37b5a3be7b5b4de71  names_v3.txt

# shasum is stronger than md5sum;
# we can even create longer digests using options
# v2 and v3 have the same checksum
$ shasum names_v*
d567ba96fc3aa36c28708f9c8075cdf607bd3ea8  names_v1.txt
a44bf3187d86a30e5273e939035d9d916b10568f  names_v2.txt
a44bf3187d86a30e5273e939035d9d916b10568f  names_v3.txt
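
With shasum, the -a option selects the algorithm and hence the digest length; for example, SHA-256 produces a 64-hex-digit digest instead of the 40 digits of the default SHA-1 (digest values omitted here):

# -a 256: use the SHA-256 algorithm
$ shasum -a 256 names_v2.txt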

shuf: shuffle records / arguments

The shuf command randomizes the records of a file or the arguments passed to the command. This comes in handy when we need to pick random records or words (arguments delimited by spaces). The shuffled records are written to stdout. Options allow us to limit the number of output records, repeat records, or write the output to a file.

Option Description
-e treat each arg as an input record
-n N limit the number of output records
-r repeat records (duplicate), use along with -n
-o FILE write to FILE

Examples:

shuf -e ARG(s): Randomize inputs passed as arguments

$ shuf -e abcd efgh ijkl mnop qrstu vwxyz
ijkl
abcd
efgh
vwxyz
mnop
qrstu

shuf -n N: Limit output records

$ shuf -n3 -e abcd efgh ijkl mnop qrstu vwxyz
qrstu
ijkl
efgh

shuf -n N -r: repeat input records and limit records

# duplicates: rec=1 and 4, 6 and 7, 5 and 8
$ shuf -n10 -r  -e abcd efgh ijkl mnop qrstu vwxyz | nl
     1  mnop
     2  qrstu
     3  abcd
     4  mnop
     5  vwxyz
     6  ijkl
     7  ijkl
     8  vwxyz
     9  efgh
    10  qrstu

shuf -o FILE: shuffle and write to file

# displays nothing, use `cat FILE` to inspect output
$ shuf -n3  -o shuffle.txt -e abcd efgh ijkl mnop qrstu vwxyz

$ cat shuffle.txt
mnop
abcd
ijkl

shuf: randomize records from a file

Let us add line numbers to names_v1.txt using the nl command and process it through the shuf command.

DIY: Try other options discussed above with shuf and FILE

$ nl names_v1.txt > names.txt

# original file with record numbers
$ cat names.txt
     1  Cook
     2  Daugherty
     3  Gomez
     4  Gordon
     5  Huang
     6  Macias
     
# shuffled records
$ shuf names.txt
     4  Gordon
     1  Cook
     6  Macias
     3  Gomez
     5  Huang
     2  Daugherty

Use Case

Fake Data Creation Script

Install the faker module from pip before running this script. This document does not support copy-paste very well; you may have to type the script in or create a sample file yourself.

from faker import Faker
import random

f = Faker()

subjects = ('MATH', 'SCIE', 'SOCS')

cf = open('student.csv', 'w')   # delimited file
ff = open('student.txt', 'w')   # fixed-format file

print("Id,LastName,FirstName,DOB,grade,subject,score,En",file=cf)
for i in range(1, 101):
  stu_id  = f"S{i:05}"
  lname   = f.last_name()
  fname   = f.first_name()
  birth_year  = random.randint(2009, 2013)
  birth_ddmmm = f.date("%d-%b")   # random day-month such as 19-Nov
  dob     = f"{birth_ddmmm}-{birth_year}"
  # grade derived from birth year: 2009 -> 07, ..., 2013 -> 03
  grade   = f"{(2021 - birth_year - 5):02}"

  for subj in subjects:
    scr = random.randrange(51, 100)
    print(f"{stu_id} {lname:10} {fname:10} {dob} {subj} {grade} {scr:3} Z", file=ff)
    print(stu_id, lname, fname, dob, subj, grade, scr, "Z", sep=",", file=cf)

ff.close()
cf.close()