In this chapter, let us discuss ways to manipulate the data from files; operations such as sort, slice, join, reformat, aggregate, etc. I had originally named this chapter File Manipulation Commands; however, none of the commands discussed here actually changes the input file. These commands just manipulate the contents and display the output on stdout, which can be redirected into a file.
This subset of commands is among the most frequently used, and gaining expertise with them will be really beneficial to developers, testers and anyone who works with files. Along with the commands used for viewing files and IO redirection, these data manipulation commands let us craft command pipelines to extract, reformat and store data in files without writing complex programs.
In order to explore the commands, we will use the following 2 files: a fixed-format file and a delimited file with the same data. Feel free to use these or your own samples with the commands and options specified. A Python script to create these files is available at the end of the chapter. In addition to these two files, we will also use a few smaller files; their contents will be displayed before they are used in a demo.
student.csv
Id,LastName,FirstName,subject,DOB,grade,score,En
S00001,Carlson,Crystal,SOCS,15-May-2013,03,98,Z
S00002,Brennan,Angel,MATH,21-Feb-2013,03,74,Z
...
...
S00004,Brooks,Jillian,MATH,10-Dec-2010,06,62,Z
student.txt
S00001 Carlson Crystal 15-May-2013 SOCS 03 98 Z
S00002 Brennan Angel 21-Feb-2013 MATH 03 74 Z
...
...
S00004 Brooks Jillian 10-Dec-2010 MATH 06 62 Z
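As a quick preview of the kind of pipeline this chapter builds toward, here is a sketch that extracts the last-name field from student.csv, sorts it and removes the duplicates; every command used here is covered in the sections below.
# last names, sorted and de-duplicated (cut, sort and uniq are covered below)
$ cut -d, -f2 student.csv | sort | uniq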
| # | Name | Description |
|---|---|---|
| 1 | cut | create slices from each record |
| 2 | paste | merge multiple lines from a file into a single line, or merge multiple files line by line |
| 3 | split | split a file into smaller chunks; by default, each split file contains 1000 lines; options allow splitting by a different line count or by byte size |
| 4 | join | merge already sorted files by keys; by default it performs an inner or equality join, displaying only records with matching keys |
| 5 | sort | sort files; a rich set of options is available |
| 6 | uniq | create unique records from an already sorted file |
| 7 | tr | simple translation and deletion of characters |
| 8 | column | display a file in tabular format |
| 9 | colrm | remove columns from a file |
| 10 | cmp | compare two files byte by byte and display a summary of the byte/line where the first difference occurs; the number of bytes compared can be limited |
| 11 | diff | compare two sorted files and display the records that are different |
| 12 | comm | compare two sorted files and display the unique records from file 1, file 2 and the matched records in 3 columns; useful for comparing files with shorter records |
| 13 | cksum | create a checksum and byte count of a given file |
| 14 | md5sum | create a checksum using the MD5 digest algorithm |
| 15 | shasum | create a checksum using SHA, aka the Secure Hash Algorithm |
| 16 | shuf | shuffle records from a file |
cut : create slices from each record
The cut command is used to extract portions of each record to create a subset of a given file. The subset can be created by providing column ranges, e.g. col 1-10,25-30,50-end, or by field numbers if we have a delimited file.
Another use of the cut command is to convert a delimited file from one delimiter to another, for example, CSV to a pipe-delimited file.
| Option | Description |
|---|---|
| -c | column positions to extract |
| -d CH | delimiter to be used; CH should be a single character; used along with -f; default is TAB |
| -f | fields to extract; works with the -d option |
| -s | skip lines that do not have delimiters; works with the -d option |
| --complement | select the columns/fields that are not mentioned in the -c or -f options |
| --output-delim | use a different delimiter on output records; by default, the input delimiter is used |
Note: While using -f or -c, we need to provide the individual column positions or field numbers, a range, or a combination of both. We can use the following conventions:

- m-n : columns or fields from m to n
- m,n : columns or fields m and n
- m- : from m till the end of the record

Examples

- -c1-10,25 : columns 1 to 10 and column 25 (11 bytes)
- -d, -f1-5,10 : fields 1 to 5 and field 10 in a comma-delimited record
- -c1-10,25,30,40-45,50- : columns 1 to 10, 25, 30, 40 to 45 and 50 till the end of the record
- -d, -f1,5,11-15,30- : fields 1, 5, 11 to 15 and from field 30 to the last field in a comma-delimited record

extract last and first names from student.txt
The data is in fixed format; each field starts at the same column position. Fields are separated by a space, and each student has 3 records, one per subject: MATH, SCIE and SOCS.
# sample data
$ head -3 student.txt
S00001 Kim Ivan 19-Nov-2012 MATH 04 99 Z
S00001 Kim Ivan 19-Nov-2012 SCIE 04 99 Z
S00002 Pena Bryan 07-Jun-2013 MATH 03 57 Z
$ cut -c8-17,19-28 student.txt
Kim Ivan
Kim Ivan
Pena Bryan
...
...
Rosario Emily
extract last and first names from student.csv
# sample data
$ head -3 student.csv
Id,LastName,FirstName,subject,DOB,grade,score,En
S00001,Kim,Ivan,MATH,19-Nov-2012,04,99,Z
S00001,Kim,Ivan,SCIE,19-Nov-2012,04,99,Z
# delimiter is comma, get fields 2 and 3
$ cut -d, -f2,3 student.csv
LastName,FirstName
Kim,Ivan
Kim,Ivan
...
...
Rosario,Emily
extract fields other than first and last name
# --complement: get fields other than those listed in '-c' or '-f'
$ cut -d, -f2,3 --complement student.csv
Id,subject,DOB,grade,score,En
S00001,MATH,19-Nov-2012,04,99,Z
S00001,SCIE,19-Nov-2012,04,99,Z
...
...
S00003,SOCS,21-Apr-2010,06,57,Z
convert CSV to pipe-delimited file
# -f1-: select all fields
# --output-delim='|' : set separate output delimiter
$ cut -d, -f1- --output-delim='|' student.csv
Id|LastName|FirstName|subject|DOB|grade|score|En
S00001|Kim|Ivan|MATH|19-Nov-2012|04|99|Z
S00001|Kim|Ivan|SCIE|19-Nov-2012|04|99|Z
S00001|Kim|Ivan|SOCS|19-Nov-2012|04|75|Z
...
...
S00030|Dewey|Jack|SOCS|21-Apr-2010|06|57|Z
convert CSV to a TAB-delimited file
TAB as the output delimiter works slightly differently than other single-character delimiters: we need to use --output-delim=$'\t' instead of just \t, so that the shell expands the escape into a real TAB character.
# let us skip header
$ tail -n +2 student.csv | cut -d, -f1- --output-delim=$'\t'
S00001 Kim Ivan MATH 19-Nov-2012 04 99 Z
...
...
S00003 Rosario Emily SOCS 21-Apr-2010 06 57 Z
skip records that do not have delimiters
Some delimited files have blank lines at the end along with a summary of the record count; files exported from database tables, for example. We can skip the blank records and the summary record using the -s option.
$ head student.csv
Id,LastName,FirstName,subject,DOB,grade,score,En
S00001,Kim,Ivan,MATH,19-Nov-2012,04,99,Z
...
...
S00100,Newton,Charles,SOCS,26-Jun-2012,04,92,Z
300 record(s) exported from STUDENT table
# last 3 records discarded, remove `-s` and try this command
$ cut -d, -f1- -s student.csv
Id,LastName,FirstName,subject,DOB,grade,score,En
S00001,Kim,Ivan,MATH,19-Nov-2012,04,99,Z
S00001,Kim,Ivan,SCIE,19-Nov-2012,04,99,Z
...
...
S00100,Newton,Charles,SOCS,26-Jun-2012,04,92,Z
Note:
There might be situations where we need to slice and rearrange columns or fields. The cut command is suitable only for slicing and does not support rearranging columns or fields. We will look into this using another command called awk in the future.
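As a preview, here is a minimal awk sketch of that rearranging; awk is covered in a later chapter, and the echo input here is just a hypothetical three-field record.
# print field 3 before field 2; -F, sets the input field delimiter
$ echo "S00001,Kim,Ivan" | awk -F, '{ print $3 "," $2 }'
Ivan,Kim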
paste : merge multiple files
The paste command reads one record from each file passed as an argument and merges them into a single record, using TAB as the default delimiter. If any of the files has fewer records than the others, an empty string is substituted in its place.
| Option | Description |
|---|---|
| -d CH | output delimiter; default is TAB |
| -s | paste all the records from one file at a time into a single record, like a transpose (see the sketch at the end of this section) |
Examples
We have 2 files: cmd.txt, which contains the command list, and help.txt, which contains the command descriptions. help.txt has one more record than cmd.txt. Let us merge the contents of these two files.
Snippet from sample files
$ cat cmd.txt
echo
cat
head
tail
$ cat help.txt
display argument on file
concatenate files
display first 10 records
display last 10 records
Note that the last record of the merged output has contents from help.txt only.
$ paste cmd.txt help.txt
echo display argument on file
cat concatenate files
head display first 10 records
tail display last 10 records
paste merge records from files
sort files
# using a delimiter
$ paste -d, cmd.txt help.txt
echo,display argument on file
cat,concatenate files
head,display first 10 records
tail,display last 10 records
paste,merge records from files
,sort files
Let us say, we have an address file with 3 lines per address and we want to merge the 3 lines into one and create one record per address. We can use paste to merge every 3 records from the address file into single record.
We need to pass the data via standard input through a pipe, giving paste one - argument for each input line to be merged.
$ head -6 address.txt
Brooks, Sarah
52848 Cynthia Drives
West Jason, MS 82831
Byrd, Kelly
88505 Castro Heights
South Vanessa, MA 96024
# single line address as TAB delimited record
$ head -6 address.txt | paste - - -
Brooks, Sarah 52848 Cynthia Drives West Jason, MS 82831
Byrd, Kelly 88505 Castro Heights South Vanessa, MA 96024
# same as above, as comma-delimited record
$ head -6 address.txt | paste -d, - - -
Brooks, Sarah,52848 Cynthia Drives,West Jason, MS 82831
Byrd, Kelly,88505 Castro Heights,South Vanessa, MA 96024
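The -s option from the options table deserves a quick demo: it pastes all the records of one file into a single record, like a transpose. A minimal sketch using the cmd.txt file from above:
# -s: merge all records of one file into a single record
$ paste -s -d, cmd.txt
echo,cat,head,tail,paste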
split: split a file into pieces by size or record count
The split command creates smaller chunks from a large file; the chunks can be concatenated together to recreate the original file if needed. In the past, due to storage size limitations, large files could not be copied onto a single removable storage device such as a floppy or CD; the split command made such copies possible.
We can create smaller file chunks by specifying the number of records per file, a maximum file size, or even the number of chunks we want. By default, the command splits the file into chunks of 1000 records each, with the leftover records in the last file.
In addition to the various split options, we can provide a file prefix and suffix to be used to name the chunks. If we do not provide them, the command uses x as the prefix and aa, ab, ac, ... as suffixes, creating file names xaa, xab, xac, ...
syntax
$ split [option(s)] FILE [prefix]
common options
| Option | Description |
|---|---|
| -b N | create chunks of N bytes each |
| -l N | create chunks with N records per file |
| -n N | create N files of equal size; may create empty files |
| -e | do not create empty files; used along with -n |
| -a N | suffix length; default is 2: aa, ab, ac, ... |
| -d | use numeric suffix increments instead of aa, ab, ... |
| -x | use hexadecimal suffix increments instead of aa, ab, ... |
other options
| Option | Description |
|---|---|
| --additional-suffix= | provide a static suffix |
| --numeric-suffixes= | same as -d, but allows a starting value |
| --hex-suffixes= | same as -x, but allows a starting value |
examples
Let us create a very large file using the names_v1.txt file and the shuf command with the -n and -r options. The following command creates a 1.4GB file and may take some time to run. We will also create another file with 7550 records to demo splitting by record count.
$ shuf -r -n 123456789 names_v1.txt -o large_file_split_demo.txt
$ zip large_file_split_demo large_file_split_demo.txt
# file size: 1.4G and 73M respectively
$ stat --printf="%n: %s\n" large_file_split_demo.*
large_file_split_demo.txt: 1481481468
large_file_split_demo.zip: 76052943
$ head -7550 large_file_split_demo.txt > split_demo.txt
$ wc -l split_demo.txt
7550 split_demo.txt
default split
By default, the split command creates file chunks with 1000 records per file.
$ ls
large_file_split_demo.txt
large_file_split_demo.zip
names_v1.txt
split_demo.txt
# created files with 'x' prefix and 2 byte alpha suffix
$ split split_demo.txt
$ ls
large_file_split_demo.txt names_v1.txt xaa xac xae xag
large_file_split_demo.zip split_demo.txt xab xad xaf xah
$ wc -l split_demo.txt xa*
7550 split_demo.txt
1000 xaa
1000 xab
1000 xac
1000 xad
1000 xae
1000 xaf
1000 xag
550 xah
15100 total
max records per file
$ rm x*
# split -l 3000: max records per file is 3000
$ split -l 3000 split_demo.txt
$ ls
large_file_split_demo.txt  names_v1.txt    xaa  xac
large_file_split_demo.zip  split_demo.txt  xab
$ wc -l xa*
3000 xaa
3000 xab
1550 xac
7550 total
using prefix / suffix for output files
# pass prefix as the second argument: 'split_file_'
$ split -l 3000 split_demo.txt split_file_
$ wc -l split_file_*
3000 split_file_aa
3000 split_file_ab
1550 split_file_ac
7550 total
# add suffix
$ split -l 3000 split_demo.txt split_file_ --additional-suffix="_chunk.txt"
$ wc -l *chunk*
3000 split_file_aa_chunk.txt
3000 split_file_ab_chunk.txt
1550 split_file_ac_chunk.txt
7550 total
split into N files
We can use the -n NUM option to specify how many chunks we want to create.
$ split -n 4 split_demo.txt split_file_ --additional-suffix="_chunk.txt"
$ wc -l *chunk*
1887 split_file_aa_chunk.txt
1888 split_file_ab_chunk.txt
1887 split_file_ac_chunk.txt
1888 split_file_ad_chunk.txt
7550 total
split by size
Let us use the -b N option to create files of at most N bytes each. We can suffix K, M, G, T to specify sizes in KB, MB, etc. For this demo, we will use the 73 MB zip file.
We will use -d to get numeric suffixes instead of alphabetic ones, and the -a N option to set the suffix length instead of the default 2 characters. With -d, the numeric suffix starts at 0 and is incremented by 1. If we need the suffix to start at some other number, we can use the --numeric-suffixes=NNN option.
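For instance, here is a minimal sketch that starts the numeric suffixes at 5 instead of 0, reusing the split_demo.txt file from the earlier examples.
# numeric suffixes starting at 05 instead of 00
$ split -l 3000 --numeric-suffixes=5 split_demo.txt part_
$ ls part_*
part_05  part_06  part_07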
check the zip file details
$ ls -lh large_file_split_demo.zip
-rw-r--r-- 1 mktutes mktutes 73M Aug 11 05:52 large_file_split_demo.zip
# zipinfo: provides info about the zipped files
# like the ls -l command
$ zipinfo large_file_split_demo.zip
Archive: large_file_split_demo.zip
Zip file size: 76052943 bytes, number of entries: 1
-rw-r--r-- 3.0 unx 1481481468 tx defN 21-Aug-11 05:49 large_file_split_demo.txt
1 file, 1481481468 bytes uncompressed, 76052743 bytes compressed: 94.9%
split the files by 10MB chunk
# use numeric suffix of 5 bytes long
# output file prefix is "num_"
$ split -d -a 5 -b 10M large_file_split_demo.zip num_
$ ls -lh num_*
-rw-r--r-- 1 mktutes mktutes 10M Aug 11 07:56 num_00000
-rw-r--r-- 1 mktutes mktutes 10M Aug 11 07:56 num_00001
-rw-r--r-- 1 mktutes mktutes 10M Aug 11 07:56 num_00002
-rw-r--r-- 1 mktutes mktutes 10M Aug 11 07:56 num_00003
-rw-r--r-- 1 mktutes mktutes 10M Aug 11 07:56 num_00004
-rw-r--r-- 1 mktutes mktutes 10M Aug 11 07:56 num_00005
-rw-r--r-- 1 mktutes mktutes 10M Aug 11 07:56 num_00006
-rw-r--r-- 1 mktutes mktutes 2.6M Aug 11 07:56 num_00007
concat file chunks and check the integrity of the zip file
$ cat num_0000* > temp.zip
$ ls -lh temp.zip
-rw-r--r-- 1 mktutes mktutes 73M Aug 11 07:59 temp.zip
$ md5sum *.zip
a6d0061a44673d9493f238850eb20c5a large_file_split_demo.zip
a6d0061a44673d9493f238850eb20c5a temp.zip
$ zipinfo temp.zip
Archive: temp.zip
Zip file size: 76052943 bytes, number of entries: 1
-rw-r--r-- 3.0 unx 1481481468 tx defN 21-Aug-11 05:49 large_file_split_demo.txt
1 file, 1481481468 bytes uncompressed, 76052743 bytes compressed: 94.9%
join: merge two sorted files using a key
The join command merges two files that are lexically sorted on a common field. By default, it merges the matched records from both files. We can also write unmatched records from one or both files using additional options. By default, the output has the join key followed by the remaining fields from FILE1 and then FILE2.
The command uses space as the default field delimiter and considers the first field as the key for comparison. We can override the delimiter and specify different field numbers as the comparison key.
The join command works similarly to the JOIN operation in SQL queries; however, it only uses a single field for matching.
| Option | Description |
|---|---|
| -a 1 | print unmatched key and data from FILE 1 along with matches; left outer join |
| -a 2 | print unmatched key and data from FILE 2 along with matches; right outer join |
| -a 1 -a 2 | print unmatched records from both files; full outer join |
| -v 1 | suppress matched records; print only unmatched records from FILE 1 |
| -v 2 | suppress matched records; print only unmatched records from FILE 2 |
| -v 1 -v 2 | suppress matched records from both files |
| -t CH | use CH as the input field delimiter instead of space |
| -i | ignore case while comparing fields; sort the inputs using the -f option |
| -j N | use field N from both files for comparison instead of the first field |
| -1 N | use field N from FILE 1 for comparison |
| -2 N | use field N from FILE 2 for comparison |
| -e STR | print STR for missing values; we need to use -o auto along with -a and -e |
| -o FMT | override the default merge order; -o 0 displays only the keys; use FILENUM.FIELDNUM,... to rearrange fields |
| --header | treat the first line as a header |
sample files
For this demo, let us create two files with numbers in ascending order as the first field. The second field is the Roman numeral in file one and the number spelled out in file two. These files are named roman.txt and english.txt.
We will use --header option to treat the first record from both files as header.
$ cat roman.txt
NBR Roman
1 I
2 II
3 III
4 IV
5 V
$ cat english.txt
NUM Text
0 Zero
1 One
2 Two
3 Three
4 Four
inner join: also known as natural join (default)
The join command without any options merges the two files using the first field, delimited by spaces, and outputs the matched records from both files. The unmatched records from both files are skipped. In this example, the last record, 5 V, from roman.txt and the first record, 0 Zero, from english.txt are skipped.
$ join --header roman.txt english.txt
NBR Roman Text
1 I One
2 II Two
3 III Three
4 IV Four
left outer join: print matched records and unmatched from FILE 1
The -a 1 option prints the unmatched records from FILE 1 along with the matched records. By default, the unmatched records will not have any data in the FILE 2 columns. If we want to print default values, we need to use the -e VAL and -o auto options.
$ join --header -a 1 roman.txt english.txt
NBR Roman Text
1 I One
2 II Two
3 III Three
4 IV Four
5 V
# print NA for missing values; see record: `5 V NA`
$ join --header -a 1 -e NA -o auto roman.txt english.txt
NBR Roman Text
1 I One
2 II Two
3 III Three
4 IV Four
5 V NA
right outer join: print matched records and unmatched from FILE 2
The -a 2 option prints matched records and unmatched records from FILE 2. Missing FILE 1 values in the unmatched records display NA. Look at the record 0 NA Zero.
$ join --header -a 2 -e NA -o auto roman.txt english.txt
NBR Roman Text
0 NA Zero
1 I One
2 II Two
3 III Three
4 IV Four
full outer join: print matched and unmatched records from both files
$ join --header -a 1 -a 2 -e NA -o auto roman.txt english.txt
NBR Roman Text
0 NA Zero
1 I One
2 II Two
3 III Three
4 IV Four
5 V NA
extract unmatched records only
Sometimes we might be interested in the missing records. For example, if we run the same code twice, we may expect both runs to produce the same number of records with matching fields. We can verify this expectation by running the join command with no options: if the record counts of both files are the same and the output record count matches the input count, our expectation is met. However, if we see differences in record counts between the input files, or between input and output, we need to know which keys did not match between the two files.
We can use the -v N option to suppress the matched records and -a N to display the unmatched records from FILE N (N=1 or 2). We can get the unmatched records from one of the files or both files. Furthermore, if we are just interested in only the keys, we can display the keys alone using the -o option.
# join:
# --header: treat first record as header and use it for display
# -v 1 : suppress matching records
# -a 1 : display unmatched records from FILE 1
# -o 0 : display only the keys
$ join --header -v 1 -a 1 -o 0 roman.txt english.txt
NBR
5
# unmatched records from FILE 2: -v 2 -a 2
$ join --header -v 2 -a 2 -o 0 roman.txt english.txt
NBR
0
# unmatched records from both files
$ join --header -v 2 -a 2 -a 1 -o 0 roman.txt english.txt
NBR
0
5
override the input field delimiter and specify matching field numbers
If the files we use in join have a different delimiter, we can use -t CHAR to override the default delimiter (space). Both input files should use the same delimiter.
Let us create roman.csv as a comma-delimited file with fields 1 and 2 swapped. Let us also create another file called num_string.csv with the numbers spelled out in English and German as fields 1 and 2, and the matching field (Arabic numerals) as field 3.
$ head -2 roman.csv
Roman,NBR
I,1
$ head -2 num_string.csv
English,German,NUM
Zero,Null,0
# join -t: input delim, -1 N and -2 N : matching field numbers
$ join --header -t, -1 2 -2 3 roman.csv num_string.csv
NBR,Roman,English,German
1,I,One,Eins
2,II,Two,Zwei
3,III,Three,Drei
4,IV,Four,Vier
reorder output fields
In the above example, the output of join is formatted as follows: key, fields from file 1, then fields from file 2. This can be overridden using the -o option, where 0 represents the key field and FILE_NBR.FIELD_NBR specifies an individual field. For example, -o 0 1.2 2.1 prints the key followed by the 2nd field of file 1 and the 1st field of file 2.
$ join --header -t, -1 2 -2 3 -o "0 2.1 1.1 2.2" roman.csv num_string.csv
NBR,English,Roman,German
1,One,I,Eins
2,Two,II,Zwei
3,Three,III,Drei
4,Four,IV,Vier
case-insensitive comparison
When join compares the matching fields, it does a case-sensitive comparison, which treats uppercase and lowercase letters as distinct. Sometimes we may not care about the case of the letters in the key; we can use the -i option to enable a case-insensitive comparison.
If we use -i, the files have to be sorted keeping uppercase and lowercase values together. See the -f option of the sort command in the next section for more info. The matching key in the output will be taken from the first file passed as an argument to the join command.
For the case-insensitive comparison, let us replace the numbers with letters, uppercase in one file and lowercase in the other, and create a new set of files.
$ head -3 alpha_roman.txt
NBR Roman
b II
c III
$ head -3 alpha_num.txt
NUM Text
A One
B Two
# inner join: no matches due to case difference
$ join --header alpha_roman.txt alpha_num.txt
NBR Roman Text
# key will be taken from file 1
$ join --header -i alpha_roman.txt alpha_num.txt
NBR Roman Text
b II Two
c III Three
d IV Four
$ join --header -i alpha_num.txt alpha_roman.txt
NUM Text Roman
B Two II
C Three III
D Four IV
sort: sort a file
Several commands in this chapter, such as join, uniq and the file comparison commands discussed later, expect their input file(s) to be sorted. Understanding how the sort command works is therefore essential, and gaining expertise with it will pay off for anyone who wants to master the CLI.
By default, the sort command sorts the records starting from column 1 all the way to the last column. It uses the ASCII values of the characters in a given column to decide the order of the records; for example, uppercase A sorts before lowercase a because of their ASCII values, 65 and 97 respectively. Records are sorted in ascending order by default.
The sort command comes with an extensive collection of options to control how the records within a file are sorted. We will look at the commonly used options; once we are familiar with these, we can explore the rest.
Here are the sorting options
| Option | Description |
|---|---|
| -d | dictionary order; consider blanks and alphanumeric characters only |
| -n | numeric sort; instead of column-by-column comparison, treat a sequence of digits as a single number |
| -h | human-readable sort on byte sizes: 10K, 2M, 1G, etc. |
| -M | month sort: JAN, FEB, MAR, etc. |
| -k | specify field numbers / column positions as sort keys |
| -f | case-insensitive sort; some versions of sort do a case-insensitive sort by default |
| -r | sort in descending order; reverse sort |
| -R | random sort; shuffle the records but keep sort keys of the same value together |
| -m | merge already sorted files together |
| -u | unique sort; keep only the first occurrence of a record based on the sort key |
Here are some options to override the default configuration settings.
| Option | Description |
|---|---|
| -t CH | override the default field separator (space) with CH |
| -T DIR | override the default temp directory (/tmp) |
| -o FILE | write sorted output to FILE |
| -c | check whether a file is already sorted; prints an error message if not sorted |
| -C | same as -c but suppresses the error message; use echo $? to check: 0=sorted, 1=not sorted |
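A minimal sketch of the -c and -C options, using the unsorted names.txt file shown in the next example; the exact line reported as out of order depends on the locale.
# -c: report the first out-of-order record, if any
$ sort -c names.txt
sort: names.txt:2: disorder: gerald
# -C: silent check; inspect the return code instead
$ sort -C names.txt
$ echo $?
1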
sort a names file: case-sensitive and case-insensitive
Sorting of letters is usually case-sensitive; uppercase letters sort before lowercase because their ASCII values are smaller. We can use the -f option to perform a case-insensitive sort.
Depending on the sort version and the locale settings, the comparison may be case-insensitive by default. The sort command in Google Cloud Shell does a case-insensitive sort by default, whereas the sort command on a MacBook does a case-sensitive sort.
$ cat names.txt
Macias
gerald
Huang
Gordon
diana
Daugherty
Cook
# mac os: case-sensitive sort by default
$ sort names.txt
Cook
Daugherty
Gordon
Huang
Macias
diana
gerald
# google cloud shell: case-insensitive sort by default
$ sort names.txt
Cook
Daugherty
diana
gerald
Gordon
Huang
Macias
If we want an explicitly case-sensitive sort, we can set the locale environment variable LC_ALL=C.
# case-sensitive sort: setting LC_ALL=C env variable
# LC_ALL: Language LOCALE settings
# uppercase before lowercase
$ LC_ALL=C sort names.txt
Cook
Daugherty
Gordon
Huang
Macias
diana
gerald
# case-insensitive sort
$ LC_ALL=C sort -f names.txt
Cook
Daugherty
diana
gerald
Gordon
Huang
Macias
reverse sort using -r
By default, the records are sorted in ascending order and we can use -r to reverse sort in descending order.
# reverse sort along with case-fold / case-insensitive sort
$ sort -f -r names.txt
Macias
Huang
Gordon
gerald
diana
Daugherty
Cook
numeric sort
By default, sort performs an ASCII sort, comparing the ASCII value of the characters in each column to arrange the records in ascending order. If two records have the same value in a given column, sort moves to the next column and performs the comparison.
The -n option overrides this behaviour and forces sort to treat a consecutive run of digits as a single number rather than a sequence of characters. Let us sort a file with random numbers and see how -n changes the ordering.
$ cat rand_nums.txt
22675
81
91
...
..
3
9
8634
default sort: a column-wise comparison that groups all records starting with 1 together, ordering them by the next column, and so on
$ sort rand_nums.txt
116
151400
199
22057
22675
3
3149
81
8634
88
9
91
numeric sort: runs of digits treated as numbers
$ sort -n rand_nums.txt
3
9
81
88
91
116
199
3149
8634
22057
22675
151400
sorting human-readable file size: KB, MB, GB, TB, …
To sort by file sizes displayed in human-readable form, sort provides the -h option. When listing files with ls, we can use ls -lSh or ls -lShr to have ls do the sorting itself. Sometimes, though, we get the details as a report or a file; in that case, we use -h along with other commands like cut to extract the file name and size and sort by size.
$ sort rand_file_size.txt
1G
1T
20K
2G
2K
2M
3.4K
987
$ sort -h rand_file_size.txt
987
2K
3.4K
20K
2M
1G
2G
1T
sort using a portion of a file
The sort command has the -k or --key=KEYDEF option to sort the file using parts of each record. For a fixed-format file, we can use start and end column numbers. For delimited files, we specify the delimiter using -t and list the field numbers by which we want the file sorted. We can specify more than one key.
The -k option accepts FIELD_NBR.COL_POS arguments with an optional sort type like h or n that applies to that key alone. For example, -k1.9,1.12n -k1.1,1.5 means perform a numeric sort using columns 9-12, then sort by columns 1-5. For fixed-format files the FIELD_NBR is always 1.
For delimited files, we can use field numbers and, optionally, column numbers within the field itself. For example, -k2 -k5n -k1 means sort by the second field, then the fifth field numerically, then the first field. -k2.6,2.10 means sort by the second field but consider only columns 6-10 within it.
Here is a version of the student file from the cut demo. Let us sort it by grade, then score (descending), then date of birth, then last and first name, as annotated in the command below.
$ cat student_math.csv
S00001,Huang,Amanda,30-NOV-2010,MATH,06,58,Z
S00002,Sherman,Cody,23-MAY-2009,MATH,07,89,Z
S00003,Gordon,Heidi,09-NOV-2009,MATH,07,87,Z
S00004,Gomez,Kelly,29-APR-2010,MATH,06,92,Z
S00005,Cook,Lisa,28-SEP-2011,MATH,05,83,Z
S00006,Daugherty,Joshua,05-APR-2011,MATH,05,94,Z
S00007,Macias,Emma,21-SEP-2009,MATH,07,79,Z
Here is the command with options to perform the different types of sorting supported by the sort command: numeric, reverse, month, keys, etc. You may create your own file with duplicate values for any of these fields and try it out for better understanding.
# -t, : comma-delimited file
# -k6,6 : #6 is grade
# -k7,7rn : #7 is score; numeric and descending
# -k4.8,4.11 : #4 is DOB, sort by year @pos 8-11 of field 4
# -k4.4,4.6M : sort by month in 'MMM' form using the 'M' modifier
# -k4.1,4.2 : sort by date
# -k2,2 -k3,3 : last name and first name
$ sort -t, -k6,6 -k7,7rn -k4.8,4.11 -k4.4,4.6M -k4.1,4.2 -k2,2 -k3,3 student_math.csv
S00082,Ross,Scott,05-FEB-2010,MATH,06,85,Z
S00017,Stone,James,30-MAR-2010,MATH,06,84,Z
S00041,Rush,Kristina,14-AUG-2010,MATH,06,84,Z
S00078,Golden,Valerie,10-JAN-2010,MATH,06,83,Z
S00058,Ayala,Keith,29-FEB-2010,MATH,06,83,Z
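The -u option from the sorting options table combines sort and de-duplication in a single step; here is a minimal sketch using the subject column of student.txt, which produces the same result as the sort | uniq pipeline shown in the next section.
# -u: sort and keep only the unique values
$ cut -c42-45 student.txt | sort -u
MATH
SCIE
SOCS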
uniq: get unique values and counts from "sorted" data
The uniq command at its simplest displays unique values by removing duplicate entries. For the command to work, the input should be in sorted order. Using additional options, uniq can do the following:
| Option | Description |
|---|---|
| -u | return entries that do not have any duplicates |
| -d | return one occurrence per entry that has duplicates |
| -D | return all occurrences of entries with duplicates |
| -c | count the number of occurrences of each entry |
less commonly used options:
These options can come in handy if we need to check uniqueness on only part of a record. We can skip the first N characters or fields, but not ones in the middle. I have documented these options for awareness; if interested, please explore them on your own.
| Option | Description |
|---|---|
| -i | ignore case while comparing |
| -f N | skip the first N fields (separated by spaces or TABs) |
| -s N | skip the first N characters for comparison |
| -w N | consider only N characters for comparison |
get uniq subject list from student.txt
The subject code is a 4-byte string starting at position 42. Here are the steps:

- extract the subject code using the cut command
- sort the output of cut
- remove the duplicates using uniq

$ head -3 student.txt
S00001 Kim Ivan 19-Nov-2012 MATH 04 99 Z
S00001 Kim Ivan 19-Nov-2012 SCIE 04 99 Z
S00001 Kim Ivan 19-Nov-2012 SOCS 04 88 Z
# get subject list
$ cut -c42-45 student.txt | sort | uniq
MATH
SCIE
SOCS
get number of records by grade from student.txt
The grade is a 2-byte string starting at position 47. Here are the steps:

- extract the grade using the cut command
- sort the grades
- count the occurrences using uniq with the -c option

Each output line has the format COUNT GRADE.
# grade 03 has 51 records, ...
$ cut -c47-48 student.txt | sort | uniq -c
51 03
60 04
60 05
72 06
57 07
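To find which grade has the most records, we can pipe the uniq -c output back into sort; -rn performs a reverse numeric sort on the leading count.
# grade with the highest record count on top
$ cut -c47-48 student.txt | sort | uniq -c | sort -rn | head -1
72 06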
find students with same last name
The student.txt file has multiple records per student, one per subject, which calls for a longer command pipeline. Here are the steps:

- extract student id, last name and first name using cut
- sort the results and keep the unique combinations of these 3 fields using uniq
- extract the last name alone using cut
- sort the last names
- display the duplicated names using uniq with the -d option

$ cut -c1-28 student.txt | sort | uniq | cut -c8-17 | sort | uniq -d
Evans
Hernandez
Jones
May
Miller
Rivera
Ward
# for example, the file has 2 students with the last name Ward
S00065 Ward Daniel
S00093 Ward Marcus
find students with uniq and duplicate last names
We follow the same steps as in the previous example, except for the last one. We add two more commands to cut -c1-28 student.txt | sort | uniq | cut -c8-17 | sort:

- uniq -u to keep students with unique last names
- uniq -D to keep students with duplicated last names
- wc -l to get just the count

# get count of students with unique last name
$ cut -c1-28 student.txt | sort | uniq | cut -c8-17 | sort | uniq -u | wc -l
84
# get counts of students with shared last names
$ cut -c1-28 student.txt | sort | uniq | cut -c8-17 | sort | uniq -D | wc -l
16
# get record counts of student.txt
$ cut -c1-28 student.txt | sort | uniq | wc -l
100
# 100 == 84 + 16 ---> MATCHES !!!
tr: substitute, delete or squeeze characters
The tr command, an abbreviation for translate, is used to perform data manipulation at the individual character level. We can do the following using tr:
The tr command accepts input from stdin or a pipe only.
| Option | Description |
|---|---|
| -d SET | delete from the input all the characters in SET |
| -c | complement; when used with -d as -dc SET, deletes all the characters that are not in SET |
| -s SET | squeeze; remove duplicates of the characters in SET that are next to each other |
| -t | while substituting characters of SET1 with SET2, truncate SET1 to the length of SET2 |
convert uppercase to lowercase
$ echo "HELLO World" | tr 'A-Z' 'a-z'
hello world
$ echo "HELLO World" | tr [[:upper:]] [[:lower:]]
hello world
substitution cipher: ROT13
The tr command can be used to create simple substitution ciphers by replacing one character with another. This is also called a Caesar cipher, as it was reportedly used by Julius Caesar himself. ROT13 is a special case of the Caesar cipher that shifts each character position by 13: a by n, b by o, and so on. We can use tr to create ciphers like this.
Note: The substituted text reverts to the original if we process it again with the same tr command that created it. Such ciphers can also be cracked with repeated attempts using the tr command itself.
$ echo "Hello World" | tr 'A-Za-z' 'N-ZA-Mn-za-m'
Uryyb Jbeyq
# translate to lowercase first and create cipher
$ echo "Hello World" | tr 'A-Z' 'a-z' | tr 'a-z' 'n-za-m'
uryyb jbeyq
# get the original text back
$ echo "uryyb jbeyq" | tr 'a-z' 'n-za-m'
hello world
$ echo "hello world" | tr 'a-z' 'n-za-m' | tr 'a-z' 'n-za-m'
hello world
The character sets used, a-z and n-za-m, should be of the same length: a-z means the 26 lowercase letters from a to z, and n-za-m contains the same 26 letters, starting at n and wrapping around.
What happens if SET1 has fewer characters than SET2? The extra characters of SET2 are simply ignored. On the other hand, if SET1 has more characters than SET2, the last character of SET2 is repeated to match the length of SET1.
For example, tr 'aeiou' 'AE' is treated as tr 'aeiou' 'AEEEE'. If we do not want this default behaviour, we can use the -t flag, which truncates SET1 to the length of SET2. For example, tr -t 'aeiou' 'AE' is treated as tr 'ae' 'AE'.
$ echo "education" | tr 'aeiou' 'AE'
EdEcAtEEn
$ echo "education" | tr -t 'aeiou' 'AE'
EducAtion
# use case: replace all numbers with *
$ echo 'Current Salary is "65000.00 USD"' | tr '0-9' '*'
Current Salary is "*****.** USD"
delete characters
The -d SET option removes all the characters in SET, whereas -d -c SET (or -dc SET) removes all the characters that are not in SET.
$ echo "education" | tr -d 'aeiou'
dctn
$ echo "education" | tr -dc 'aeiou'
euaio
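A common practical use of -d is stripping the carriage-return characters that files created on Windows carry at the end of every line; the file names here are just placeholders.
# remove DOS/Windows carriage returns (\r, often shown as ^M)
$ tr -d '\r' < dos_file.txt > unix_file.txt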
squeeze characters
The -s SET option squeezes repeated adjacent characters from SET into one. We can use it to remove extra spaces in sentences, for example.
$ echo "hello world"
hello world
$ echo "hello world" | tr -s ' '
hello world
column: format data in tabular form
The column command is very useful when we want output formatted as columns or a table. It comes in handy while viewing delimited files: since each field may be a different length, viewing and understanding the raw data can be difficult.
| Option | Description |
|---|---|
| -s SET | set of delimiters used to split each record into fields |
| -t | determine the number of columns in the input and print in tabular form |
pretty print CSV file in tabular format
$ head -5 student.csv | cut -d, -f1-7
Id,LastName,FirstName,subject,DOB,grade,score
S00001,Kim,Ivan,MATH,19-Nov-2012,04,99
S00001,Kim,Ivan,SCIE,19-Nov-2012,04,99
S00001,Kim,Ivan,SOCS,19-Nov-2012,04,88
S00002,Pena,Bryan,MATH,07-Jun-2013,03,57
$ head -5 student.csv | cut -d, -f1-7 | column -s, -t
Id      LastName  FirstName  subject  DOB          grade  score
S00001  Kim       Ivan       MATH     19-Nov-2012  04     99
S00001  Kim       Ivan       SCIE     19-Nov-2012  04     99
S00001  Kim       Ivan       SOCS     19-Nov-2012  04     88
S00002  Pena      Bryan      MATH     07-Jun-2013  03     57
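The same technique works for any delimited data. A sketch using the colon-delimited /etc/passwd file; the exact entries will differ on your machine.
# pretty print the first 3 entries of /etc/passwd
$ head -3 /etc/passwd | column -s: -t
root    x  0  0  root    /root      /bin/bash
daemon  x  1  1  daemon  /usr/sbin  /usr/sbin/nologin
bin     x  2  2  bin     /bin       /bin/sh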
colrm: remove columns from a file and print
This is a simpler version of the cut command that removes columns based on the start and stop indexes provided as arguments. If only one number is provided, everything from that column to the end of the record is removed, so the command displays columns 1 up to one column before the index. If both start and stop indexes are provided, the command removes the columns from start through stop. The command accepts a file or input from stdin for processing.
The index starts with 1 not 0
We can just remove one set of contiguous columns
using fixed format text file
# sample file
$ head -5 student.txt
S00001 Huang Amanda 30-Nov-2010 MATH 06 58 Z
S00002 Sherman Cody 23-May-2009 MATH 07 89 Z
S00003 Gordon Heidi 09-Nov-2009 MATH 07 87 Z
S00004 Gomez Kelly 29-Apr-2010 MATH 06 92 Z
S00005 Cook Lisa 28-Sep-2011 MATH 05 83 Z
# get student id, last and first name
# extract col 1-29
$ head -5 student.txt | colrm 30
S00001 Huang Amanda
S00002 Sherman Cody
S00003 Gordon Heidi
S00004 Gomez Kelly
S00005 Cook Lisa
# discard date of birth and display the other columns
$ head -5 student.txt | colrm 30 41
S00001 Huang Amanda MATH 06 58 Z
S00002 Sherman Cody MATH 07 89 Z
S00003 Gordon Heidi MATH 07 87 Z
S00004 Gomez Kelly MATH 06 92 Z
S00005 Cook Lisa MATH 05 83 Z
using STDIN
# get column 1-4 alone
$ echo 'abcdefghijklm' | colrm 5
abcd
# remove column 5-8: efgh
$ echo 'abcdefghijklm' | colrm 5 8
abcdijklm
# same as above; using cut command
$ echo 'abcdefghijklm' | cut -c5-8 --complement
abcdijklm
cmp: compare two files byte by byte
cmp is one of the 3 file comparison commands we will discuss in this chapter. It is the simplest of the 3: it compares two files byte by byte. If both files are the same, it displays nothing; otherwise it prints a message saying the files differ at byte N, line N.
The cmp command displays the line number and the byte position where the first difference starts.
The command sets return code 0 if both files are the same, 1 if there is any difference, and 2 if there are issues with the arguments or options. Use the echo $? command after running cmp to display the return code.
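Because of these return codes, cmp -s fits naturally inside shell conditionals; a minimal sketch, with file1.txt and file2.txt as placeholder names:
# act on the cmp return code directly
$ if cmp -s file1.txt file2.txt; then echo "files are identical"; else echo "files differ"; fi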
| Option | Description |
|---|---|
| -i N | skip the first N bytes of both files before comparing |
| -i M:N | skip the first M bytes of FILE 1 and the first N bytes of FILE 2 before comparing |
| -n N | consider only N bytes for comparison |
| -s | suppress normal output; error messages are still displayed; use echo $? to find the status of the comparison |
| -l | verbose; display every byte position at which the files differ, along with the byte values from both files |
cmp is one of the rare commands that DOES NOT use -v for turning on verbosity; it uses -l instead. -v displays the version of the cmp command.
Examples:
sample files
$ cat cmp_demo_01.txt
NumBERS: 0123456789X
$ cat cmp_demo_02.txt
Numbers: 0123 5 7 X
$ cat cmp_demo_03.txt
NUMS: 0123456789X
basic comparison
# compare as it is
# difference at byte 4 of line 1; 'B' in file 1 and 'b' in file 2
$ cmp cmp_demo_01.txt cmp_demo_02.txt
cmp_demo_01.txt cmp_demo_02.txt differ: byte 4, line 1
# -i 8: skip first 8 bytes from both files
# byte 6 after the skip (absolute byte 14) has '4' in file 1 and ' ' in file 2
$ cmp -i 8 cmp_demo_01.txt cmp_demo_02.txt
cmp_demo_01.txt cmp_demo_02.txt differ: byte 6, line 1
# same command but in silent mode
# use 'echo $?' to display return code
$ cmp -i 8 -s cmp_demo_01.txt cmp_demo_02.txt
# rc: 0=files are same, 1=files are different
$ echo $?
1
# -n 5: compare only 5 bytes after skipping first 8 bytes
# 5 bytes match
$ cmp -i 8 -n 5 cmp_demo_01.txt cmp_demo_02.txt
$ echo $?
0
more comparison
$ cat cmp_demo_01.txt cmp_demo_03.txt
NumBERS: 0123456789X
NUMS: 0123456789X
# -i M:N : skip M bytes from file 1 and N bytes from file 2
# skip "NumBER" from file 1 and "NUM" from file 2
$ cmp -i 6:3 cmp_demo_01.txt cmp_demo_03.txt
# matched
$ echo $?
0
The -l option lists each byte position where the files differ, along with the value from both files. The example below has 6 64 40: at byte 6 (after the skip), file 1 has 64 and file 2 has 40. These values are in octal; 64 and 40 octal are the characters '4' and space (decimal 52 and 32 in ASCII).
# -l: verbose
# skip some bytes and compare files
# the command compares "0123456789X" and "0123 5 7 X"
$ cmp -i8 -l cmp_demo_01.txt cmp_demo_02.txt
6 64 40
8 66 40
10 70 40
11 71 40
diff: compare files line by line
The diff command is much more sophisticated than the cmp command. It compares two files line by line and displays the lines that are different. For better comparison, both files should be sorted. There is also a diff3 command, which is like diff but compares 3 files instead of 2.
The command sets return code 0 if both files are identical and 1 if there are differences. We can use the echo $? command to display the return code after running diff.
Note: The output of the diff command can be used with another command called patch to merge the differences between two files and update the original file. Discussion of patch is beyond the scope of this chapter.
| Option | Description |
|---|---|
| -s | display a message when both files are identical; prints nothing by default |
| -q | display a message when the files differ instead of printing the actual differences |
| -i | ignore the case of letters during comparison |
| -N | treat a missing file as an empty file |
| -b | ignore whitespace |
| -y | print the differences side by side; we can use sdiff instead of diff -y |
Other option(s)
--suppress-common-lines can be used with -y to suppress matching records
Examples:
Both files have the same contents, except that a few records in file 1 are all uppercase. File 2 has one extra record.
$ cat student_file_v1.txt
S00001 HUANG AMANDA 30-NOV-2010 MATH 06 58 Z
S00002 Sherman Cody 23-May-2009 MATH 07 89 Z
S00003 Gordon Heidi 09-Nov-2009 MATH 07 87 Z
S00004 GOMEZ KELLY 29-APR-2010 MATH 06 92 Z
S00005 Cook Lisa 28-Sep-2011 MATH 05 83 Z
S00006 Daugherty Joshua 05-Apr-2011 MATH 05 94 Z
S00007 MACIAS EMMA 21-SEP-2009 MATH 07 79 Z
$ cat student_file_v2.txt
S00001 Huang Amanda 30-Nov-2010 MATH 06 58 Z
S00002 Sherman Cody 23-May-2009 MATH 07 89 Z
S00003 Gordon Heidi 09-Nov-2009 MATH 07 87 Z
S00004 Gomez Kelly 29-Apr-2010 MATH 06 92 Z
S00005 Cook Lisa 28-Sep-2011 MATH 05 83 Z
S00006 Daugherty Joshua 05-Apr-2011 MATH 05 94 Z
S00007 Macias Emma 21-Sep-2009 MATH 07 79 Z
S00008 Molina William 21-Sep-2011 MATH 05 57 Z
$ cp -v student_file_v2.txt student_file_v2.txt.bk
'student_file_v2.txt' -> 'student_file_v2.txt.bk'
diff without any options
By default, diff performs a case-sensitive comparison and displays records that are different. In this case records 1, 4 and 7 are different.
Differing records from FILE 1 are prefixed with < and those from FILE 2 with >. The --- separates the FILE 1 and FILE 2 versions of each difference.
$ diff student_file_v1.txt student_file_v2.txt
1c1
< S00001 HUANG AMANDA 30-NOV-2010 MATH 06 58 Z
---
> S00001 Huang Amanda 30-Nov-2010 MATH 06 58 Z
4c4
< S00004 GOMEZ KELLY 29-APR-2010 MATH 06 92 Z
---
> S00004 Gomez Kelly 29-Apr-2010 MATH 06 92 Z
7c7,8
< S00007 MACIAS EMMA 21-SEP-2009 MATH 07 79 Z
---
> S00007 Macias Emma 21-Sep-2009 MATH 07 79 Z
> S00008 Molina William 21-Sep-2011 MATH 05 57 Z
diff -q: display a message if both files are different
By default, diff prints the differences between two files. If we are not concerned with the actual differences, we can use the -q option to just print a message when the files differ.
$ diff -q student_file_v1.txt student_file_v2.txt
Files student_file_v1.txt and student_file_v2.txt differ
$ echo $?
1
diff -s: display a message if both files are identical
# no output if files are identical: default
$ diff student_file_v2.txt student_file_v2.txt.bk
# -s: display a message if both files are identical
$ diff -s student_file_v2.txt student_file_v2.txt.bk
Files student_file_v2.txt and student_file_v2.txt.bk are identical
Discard the output and inspect $?
The -s and -q options display the status as a message when the files are identical or different, respectively. However, we will not know in advance whether the files are going to match. If we do not need the actual differences, we can redirect the output of diff to /dev/null, a special device file that silently discards whatever is written to it; it is often called the bit bucket or black hole.
# discard differences
$ diff student_file_v1.txt student_file_v2.txt > /dev/null
# echo $?: 0=identical, 1=differences
$ echo $?
1
# compare original and backup; no messages if files are identical
$ diff student_file_v2.txt student_file_v2.txt.bk
# echo $?: 0=identical, 1=differences
$ echo $?
0
diff -i: case-insensitive comparison
As we saw in the first example, the diff command without any options performs a case-sensitive comparison. At times, the only difference between two files might be the case of the letters; for example, version 1 of a file might have names in title case while version 2 has them in uppercase. If we want to find the real differences, such as extra or missing records or differences in fields, we can use the -i option to perform a case-insensitive comparison.
# records 1, 4 and 7 matched with -i option
$ diff -i student_file_v1.txt student_file_v2.txt
7a8
> S00008 Molina William 21-Sep-2011 MATH 05 57 Z
diff -N: Dealing with File Not Found issues
By default, if one of the files passed as an argument does not exist, the diff command displays an error message and sets return code 2. The -N option instructs diff to treat a non-existent file as an empty file.
$ diff student_file_v4.txt student_file_v1.txt
diff: student_file_v4.txt: No such file or directory
$ diff -N student_file_v4.txt student_file_v1.txt
0a1,7
> S00001 HUANG AMANDA 30-NOV-2010 MATH 06 58 Z
> S00002 Sherman Cody 23-May-2009 MATH 07 89 Z
diff -b: Dealing with leading and trailing blanks
Let us extract the last names from the student_file_v*.txt files and compare them. We will cut 11 bytes from one file and 10 bytes from the other, then run diff with the -b option to ignore the whitespace differences.
# add an empty line and extract last names from v1 file
$ cut -c8-18 student_file_v1.txt > names_v1.txt
# extract last names from v2 file and add an empty line
$ cut -c8-17 student_file_v2.txt > names_v2.txt
# display one record from each file and show newline as $
$ head -1 names_v* | cat -E
==> names_v1.txt <==$
HUANG $
==> names_v2.txt <==$
Huang $
# compare case-insensitive and ignore leading/trailing blanks
# one extra record "Molina" in file 2 and
# one blank rec each
$ diff -bi names_v1.txt names_v2.txt
1d0
<
8a8,10
> Molina
>
diff -y: side by side comparison
We can use diff -y or sdiff to compare two files and display the matched and unmatched records side by side. Unlike the default diff, the -y option displays both matched and unmatched records. If we do not want to see the matched records, we can use the --suppress-common-lines option along with -y.
$ diff -y names_v1.txt names_v2.txt
| Huang
HUANG | Sherman
Sherman | Gordon
Gordon | Gomez
GOMEZ | Cook
Cook | Daugherty
Daugherty | Macias
MACIAS |
> Molina
>
# suppress trailing blanks and case-insensitive compare
# discard matched records
# one blank records in each file and one extra record
$ diff -b -y -i --suppress-common-lines names_v1.txt names_v2.txt
<
>
> Molina
comm: compare 2 sorted files side-by-side
The comm command is a good choice for comparing files with short records, like name lists. It compares the files and displays 3 columns of output: column 1 contains records unique to FILE 1, column 2 records unique to FILE 2, and column 3 the records common to both files.
| Option | Description |
|---|---|
| -1 | suppress unique records from FILE 1 |
| -2 | suppress unique records from FILE 2 |
| -3 | suppress records common to FILE 1 and FILE 2 |
| --total | print a summary: unique record counts for file 1 and file 2, and the common record count |
| --output-delim=CHAR | override the output delimiter; TAB is the default |
Examples:
Let us tweak the names_v1.txt and names_v2.txt files a bit by keeping 3 common records and 2 unique records each in sorted order.
$ cat names_v1.txt
Cook
Daugherty
Gomez
Gordon
Huang
Macias
$ cat names_v2.txt
Gomez
Gordon
Huang
Macias
Molina
Sherman
comm: basic comparison
$ comm names_v1.txt names_v2.txt
Cook
Daugherty
                Gomez
                Gordon
                Huang
                Macias
        Molina
        Sherman
comm: suppress columns
# show only unique records from FILE 1
$ comm -2 -3 names_v1.txt names_v2.txt
Cook
Daugherty
# show only unique records from FILE 2
$ comm -1 -3 names_v1.txt names_v2.txt
Molina
Sherman
# show common records between FILE 1 and FILE 2
$ comm -1 -2 names_v1.txt names_v2.txt
Gomez
Gordon
Huang
Macias
comm --total: display comparison summary
Even if we suppress columns, the summary line always shows the count of unique records from file 1 and file 2 and the count of common records.
$ comm --total names_v1.txt names_v2.txt
Cook
Daugherty
                Gomez
                Gordon
                Huang
                Macias
        Molina
        Sherman
2 2 4 total
# show only unique records from FILE 1 and summary
$ comm --total -2 -3 names_v1.txt names_v2.txt
Cook
Daugherty
2 2 4 total
comm: display only summary
$ comm --total -1 -2 -3 names_v1.txt names_v2.txt
2 2 4 total
Override Output delimiter
# columns are now separated by commas instead of TABs;
# rows with unique records from FILE 1 have no leading delimiter
$ comm --output-delimiter=, names_v1.txt names_v2.txt
Cook
Daugherty
,,Gomez
,,Gordon
,,Huang
,,Macias
,Molina
,Sherman
Get unique records from both files
The comm command creates a 3-column output with TAB as the column delimiter. If we pipe the output of comm into tr to delete the TAB characters, all 3 columns collapse into one. To get the unique rows, we can further sort and uniq the output.
In order to demonstrate this, I have added some duplicate rows in both files
# common and unique rows merged as one column (with duplicates)
$ comm names_v1.txt names_v2.txt | tr -d '\t'
Cook
Daugherty
Gomez
Gordon
Gordon
Gordon
Huang
Macias
Macias
Macias
Molina
Sherman
# sort the output and remove duplicates
# instead of uniq we can use 'sort' with -u'
$ comm names_v1.txt names_v2.txt | tr -d '\t' | sort | uniq
Cook
Daugherty
Gomez
Gordon
Huang
Macias
Molina
Sherman
checksum
We have discussed the diff, comm and cmp commands for comparing files. At times, we are just interested in whether the files are identical, without worrying about the actual differences. This can be accomplished by checking the return code with echo $? right after running a comparison command, where return code 0 means identical and 1 means differences were found.
Another way is to compute a checksum, also known as a digest or hash value, of each file. The same checksum value means the files are identical (hash collisions are theoretically possible but vanishingly rare in practice). The commands cksum, md5sum and shasum can be used to get checksum values; each uses a different hashing algorithm, with cksum being the oldest and shasum the latest.
cksum displays the checksum, file size and file name. md5sum and shasum display just the checksum and file name.
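md5sum and shasum can also verify files against previously recorded checksums using the -c option; a minimal sketch, where sums.md5 is just an example file name.
# record a checksum now, verify the file later
$ md5sum names_v1.txt > sums.md5
$ md5sum -c sums.md5
names_v1.txt: OK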
Let us compare the names_v[12].txt files. We will make a copy of v2 as v3 for identical comparison.
Examples:
$ cp -v names_v2.txt names_v3.txt
'names_v2.txt' -> 'names_v3.txt'
# using cksum: v2 and v3 have the same checksum
$ cksum names_v1.txt names_v2.txt names_v3.txt
1628559629 108 names_v1.txt
3965335266 96 names_v2.txt
3965335266 96 names_v3.txt
# same as cksum: v2 and v3 have the same checksum
# md5sum is far less collision-prone than cksum
$ md5sum names_v*
9dbff2393937f0724cd532f83756955b names_v1.txt
9cbf9656ccd27ca37b5a3be7b5b4de71 names_v2.txt
9cbf9656ccd27ca37b5a3be7b5b4de71 names_v3.txt
# shasum is stronger than md5sum;
# we can even create longer digests using options
# v2 and v3 have the same checksum
$ shasum names_v*
d567ba96fc3aa36c28708f9c8075cdf607bd3ea8 names_v1.txt
a44bf3187d86a30e5273e939035d9d916b10568f names_v2.txt
a44bf3187d86a30e5273e939035d9d916b10568f names_v3.txt
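By default shasum uses SHA-1, which produces a 40-hex-digit digest; the -a option selects a stronger algorithm such as SHA-256, which produces a longer one.
# -a 256: use SHA-256; identical files still produce identical digests
$ shasum -a 256 names_v2.txt names_v3.txt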
shuf: shuffle records / arguments
The shuf command can be used to randomize the records of a file or the arguments passed to the command. It comes in handy when we need to pick random records or words (arguments delimited by spaces). The shuffled records are written to stdout. Options allow us to limit the number of records, duplicate records or write the output to a file.
| Option | Description |
|---|---|
| -e | treat each argument as an input record |
| -n N | limit the number of output records |
| -r | repeat (duplicate) records; use along with -n |
| -o FILE | write the output to FILE |
Examples:
shuf -e ARG(s): Randomize inputs passed as arguments
$ shuf -e abcd efgh ijkl mnop qrstu vwxyz
ijkl
abcd
efgh
vwxyz
mnop
qrstu
shuf -n N: Limit output records
$ shuf -n3 -e abcd efgh ijkl mnop qrstu vwxyz
qrstu
ijkl
efgh
shuf -n N -r: repeat input records and limit records
# duplicates: rec 1 and 4, 2 and 10, 5 and 8, 6 and 7
$ shuf -n10 -r -e abcd efgh ijkl mnop qrstu vwxyz | nl
1 mnop
2 qrstu
3 abcd
4 mnop
5 vwxyz
6 ijkl
7 ijkl
8 vwxyz
9 efgh
10 qrstu
shuf -o FILE: shuffle and write to file
# displays nothing, use `cat FILE` to inspect output
$ shuf -n3 -o shuffle.txt -e abcd efgh ijkl mnop qrstu vwxyz
$ cat shuffle.txt
mnop
abcd
ijkl
shuf: randomize records from a file
Let us add line numbers to names_v1.txt using the nl command and run it through the shuf command.
DIY: Try other options discussed above with
shufand FILE
$ nl names_v1.txt > names.txt
# original file with record numbers
$ cat names.txt
1 Cook
2 Daugherty
3 Gomez
4 Gordon
5 Huang
6 Macias
# shuffled records
$ shuf names.txt
4 Gordon
1 Cook
6 Macias
2 Daugherty
5 Huang
3 Gomez
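A practical sketch to close the section: pick a few random data records from student.csv, using tail to skip the header before shuffling.
# 3 random records, header excluded
$ tail -n +2 student.csv | shuf -n 3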
Fake Data Creation Script
Install the faker module from pip before running this script. This document does not support copy-paste very well; you may have to type the script in or create the sample files yourself.
from faker import Faker
import random
f = Faker()
subjects = ('MATH', 'SCIE', 'SOCS')
cf = open('student.csv', 'w')
ff = open('student.txt', 'w')
print("Id,LastName,FirstName,DOB,grade,subject,score,En",file=cf)
for i in range(1, 101):
stu_id = f"S{i:05}"
lname = f.last_name()
fname = f.first_name()
birth_year = random.randint(2009, 2013)
birth_ddmmm = f.date("%d-%b")
dob = f"{birth_ddmmm}-{birth_year}"
grade = f"{(2021 - birth_year - 5):02}"
for subj in subjects:
scr = random.randrange(51, 100)
print(f"{stu_id} {lname:10} {fname:10} {dob} {subj} {grade} {scr:3} Z", file=ff)
print(stu_id, lname, fname, subj, dob, grade, scr, "Z", sep=",", file=cf)
ff.close()
cf.close()