awk: adding a column based on the values of another column, and adding a field name, in one command
Question:
I want to add a new column at the end, based on the text of another column (with an if statement), and then I want to add a name for the new column/field.
I am close but I am struggling with the syntax. I am using awk, but apologies, it's been a while since I used it. I am wondering if I should use python/anaconda (jupyter notebook), but I am going with the easiest environment I have available at the minute, awk.
This is my file:
$ cat file1
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row2_1,row2_2,row2_3,AWERF,row2_5
row3_1,row3_2,row3_3,ASDFG,row3_5
Here I want to create a new column at the end, based on the text in column 4. I am winging this a bit, but I got it to work:
$ awk -F, '{if (substr($4,1,1)=="A")
print $0 (NR>1 ? FS substr($4,1,4) : "")
else
print $0 (NR>1 ? FS substr($4,1,2) : "")
}' file1
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
But here I also want to add a field/column name for the new column. I believe I am close:
$ awk -F, -v OFS=, 'NR==1{ print $0, "test"}
NR>1
{
if (substr($4,1,1)=="A")
print $0 (NR>1 ? FS substr($4,1,4) : "")
else
print $0 (NR>1 ? FS substr($4,1,2) : "")
}
' file1
f1,f2,f3,f4,f5,test
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
What I want is this:
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
EDIT1
for my ref: this is the awk I want:
awk -F, '{if (substr($4,1,1)=="P")
print $0 (NR>1 ? FS substr($4,5,4) : "")
else
print $0 (NR>1 ? FS substr($4,1,4) : "")
}' file1
outputting it to file2:
awk -F, '{if (substr($4,1,1)=="P")
print $0 (NR>1 ? FS substr($4,5,4) : "")
else
print $0 (NR>1 ? FS substr($4,1,4) : "")
}' file1 > file2
Two files; file2 has the extra column added:
$ls
file1 file2
$cat file1
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row2_1,row2_2,row2_3,AWERF,row2_5
row3_1,row3_2,row3_3,ASDFG,row3_5
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5
$cat file2
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
EDIT2 — Correction
file 2 is what I want:
cat file1
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row2_1,row2_2,row2_3,AWERF,row2_5
row3_1,row3_2,row3_3,ASDFG,row3_5
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5
cat file2
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
awk -F, -v OFS=, 'NR==1{ print $0, "test"}
NR>1 {
if (substr($4,1,1)=="P")
print $0 (NR>1 ? FS substr($4,5,4) : "")
else
print $0 (NR>1 ? FS substr($4,1,4) : "")
}
' file1
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
Answers:
You may use this awk command, which removes a carriage return, if present, from each line before computing the value of the last column:
awk 'BEGIN {FS=OFS=","}
{
   sub(/\r$/, "")
   print $0, (NR==1 ? "test" : (substr($4,1,1)=="A" ? substr($4,1,4) : substr($4,1,2)))
}' file
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
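To illustrate why the carriage return needs to go before the new column is appended, here is a minimal Python sketch of the same per-line logic. The helper name add_test_column and the sample lines are made up for illustration; they are not part of the answer above. Without the strip, the new value would land after the "\r" instead of directly after the fifth field.

```python
def add_test_column(line, is_header):
    """Append the new column to one CSV line, dropping a trailing CR first."""
    line = line.rstrip("\n")
    # Rough equivalent of awk's sub(/\r$/, ""): remove one trailing carriage return.
    if line.endswith("\r"):
        line = line[:-1]
    if is_header:
        return line + ",test"
    f4 = line.split(",")[3]  # fourth comma-separated field
    return line + "," + (f4[:4] if f4.startswith("A") else f4[:2])


# Hypothetical DOS-style input (CRLF line endings).
dos_lines = [
    "f1,f2,f3,f4,f5\r\n",
    "row1_1,row1_2,row1_3,SBCDE,row1_5\r\n",
    "row2_1,row2_2,row2_3,AWERF,row2_5\r\n",
]
result = [add_test_column(line, i == 0) for i, line in enumerate(dos_lines)]
print("\n".join(result))
```

This naive split on "," assumes no quoted fields containing commas, the same assumption the awk command makes.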
An awk one-liner:
gawk '$++NF =(_^=!_)==NR ? "test" : substr(__=$4,_++,_^_^(__~"^A"))' FS=, OFS=,
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
To deal with Windows/DOS files, do this instead:
mawk 'BEGIN { RS ="\r?\n"; ___ = "test"; OFS = FS = ","
_++ } $++NF = _==NR ? ___ : substr(__=$4,_,++_^_--^(__~"^A"))'
This works because the regex match (1 if $4 starts with "A", 0 otherwise) selects whether to take 2^2^1 or 2^2^0, which works out to 4 and 2 respectively.
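The arithmetic behind that length expression can be checked in isolation. The short Python sketch below is only an illustration of the exponent trick, not a translation of the one-liner; it relies on ** in Python being right-associative, as ^ is in awk, so 2 ** 2 ** flag is read as 2 ** (2 ** flag).

```python
# flag stands in for the result of the regex test: 1 on a match, 0 otherwise.
lengths = {}
for flag in (1, 0):
    # Right-associative: 2 ** (2 ** flag), so 2**2 = 4 or 2**1 = 2.
    lengths[flag] = 2 ** 2 ** flag
    print(flag, lengths[flag])
```

So a value starting with "A" gets a 4-character prefix and everything else gets a 2-character prefix.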
A python solution using solely the standard library. Let the content of file.csv be
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row2_1,row2_2,row2_3,AWERF,row2_5
row3_1,row3_2,row3_3,ASDFG,row3_5
then
import csv
with open('file.csv', newline='') as infile:
    with open('fileout.csv', 'w', newline='') as outfile:
        reader = csv.DictReader(infile)
        outfields = reader.fieldnames + ['test']
        writer = csv.DictWriter(outfile, outfields)
        writer.writeheader()
        for row in reader:
            row['test'] = row['f4'][:4] if row['f4'][0] == 'A' else row['f4'][:2]
            writer.writerow(row)
This creates (or overwrites, if it already exists) the file fileout.csv with the following content
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
Explanation: I am using csv.DictReader and csv.DictWriter from csv. First I create two context managers (with open …) with a newline setting suitable for the reader and writer (see the linked docs for further explanation); infile is opened for reading (the default), whilst outfile is opened for writing (w). I use reader to parse infile, and I concatenate its fieldnames (column names) with a list holding the single element test to get the fieldnames for the output file. Then I output the header (column names). For each data row in the input file I compute the value for test using the ternary operator (observe it is valueiftrue if condition else valueiffalse, which is a different order than GNU AWK's condition ? valueiftrue : valueiffalse) and string slicing ([:n] means take the first n characters), and insert that into the row dict, which is then written.
(tested in Python 3.8.10)
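The same approach can be adapted to the rule from the question's EDIT2 (values starting with "P" take characters 5 to 8, everything else takes the first four). The sketch below is one possible way to write it, assuming the column names from the question; the function name add_test_column and the in-memory sample are hypothetical and used only so the example runs without touching the file system.

```python
import csv
import io


def add_test_column(infile, outfile):
    """Copy CSV rows from infile to outfile, appending a 'test' column."""
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, reader.fieldnames + ["test"],
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        f4 = row["f4"]
        # startswith() is safe on an empty string, unlike f4[0].
        # f4[4:8] matches awk's substr($4,5,4); f4[:4] matches substr($4,1,4).
        row["test"] = f4[4:8] if f4.startswith("P") else f4[:4]
        writer.writerow(row)


sample = (
    "f1,f2,f3,f4,f5\n"
    "row1_1,row1_2,row1_3,SBCDE,row1_5\n"
    "row4_1,row4_2,row4_3,PRE-ASDQG,row4_5\n"
)
out = io.StringIO()
add_test_column(io.StringIO(sample), out)
print(out.getvalue(), end="")
```

With real files, the two io.StringIO objects would be replaced by file objects opened with newline='' as in the answer above.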
Newlines matter. Change:
NR>1
{
to
NR>1 {
As written you have 2 independent statements equivalent to:
NR>1 { print }
<true condition> {
if (whatever) print foo; else print bar
}
instead of what you intended:
NR>1 {
if (whatever) print foo; else print bar
}
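To make the effect of the stray newline concrete, here is a rough Python model of how awk applies pattern/action rules: every rule is tried against every line, and a rule with no pattern always fires. This is only a simplified sketch under that assumption; run_rules, add_col, and the rule lists are hypothetical names, not awk internals.

```python
def run_rules(lines, rules):
    """Apply every (pattern, action) rule to every line, in order."""
    out = []
    for nr, line in enumerate(lines, start=1):
        for pattern, action in rules:
            if pattern(nr):
                out.append(action(nr, line))
    return out


def add_col(nr, line):
    """Mimic the question's action: append a prefix of field 4 when NR > 1."""
    if nr == 1:
        return line
    f4 = line.split(",")[3]
    return line + "," + (f4[:4] if f4.startswith("A") else f4[:2])


def header(nr, line):
    return line + ",test"


def plain_print(nr, line):
    return line


lines = [
    "f1,f2,f3,f4,f5",
    "row1_1,row1_2,row1_3,SBCDE,row1_5",
    "row2_1,row2_2,row2_3,AWERF,row2_5",
]

# "NR>1" on its own line: a bare pattern (default print) plus a pattern-less block.
broken = [(lambda nr: nr == 1, header),
          (lambda nr: nr > 1, plain_print),
          (lambda nr: True, add_col)]
# "NR>1 {" on one line: the block only runs for data rows.
fixed = [(lambda nr: nr == 1, header),
         (lambda nr: nr > 1, add_col)]

broken_out = run_rules(lines, broken)
fixed_out = run_rules(lines, fixed)
print("\n".join(broken_out))
print("---")
print("\n".join(fixed_out))
```

The broken rule list reproduces the duplicated lines shown in the question, while the fixed one yields one output line per input line.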
Having said that, try this instead of what you have:
awk '
BEGIN { FS=OFS="," }
NR == 1 { x = "test" }
NR > 1 { x = substr( $4, 1, ($4 ~ /^A/ ? 4 : 2) ) }
{ print $0, x }
' file1
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
Same functionality, just more concise. Rename x to some mnemonic of whatever you really intend that last column to represent.