awk + adding a column based on values of another column + adding a field name in one command

Question

I want to add a new column at the end, based on the text of another column (with an if statement), and then I want to give the new column a field name.
I am close but struggling with the syntax. I am using awk, but apologies, it's been a while since I used it. I am wondering if I should use Python/Anaconda (Jupyter notebook) instead, but I am going with the easiest environment I have available at the minute, awk.

This is my file:

$ cat file1
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row2_1,row2_2,row2_3,AWERF,row2_5
row3_1,row3_2,row3_3,ASDFG,row3_5

Here I want to create a new column at the end, based on the text in column 4. I am winging this a bit, but I got it to work.

  $ awk -F, '{if (substr($4,1,1)=="A")
print $0 (NR>1 ? FS substr($4,1,4) : "")
else
print $0 (NR>1 ? FS substr($4,1,2) : "")
}' file1
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF

But here I want to add a field/column name at the end, and I believe I am close.

   $ awk -F, -v OFS=, 'NR==1{ print $0, "test"}
NR>1
{
if (substr($4,1,1)=="A")
print $0 (NR>1 ? FS substr($4,1,4) : "")
else
print $0 (NR>1 ? FS substr($4,1,2) : "")
}
' file1
f1,f2,f3,f4,f5,test
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF

What I want is this:

f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF

EDIT1

for my ref: this is the awk I want:

awk -F, '{if (substr($4,1,1)=="P")
print $0 (NR>1 ? FS substr($4,5,4) : "")
else
print $0 (NR>1 ? FS substr($4,1,4) : "")
}' file1

outputting it to file2:

awk -F, '{if (substr($4,1,1)=="P")
print $0 (NR>1 ? FS substr($4,5,4) : "")
else
print $0 (NR>1 ? FS substr($4,1,4) : "")
}' file1 > file2
$
$

Two files; file2 has the extra column added:

$ls
file1  file2
$cat file1
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row2_1,row2_2,row2_3,AWERF,row2_5
row3_1,row3_2,row3_3,ASDFG,row3_5
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5
$cat file2
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ

EDIT2 — Correction

file 2 is what I want:

cat file1
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row2_1,row2_2,row2_3,AWERF,row2_5
row3_1,row3_2,row3_3,ASDFG,row3_5
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5



cat file2
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ


awk -F, -v OFS=, 'NR==1{ print $0, "test"}
NR>1 {
if (substr($4,1,1)=="P")
print $0 (NR>1 ? FS substr($4,5,4) : "")
else
print $0 (NR>1 ? FS substr($4,1,4) : "")
}
' file1
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ

Asked By: HattrickNZ

Answer 1

You may use this awk command that removes any carriage return if present from each line before computing value of the last column:

awk 'BEGIN {FS=OFS=","} 
{
   sub(/\r$/, "")
   print $0, (NR==1 ? "test" : (substr($4,1,1)=="A" ? substr($4,1,4) : substr($4,1,2)))
}' file

f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
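
To see what the carriage-return handling buys you, here is a minimal sketch. It assumes a hypothetical two-line input with Windows (CRLF) line endings, generated by a helper I have called crlf_input, and feeds it through the same logic as the command above. Without the sub() call the \r would stay at the end of $0, so the appended field would land after it.

```shell
# Hypothetical CRLF input: the header plus one data row from the question's file.
crlf_input() { printf 'f1,f2,f3,f4,f5\r\nrow2_1,row2_2,row2_3,AWERF,row2_5\r\n'; }

# Strip a trailing \r from each line, then append the new last field.
fixed=$(crlf_input | awk 'BEGIN { FS = OFS = "," }
{
    sub(/\r$/, "")
    print $0, (NR == 1 ? "test" : (substr($4,1,1) == "A" ? substr($4,1,4) : substr($4,1,2)))
}')
printf '%s\n' "$fixed"
```

The output lines end in ,test and ,AWER with no stray carriage return in front of the new field.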

Answered By: anubhava

Answer 2

awk one-liner :

gawk '$++NF =(_^=!_)==NR ? "test" : substr(__=$4,_++,_^_^(__~"^A"))' FS=, OFS=,

f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF

to deal with Windows/DOS files, do this instead :

mawk 'BEGIN {    RS ="\r?\n"; ___ = "test"; OFS = FS = ","
       _++  } $++NF = _==NR ? ___ : substr(__=$4,_,++_^_--^(__~"^A"))'

This works because the regex match (1 when $4 starts with A, 0 otherwise) selects whether to take 2^2^1 or 2^2^0, which works out to 4 and 2 respectively.
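
The arithmetic can be checked on its own. This is a minimal sketch, not part of the one-liner above; it assumes only that awk's ^ operator is right-associative and that a regex match evaluates to 1 or 0. The two sample strings are taken from the question's file.

```shell
# ^ is right-associative: 2^2^1 is 2^(2^1) = 4, and 2^2^0 is 2^(2^0) = 2.
# ($1 ~ /^A/) evaluates to 1 when the string starts with A, otherwise 0.
lengths=$(printf '%s\n' AWERF SBCDE | awk '{ print $1, 2^2^($1 ~ /^A/) }')
printf '%s\n' "$lengths"
```

This prints AWERF 4 and SBCDE 2, which are the substring lengths the question asks for.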

Answered By: RARE Kpop Manifesto

Answer 3

Python solution using only the standard library. Let the content of file.csv be

f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row2_1,row2_2,row2_3,AWERF,row2_5
row3_1,row3_2,row3_3,ASDFG,row3_5

then

import csv
with open('file.csv', newline='') as infile:
    with open('fileout.csv', 'w', newline='') as outfile:
        reader = csv.DictReader(infile)
        outfields = reader.fieldnames + ['test']
        writer = csv.DictWriter(outfile, outfields)
        writer.writeheader()
        for row in reader:
            row['test'] = row['f4'][:4] if row['f4'][0] == 'A' else row['f4'][:2]
            writer.writerow(row)

creates (or overwrites, if it already exists) the file fileout.csv with the following content

f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF

Explanation: I am using csv.DictReader and csv.DictWriter from the csv module. First I create two context managers (with open…) with newline='' as suitable for the reader and writer (see the csv module documentation for further explanation); infile is opened for reading (the default), whilst outfile is opened for writing ('w'). I use reader to parse infile and concatenate its fieldnames (column names) with a list holding the single element 'test' to get the fieldnames for the output file, then I write the header (column names). For each data row in the input file I compute the value for test using the ternary operator (note it is value_if_true if condition else value_if_false, which is a different order from GNU AWK's condition ? value_if_true : value_if_false) and string slicing ([:n] means take the first n characters), and insert it into the row dict, which is then written.

(tested in Python 3.8.10)

Answered By: Daweo

Answer 4

Newlines matter. Change:

NR>1
{

to

NR>1 {

As written you have 2 independent statements equivalent to:

NR>1 { print }
<true condition> {
    if (whatever) print foo; else print bar
}

instead of what you intended:

NR>1 {
    if (whatever) print foo; else print bar
}
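
The difference is easy to reproduce. Below is a small sketch using a hypothetical two-line input (h standing in for the header, d for a data row) and an action that appends ! so the two rules can be told apart:

```shell
# Pattern and opening brace on separate lines: awk sees two rules.
# "NR>1" alone prints the record; the bare { ... } block runs for every record.
split=$(printf 'h\nd\n' | awk 'NR>1
{ print $0 "!" }')

# Pattern and opening brace on the same line: one rule, run only for NR>1.
joined=$(printf 'h\nd\n' | awk 'NR>1 { print $0 "!" }')

printf 'split:\n%s\njoined:\n%s\n' "$split" "$joined"
```

The split version prints h!, d and d!, which is the same doubling seen in the question's output, while the joined version prints only d!.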

Having said that, try this instead of what you have:

awk '
    BEGIN { FS=OFS="," }
    NR == 1 { x = "test" }
    NR > 1  { x = substr( $4, 1, ($4 ~ /^A/ ? 4 : 2) ) }
    { print $0, x }
' file1
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF

Same functionality, just more concise.

Rename x to some mnemonic of whatever you really intend that last column to represent.

Answered By: Ed Morton
