How to transpose unique entries in a column and print values from another column
Question:
I have a file input.txt with two columns. I want to split the second column on ";", transpose the unique entries, and then count and list the matching entries from column 1.
This is my tab-delimited input.txt file:
Gene Biological_Process
BALF2 metabolic process
CHD4 cell organization and biogenesis;metabolic process;regulation of biological process
TCOF1 cell organization and biogenesis;regulation of biological process;transport
TOP1 cell death;cell division;cell organization and biogenesis;metabolic process;regulation of biological process;response to stimulus
BcLF1 0
BALF5 metabolic process
MTA2 cell organization and biogenesis;metabolic process;regulation of biological process
MSH6 cell organization and biogenesis;metabolic process;regulation of biological process;response to stimulus
My expected output:
Biological_Process Gene
metabolic process BALF2 CHD4 TOP1 BALF5 MTA2 MSH6
cell organization and biogenesis CHD4 TCOF1 TOP1 MTA2 MSH6
regulation of biological process CHD4 TCOF1 TOP1 MTA2 MSH6
transport TCOF1
cell death TOP1
cell division TOP1
response to stimulus TOP1 MSH6
Answers:
You'll need to parse all the data first. Start with an empty dictionary, open your file, and iterate over each line (skip line 0 if it's a header). For every entry in the columns after the first, create a dictionary key for that string and add the string from column 0 to its value, using string methods like split and strip, i.e. dict[process...] = gene.... Then print/write out each pair from the dict's .items():
input.txt
gene process
A cell org bio
B cell bio
C 0
D org
script.py
#!/usr/bin/env python3
def main():
    pros = {}  # maps each process to the list of genes that have it
    with open("input.txt", "r") as ifile:
        for line in ifile:
            cols = line.strip().split()
            if len(cols) >= 1:  # ignore blank lines
                for pro in cols[1:]:
                    if pro not in pros:
                        pros[pro] = []
                    pros[pro] += [cols[0]]
    with open("output.txt", "w") as ofile:
        for key, val in pros.items():
            # one tab-separated row per process: process, then its genes
            ofile.write(f"{key}\t" + "\t".join(val) + "\n")

if __name__ == "__main__":
    main()
run
$ chmod +x ./script.py
$ ./script.py
$ cat ./output.txt
output.txt
process gene
cell A B
org A D
bio A B
0 C
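The script above splits on any whitespace, which suits the simplified input but would break up multi-word names such as "metabolic process" in the original file. A minimal sketch of the same dictionary approach for the tab-delimited, ";"-separated format from the question follows. The function name group_genes_by_process, the choice to treat "0" as "no annotation", and the extra count column are assumptions for illustration, not part of the answer above.

```python
def group_genes_by_process(lines, skip_values=("", "0")):
    """Map each process to the genes listed for it, in order of first appearance.

    Assumes the first line is a header and that each data line is
    'gene<TAB>proc1;proc2;...'. Values in skip_values (assumed here to mean
    "no annotation") are ignored.
    """
    groups = {}
    for line_no, line in enumerate(lines):
        line = line.rstrip("\n")
        if line_no == 0 or not line.strip():
            continue  # skip the header and blank lines
        gene, _, processes = line.partition("\t")
        for process in processes.split(";"):
            process = process.strip()
            if process in skip_values:
                continue
            genes = groups.setdefault(process, [])
            if gene not in genes:  # keep each gene once per process
                genes.append(gene)
    return groups


# A few rows taken from the question's input.txt, written with explicit tabs.
sample = (
    "Gene\tBiological_Process\n"
    "BALF2\tmetabolic process\n"
    "TCOF1\tcell organization and biogenesis;regulation of biological process;transport\n"
    "BcLF1\t0\n"
    "BALF5\tmetabolic process\n"
)

groups = group_genes_by_process(sample.splitlines())
for process, genes in groups.items():
    # columns: process, number of genes, then the genes themselves
    print(process, len(genes), *genes, sep="\t")
```

Because Python dicts keep insertion order, the processes come out in the order they first appear in the file, which is the order used in the expected output of the question. To read the real file, pass the open file object instead of sample.splitlines().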
$ cat script.awk
#! /usr/bin/awk -f
BEGIN {
    FS = "[\t;]"    # sep can be a regex
    OFS = "\t"
}
NR>1 && /^[A-Z]/ {    # skip header & blank lines
    for (i=NF; i>1; i--)
        if ($i)    # skip empty or "0" bio-proc
            a[$i] = a[$i] OFS $1
}
END {
    print "Biological_Process", "Gene(s)"
    for (x in a)    # note: iteration order is unspecified
        print x a[x]
}
$ ./script.awk input.txt
Biological_Process Gene(s)
cell death TOP1
regulation of biological process CHD4 TCOF1 TOP1 MTA2 MSH6
transport TCOF1
cell division TOP1
metabolic process BALF2 CHD4 TOP1 BALF5 MTA2 MSH6
response to stimulus TOP1 MSH6
cell organization and biogenesis CHD4 TCOF1 TOP1 MTA2 MSH6