Remove part of the column name of a dataframe using a regular expression in Python
Question:
I have a dataframe "counts" and I would like to rename its second column using a regular expression, because I have multiple files whose column names carry this "extra information". The dataframe looks like this:
| GeneID | /home/rmachado/Biotec/ARJNA231684/mapa_fin_starterar/SRR1212121_mapped.bamAligned.sortedByCoord.out.bam |
| -------- | -------------- |
| Ciclev10010164m.g.v1.0 | 2 |
| Ciclev10007306m.g.v1.0 | 647 |
| Ciclev10009318m.g.v1.0 | 39 |
| Ciclev... | ... |
| Ciclev10007306m.g.v1.0 | 112 |
I tried the following code, with no success:
for col in counts1:
    counts1.rename(columns={col:col.upper().replace("/home/rmachado/Biotec/ARJNA231684/mapa_fin_starterar/SRR1212121_mapped.bamAligned.sortedByCoord.out.bam","SRR[d]{6}")},inplace=True)
How can I obtain a df with the following format?
| GeneID | SRR1212121 |
| -------- | -------------- |
| Ciclev10010164m.g.v1.0 | 2 |
| Ciclev10007306m.g.v1.0 | 647 |
| Ciclev10009318m.g.v1.0 | 39 |
| Ciclev... | ... |
| Ciclev10007306m.g.v1.0 | 112 |
Answers:
You could try:
df.columns = df.columns.str.extract(r'((?<=/)SRR\d+|^[^/]+$)', expand=False)
regex:
(?<=/)SRR\d+ # match SRR + digits if preceded by "/"
^[^/]+$ # else match full string if it doesn't contain "/"
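As a quick check, here is a minimal sketch that applies the pattern to a small dataframe shaped like the one in the question. The sample frame and the variable names `long_name` and `pattern` are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample mirroring the question's table (first three rows only).
long_name = (
    "/home/rmachado/Biotec/ARJNA231684/mapa_fin_starterar/"
    "SRR1212121_mapped.bamAligned.sortedByCoord.out.bam"
)
counts = pd.DataFrame(
    {
        "GeneID": [
            "Ciclev10010164m.g.v1.0",
            "Ciclev10007306m.g.v1.0",
            "Ciclev10009318m.g.v1.0",
        ],
        long_name: [2, 647, 39],
    }
)

# One capture group: either "SRR" + digits right after a "/",
# or the whole name when it contains no "/" at all (e.g. "GeneID").
pattern = r"((?<=/)SRR\d+|^[^/]+$)"

# With a single group and expand=False, str.extract returns an Index,
# which can be assigned straight back to the column labels.
counts.columns = counts.columns.str.extract(pattern, expand=False)

print(list(counts.columns))  # ['GeneID', 'SRR1212121']
```

Note that a column name that contains "/" but no SRR accession would not match either alternative and would come back as NaN, so it may be worth checking the new labels before overwriting the old ones.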