Removing elements of dataframe with different number of columns
Question:
I have a tsv file that looks like this:
BCM92732.1 sialidase Abditibacteriota bacterium
VTR99890.1 sialidase : Sialidase Precursor OS=Rhodopirellula baltica (strain SH1) GN=RB3353 PE=4 SV=1: BNR_2 Tuwongella immobilis
QEL17956.1 putative retaining sialidase Limnoglobus roseus
AMV31440.1 Sialidase precursor Pirellula sp. SH-Sr6A
I want to "clean" this file by removing some columns, for example:
BCM92732.1 Abditibacteriota bacterium
VTR99890.1 Tuwongella immobilis
QEL17956.1 Limnoglobus roseus
AMV31440.1 Pirellula sp. SH-Sr6A
I thought about removing columns; however, the number of columns differs between rows.
I'm using bash for this task.
Is there a better way to do it? For example, with Python or Perl?
Answers:
Here is one approach that gives the correct answer on the example data. It matches the trailing fields of each row against the scientific names in the NCBI taxonomy database:
# Create a working dir
mkdir -p taxdump
cd taxdump
# Create your 'example.tsv' file
cat <<EOF > example.tsv
BCM92732.1 sialidase Abditibacteriota bacterium
VTR99890.1 sialidase : Sialidase Precursor OS=Rhodopirellula baltica (strain SH1) GN=RB3353 PE=4 SV=1: BNR_2 Tuwongella immobilis
QEL17956.1 putative retaining sialidase Limnoglobus roseus
AMV31440.1 Sialidase precursor Pirellula sp. SH-Sr6A
EOF
# Download and unpack the NCBI taxonomy database (~50 MB)
curl "https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz" -o taxdump.tar.gz
tar -zxvf taxdump.tar.gz
# Grab the second column (the scientific names) and strip the tabs
# that surround each "|"-delimited field in names.dmp
awk '
BEGIN{FS="|"}
$0 ~ "scientific name" {
    gsub("\t", "", $2)
    sub("^ ", "", $2)
    print $2
}' names.dmp > bacteria_names
# Use AWK to cycle through bacteria_names (i.e. every species in the db)
# and check for matches with your tsv file ("example.tsv").
# Start by looking at the last four columns of example.tsv,
# then the last three columns, then the last two columns.
awk '
NR==FNR {
    four_names[$1" "$2" "$3" "$4]
    three_names[$1" "$2" "$3]
    two_names[$1" "$2]
    next
}
{
    four_example=$(NF-3)" "$(NF-2)" "$(NF-1)" "$NF
    three_example=$(NF-2)" "$(NF-1)" "$NF
    two_example=$(NF-1)" "$NF
    if (four_example in four_names)
    {
        print $1, four_example
    }
    else if (three_example in three_names)
    {
        print $1, three_example
    }
    else if (two_example in two_names)
    {
        print $1, two_example
    }
}' bacteria_names example.tsv
# Results:
BCM92732.1 Abditibacteriota bacterium
VTR99890.1 Tuwongella immobilis
QEL17956.1 Limnoglobus roseus
AMV31440.1 Pirellula sp. SH-Sr6A
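If you want to try the matching step without downloading the taxonomy dump, here is a minimal sketch of the same longest-match-first idea. It is a variation, not the script above: the name list is a tiny hand-written stand-in for bacteria_names, the function name clean_rows is made up for illustration, and full names are stored in a single array rather than three. It also skips candidates that would need more fields than the row has, which matters for short rows.

```shell
#!/bin/sh
# Sketch: longest-suffix lookup against a list of known species names.
# names.txt is a hand-written stand-in (assumption) for bacteria_names.

workdir=$(mktemp -d)

cat > "$workdir/names.txt" <<'EOF'
Abditibacteriota bacterium
Tuwongella immobilis
Limnoglobus roseus
Pirellula sp. SH-Sr6A
EOF

cat > "$workdir/example.tsv" <<'EOF'
BCM92732.1 sialidase Abditibacteriota bacterium
VTR99890.1 sialidase : Sialidase Precursor OS=Rhodopirellula baltica (strain SH1) GN=RB3353 PE=4 SV=1: BNR_2 Tuwongella immobilis
QEL17956.1 putative retaining sialidase Limnoglobus roseus
AMV31440.1 Sialidase precursor Pirellula sp. SH-Sr6A
EOF

# clean_rows NAMES TSV (hypothetical helper): print the ID plus the longest
# trailing run of 2-4 fields that is an exact entry in NAMES.
clean_rows() {
    awk '
    NR == FNR { names[$0]; next }        # first file: remember each full name
    {
        for (k = 4; k >= 2; k--) {       # try the longest candidate first
            if (NF <= k) continue        # need the ID plus k name words
            cand = $(NF - k + 1)
            for (i = NF - k + 2; i <= NF; i++) cand = cand " " $i
            if (cand in names) { print $1, cand; next }
        }
    }' "$1" "$2"
}

result=$(clean_rows "$workdir/names.txt" "$workdir/example.tsv")
printf '%s\n' "$result"
rm -rf "$workdir"
```

Rows whose trailing fields match no name are silently dropped, same as in the script above, so it is worth comparing the line counts of input and output on real data.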
rquery can do this without much effort.
$rq -q "p d/ /r| select @1,foreach(%-1,%,$) | f @%>2" samples/test1.csv -m error
BCM92732.1 Abditibacteriota bacterium
VTR99890.1 Tuwongella immobilis
QEL17956.1 Limnoglobus roseus
AMV31440.1 Pirellula sp. SH-Sr6A
or
$rq -q "p d/ /r| select @1,@(n-1),@n | f @%>2" samples/test1.csv -m error
BCM92732.1 Abditibacteriota bacterium
VTR99890.1 Tuwongella immobilis
QEL17956.1 Limnoglobus roseus
AMV31440.1 Pirellula sp. SH-Sr6A
Check out the latest version from here: https://github.com/fuyuncat/rquery/releases
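For comparison, if every species name were exactly two words, no lookup table or extra tool would be needed: plain awk can keep the first field and the last two. This is only a sketch under that assumption (the helper name last_two is made up for illustration), and it shows why the assumption is risky for this file: a three-word name such as "Pirellula sp. SH-Sr6A" loses its first word.

```shell
#!/bin/sh
# Sketch: keep the ID and the last two whitespace-separated fields.
# Assumes two-word species names; longer names are truncated.
last_two() {
    awk 'NF > 2 { print $1, $(NF - 1), $NF }' "$@"
}

out=$(printf '%s\n' \
    'BCM92732.1 sialidase Abditibacteriota bacterium' \
    'AMV31440.1 Sialidase precursor Pirellula sp. SH-Sr6A' | last_two)
printf '%s\n' "$out"
# BCM92732.1 Abditibacteriota bacterium
# AMV31440.1 sp. SH-Sr6A   <- genus is lost for the three-word name
```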