Removing elements of dataframe with different number of columns

Question:

I have a tsv file that looks like this:

BCM92732.1  sialidase   Abditibacteriota    bacterium
VTR99890.1  sialidase   :   Sialidase   Precursor   OS=Rhodopirellula   baltica (strain SH1)    GN=RB3353   PE=4    SV=1:   BNR_2   Tuwongella  immobilis
QEL17956.1  putative    retaining   sialidase   Limnoglobus roseus
AMV31440.1  Sialidase   precursor   Pirellula   sp. SH-Sr6A

I want to "clean" this file by removing some columns, so that it looks like this:

BCM92732.1 Abditibacteriota bacterium
VTR99890.1 Tuwongella   immobilis
QEL17956.1 Limnoglobus  roseus
AMV31440.1 Pirellula    sp. SH-Sr6A

I thought about removing columns, but the number of columns differs between rows.
I'm using bash for this task.
Is there a better way to do this, for example with Python or Perl?

Asked By: Mauri1313


Answers:

Here is one approach that gives the expected output for the example data:

# Create a working dir
mkdir -p taxdump
cd taxdump

# Create your 'example.tsv' file
cat <<EOF > example.tsv
BCM92732.1  sialidase   Abditibacteriota    bacterium
VTR99890.1  sialidase   :   Sialidase   Precursor   OS=Rhodopirellula   baltica (strain SH1)    GN=RB3353   PE=4    SV=1:   BNR_2   Tuwongella  immobilis
QEL17956.1  putative    retaining   sialidase   Limnoglobus roseus
AMV31440.1  Sialidase   precursor   Pirellula   sp. SH-Sr6A
EOF

# Download and unpack the NCBI taxonomy database (~50 MB)
curl "https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz" -o taxdump.tar.gz
tar -zxvf taxdump.tar.gz

# Grab the second column (the species names) and clean up the whitespace
awk '
BEGIN{FS="|"}
$0 ~ "scientific name" {
    gsub("\t", "", $2)
    sub("^ ", "", $2)
    print $2
}' names.dmp > bacteria_names

# Use AWK to cycle through bacteria_names (i.e. every species in the db)
# and check for matches with your tsv file ("example.tsv").
# Start by looking at the last four columns of example.tsv,
# then the last three columns, then the last two columns.

awk '
NR==FNR {
    four_names[$1" "$2" "$3" "$4]
    three_names[$1" "$2" "$3]
    two_names[$1" "$2]
    next
}
{
    four_example=$(NF-3)" "$(NF-2)" "$(NF-1)" "$NF
    three_example=$(NF-2)" "$(NF-1)" "$NF
    two_example=$(NF-1)" "$NF
    if (four_example in four_names)
    {
        print $1, four_example
    }
    else if (three_example in three_names)
    {
        print $1, three_example
    }
    else if (two_example in two_names)
    {
        print $1, two_example
    }
}' bacteria_names example.tsv

# Results:
BCM92732.1 Abditibacteriota bacterium
VTR99890.1 Tuwongella immobilis
QEL17956.1 Limnoglobus roseus
AMV31440.1 Pirellula sp. SH-Sr6A
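
To try the matching step without the taxonomy download, it can be run against a small hand-made names file. The sketch below is a compact variant of the AWK script above, not a replacement for it: the names list is a made-up stand-in for bacteria_names, and match_taxa is a hypothetical helper name. It assumes the words of each name in the list are separated by single spaces.

```shell
# Tiny stand-in for bacteria_names (hypothetical subset, not the real taxdump)
names_file=$(mktemp)
cat > "$names_file" <<'EOF'
Abditibacteriota bacterium
Tuwongella immobilis
Limnoglobus roseus
Pirellula sp. SH-Sr6A
EOF

# match_taxa NAMES TSV
# For each TSV row, print the ID (first field) plus the longest
# 4-, 3- or 2-word suffix of the row that appears in NAMES.
match_taxa() {
    awk '
    NR==FNR { names[$0]; next }          # first file: load the known names
    {
        for (n = 4; n >= 2; n--) {       # try the longest suffix first
            if (NF <= n) continue        # need the ID plus n more words
            s = $(NF-n+1)
            for (i = NF-n+2; i <= NF; i++) s = s " " $i
            if (s in names) { print $1, s; break }
        }
    }' "$1" "$2"
}
```

Calling match_taxa "$names_file" example.tsv should then print one line per row whose trailing words match a listed name; rows with no match are skipped silently, as in the script above.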
Answered By: jared_mamrot

rquery can do this without much effort.

$rq -q "p d/  /r| select @1,foreach(%-1,%,$) | f @%>2" samples/test1.csv -m error
BCM92732.1       Abditibacteriota       bacterium
VTR99890.1       Tuwongella     immobilis
QEL17956.1       Limnoglobus    roseus
AMV31440.1       Pirellula       sp. SH-Sr6A

or

$rq -q "p d/  /r| select @1,@(n-1),@n | f @%>2" samples/test1.csv -m error
BCM92732.1       Abditibacteriota       bacterium
VTR99890.1       Tuwongella     immobilis
QEL17956.1       Limnoglobus    roseus
AMV31440.1       Pirellula       sp. SH-Sr6A

Check out the latest version from here: https://github.com/fuyuncat/rquery/releases
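
If installing another tool is not an option, the selection these commands appear to make (the first field plus the last two fields of every row with more than two fields) can be sketched in plain AWK. This is an assumption-laden sketch: it assumes the file uses real tab separators and that the organism name always occupies exactly the last two tab-separated fields. first_and_last_two is a hypothetical helper name.

```shell
# first_and_last_two FILE
# Print field 1 and the last two tab-separated fields of every row
# that has more than two fields (assumes the name is the last two fields).
first_and_last_two() {
    awk 'BEGIN { FS = OFS = "\t" } NF > 2 { print $1, $(NF-1), $NF }' "$1"
}
```

Unlike a lookup against a taxonomy list, this gives wrong results for any row whose organism name spans more or fewer than two fields.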

Answered By: WeDBA