How to output only DataFrame column names to CSV?

Question:

Use case: golfing on the CLI in a utility function that I can’t afford to make complicated.

I need to peek at only the column names of a large file in a binary format, not the column names plus, say, the first data row.

In my current implementation, I have to write a burdensome command that peeks at the first row of a large file and then discards it:

my-tool peek -n 1 huge-file.parquet | head -n 1 | tr ',' '\n' | less

What I would like is:

my-tool peek --cols huge-file.parquet | tr ',' '\n' | less

or

my-tool peek --cols -d '\n' huge-file.parquet | less

I want to do this without getting complicated in Python. I currently use the following mechanism to generate the CSV:

from io import StringIO

out = StringIO()
df.to_csv(out)
print(out.getvalue())

Is there a DataFrame-ish way to output just the column names to out via to_csv(...) or a similarly simple technique?

Asked By: Chris


Answers:

Maybe something like this?

import pandas as pd
import numpy as np


if __name__ == "__main__":
    # some fake data for setup
    np.random.seed(1)
    df = pd.DataFrame(
        data=np.random.random(size=(5, 5)),
        columns=list("abcde")
    )

    out = df.columns.to_frame(name="columns")
    out.to_csv("file.csv", index=False)
    print(out)

This prints:

  columns
a       a
b       b
c       c
d       d
e       e

and file.csv contains a single "columns" header followed by one column name per row.
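
If the goal is the plain header line the question asks for, rather than a one-column table, a possible alternative is to drop every row before calling to_csv. The following is a minimal sketch, assuming the frame is already loaded; the small df here is a hypothetical stand-in for the real data, and the variable names are illustrative only.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the frame loaded from the large file.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# head(0) keeps the column labels but drops every row, so to_csv
# writes only the header line.
out = StringIO()
df.head(0).to_csv(out, index=False)
header_line = out.getvalue()

# One name per line, without going through to_csv at all.
names_per_line = "\n".join(map(str, df.columns))

print(header_line, end="")
print(names_per_line)
```

The first form stays within the existing to_csv mechanism, so any delimiter handling already in place still applies. The second form covers the "one name per line" case directly and avoids passing a newline as a CSV delimiter.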

Answered By: Ian Thompson