pyreadstat read and write spss without data loss

Question:

To read an spss .sav file using pandas/pyreadstat, you use:

df, meta = pyreadstat.read_sav()

to write a dataframe, you use:

pyreadstat.write_sav(df)

How can I read, edit, and write a .sav file without losing any metadata, such as labels and other properties that can be changed in SPSS?

If that is not entirely possible, what is the closest I can get to preserving the metadata?

Asked By: El Hocko


Answers:

The function write_sav has many arguments to set different pieces of metadata, for example column_labels, variable_value_labels etc.

When you use read_sav you get, in addition to the dataframe df, a metadata object meta, which holds many of these pieces of metadata from the original file. You can edit them (or not) and then pass them to the corresponding write_sav arguments so that the metadata is set in the new file.

Having said this, it is probably not possible to set every piece of metadata as it was in SPSS, so this is as close as you can get.
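
A minimal sketch of that round trip follows. The mapping from meta attributes to write_sav arguments is an assumption based on the attribute and argument names, so check it against the documentation for your installed version. The names write_kwargs_from_meta and roundtrip_sav are hypothetical helpers, not part of pyreadstat, and metadata not listed here (notes and variable formats, for example) is not carried over.

```python
def write_kwargs_from_meta(meta, columns):
    """Build keyword arguments for pyreadstat.write_sav from a read_sav metadata object.

    Only metadata for names in ``columns`` is kept, so dropped columns do not
    leave stale entries behind.
    """
    keep = set(columns)

    def only_kept(mapping):
        # Metadata dicts are keyed by variable name; drop entries for removed columns.
        return {name: value for name, value in (mapping or {}).items() if name in keep}

    labels = meta.column_names_to_labels or {}
    return {
        "file_label": meta.file_label or "",
        # One label per column, in the dataframe's current order (None = no label).
        "column_labels": [labels.get(name) for name in columns],
        "variable_value_labels": only_kept(meta.variable_value_labels),
        "missing_ranges": only_kept(meta.missing_ranges),
        "variable_display_width": only_kept(meta.variable_display_width),
        "variable_measure": only_kept(meta.variable_measure),
    }


def roundtrip_sav(src_path, dst_path, edit):
    """Read a .sav file, apply ``edit(df) -> df``, and write it back with its metadata."""
    import pyreadstat  # imported here so the helper above can be used without it

    # user_missing=True keeps user-defined missing values in the data and
    # fills meta.missing_ranges, so they can be written back.
    df, meta = pyreadstat.read_sav(src_path, user_missing=True)
    df = edit(df)
    pyreadstat.write_sav(df, dst_path, **write_kwargs_from_meta(meta, list(df.columns)))
```

Building the label list from the dataframe's current columns means that renamed or newly added columns simply get no label instead of shifting the labels of the others.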

Please read the documentation to see which arguments write_sav accepts and which pieces of metadata you get when reading with read_sav. The documentation also points to the places in the README that explain how to set the different pieces of metadata, so the README is a good source of information as well.

documentation
readme

Answered By: Otto Fajardo

Talk is cheap, here’s the code. 🙂

# using pyreadstat
import logging
import os
import pathlib
from io import BytesIO
from uuid import uuid4

from pandas import DataFrame
from pyreadstat import metadata_container, write_sav

logger = logging.getLogger(__name__)


class TempFile(type(pathlib.Path())):  # type: ignore
    """A path whose file is deleted when the ``with`` block exits."""

    def __enter__(self):
        # Defined explicitly: pathlib.Path is no longer a context manager in Python 3.13+.
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        filepath = str(self.absolute())
        try:
            os.remove(filepath)
        except OSError:
            logger.exception('remove temporary file: %s failed!', filepath)
        self._closed = True


class SpssTool:
    @classmethod
    def to_spss(cls, df: DataFrame, io: BytesIO, metadata: metadata_container, *, compress: bool = False):
        """Writes a pandas dataframe to a BytesIO object.

        Parameters
        ----------
        df : pandas.DataFrame
            pandas data frame to write to sav or zsav
        io : BytesIO
            the buffer to save the spss file to
        metadata : metadata_container
            spss file metadata container
        compress : bool
            whether to compress to zsav.
        """

        # get_legal_column_names is not shown in this answer; see the explanations below.
        df.columns = SpssTool.get_legal_column_names(df.columns.to_list())

        with TempFile(f'/tmp/{uuid4().hex}.{"zsav" if compress else "sav"}') as fp:
            write_sav(
                df=df,
                dst_path=str(fp),
                column_labels=metadata.column_labels if metadata else None,
                # Without this the file gets a .zsav name but is written uncompressed.
                compress=compress,
                variable_value_labels=dict(metadata.variable_value_labels) if metadata else {},
                variable_measure=metadata.variable_measure if metadata else None,
            )
            io.write(fp.read_bytes())
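
The temporary-file step above can also be sketched with the standard library's tempfile module instead of a custom Path subclass. This is an alternative, not the approach used in the answer: write_via_temp_file is a hypothetical helper, and the writer is passed in as a callable so the sketch does not depend on pyreadstat itself.

```python
import tempfile
from io import BytesIO
from pathlib import Path
from typing import Callable


def write_via_temp_file(write: Callable[[str], object], buffer: BytesIO, suffix: str = ".sav") -> int:
    """Call ``write(path)`` on a throwaway path, then copy that file into ``buffer``.

    Returns the number of bytes copied.
    """
    # TemporaryDirectory removes the directory and its contents on exit,
    # even if ``write`` raises.
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / f"export{suffix}"
        write(str(path))
        return buffer.write(path.read_bytes())


# Hypothetical usage with pyreadstat (assumes it is installed and df is a DataFrame):
#   buf = BytesIO()
#   write_via_temp_file(lambda p: pyreadstat.write_sav(df, p), buf)
```

Using a temporary directory also avoids hard-coding /tmp, which does not exist on every platform.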

Some explanations:

  • SpssTool.get_legal_column_names

This is needed because SPSS restricts which variable (column) names are legal; see the official documentation for details: https://www.ibm.com/docs/en/spss-statistics/27.0.0?topic=view-variable-names

  • metadata_container

This is imported with: from pyreadstat import metadata_container. It is the container that holds the information about the dataset; you can find more detail at: https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html#metadata-object-description

Those may be what you need.
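
The body of get_legal_column_names is not shown in this answer. A hypothetical sketch of such a helper is given below, written as a plain function and based on a simplified reading of the SPSS naming rules (start with a letter or @, no spaces, no trailing period or underscore, no reserved keywords, at most 64 bytes, unique regardless of case). It is deliberately conservative: it replaces every non-ASCII character with an underscore, even though SPSS allows some of them.

```python
import re

# Reserved keywords that cannot be used as SPSS variable names.
RESERVED = {"ALL", "AND", "BY", "EQ", "GE", "GT", "LE", "LT", "NE", "NOT", "OR", "TO", "WITH"}
MAX_BYTES = 64  # names are ASCII-only after cleaning, so bytes == characters


def get_legal_column_names(names):
    """Return SPSS-safe, case-insensitively unique versions of ``names``."""
    legal, seen = [], set()
    for raw in names:
        # Replace anything outside a conservative ASCII whitelist.
        name = re.sub(r"[^0-9A-Za-z_.@#$]", "_", str(raw))
        # The first character must be a letter (or @); prefix otherwise.
        if not re.match(r"[A-Za-z@]", name):
            name = "V" + name
        # Names must not end with a period; a trailing underscore is discouraged.
        name = name[:MAX_BYTES].rstrip("._")
        if name.upper() in RESERVED:
            name += "_1"
        # Resolve collisions with a numeric suffix, staying within the length limit.
        candidate, n = name, 1
        while candidate.upper() in seen:
            suffix = f"_{n}"
            candidate = name[:MAX_BYTES - len(suffix)] + suffix
            n += 1
        seen.add(candidate.upper())
        legal.append(candidate)
    return legal
```

Note that renaming columns this way changes the keys that metadata such as variable_value_labels must use, so any metadata dictionaries keyed by column name need the same renaming before they are passed to write_sav.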

Answered By: tomy0608