pyreadstat: read and write SPSS files without data loss
Question:
To read an SPSS .sav file using pandas/pyreadstat, you use:
df, meta = pyreadstat.read_sav()
To write a dataframe, you use:
pyreadstat.write_sav(df)
How can I read, edit, and write a .sav file without losing any metadata, such as labels and other attributes that can be changed in SPSS?
If that is not entirely possible, what is the closest I can get to a lossless round trip?
Answers:
The function write_sav has many arguments for setting different pieces of metadata, for example column_labels, variable_value_labels, etc.
When using read_sav you get, in addition to the dataframe df, a metadata object meta that holds many of these pieces of metadata from the original file. You can edit them (or not) and then pass them to the corresponding write_sav arguments so that the metadata is written back.
Having said this, it is probably not possible to set every piece of metadata exactly as it was in SPSS, so this is as close as you can get.
Please read the documentation to see which arguments write_sav accepts and which pieces of metadata you get when reading with read_sav. The documentation also points to sections of the README that explain how to set the different pieces of metadata, so the README is a good source of information as well.
Talk is cheap, here’s the code. 🙂
# using pyreadstat
import logging
import os
import pathlib
from io import BytesIO
from uuid import uuid4

from pandas import DataFrame
from pyreadstat import metadata_container, write_sav

logger = logging.getLogger(__name__)


class TempFile(type(pathlib.Path())):  # type: ignore
    """A pathlib.Path that deletes its file on context-manager exit."""

    def __enter__(self):
        # explicit __enter__ so this keeps working on Python versions
        # where Path no longer acts as a context manager
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        filepath = str(self.absolute())
        try:
            os.remove(filepath)
        except OSError:
            logger.exception('removing temporary file %s failed!', filepath)
        self._closed = True


class SpssTool:

    @classmethod
    def to_spss(cls, df: DataFrame, io: BytesIO, metadata: metadata_container, *, compress: bool = False):
        """Write a pandas dataframe to a BytesIO object.

        Parameters
        ----------
        df : pandas.DataFrame
            pandas data frame to write to sav or zsav
        io : BytesIO
            the buffer in which to save the SPSS file
        metadata : metadata_container
            SPSS file metadata container
        compress : bool
            whether to compress to zsav
        """
        df.columns = SpssTool.get_legal_column_names(df.columns.to_list())
        with TempFile(f'/tmp/{uuid4().hex}.{"zsav" if compress else "sav"}') as fp:
            write_sav(
                df=df,
                dst_path=str(fp),
                compress=compress,  # write_sav's compress flag actually produces the zsav
                column_labels=metadata.column_labels if metadata else None,
                variable_value_labels=dict(metadata.variable_value_labels) if metadata else {},
                variable_measure=metadata.variable_measure if metadata else None,
            )
            io.write(fp.read_bytes())
Some explanations:
SpssTool.get_legal_column_names
This (not shown above) is needed because SPSS has restrictions on variable names; see the official documentation for details: https://www.ibm.com/docs/en/spss-statistics/27.0.0?topic=view-variable-names
metadata_container
This comes from from pyreadstat import metadata_container. It is the container holding info about the dataset; you can find more detail in: https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html#metadata-object-description
That may be what you need.
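Since get_legal_column_names is not shown, here is a minimal sketch of what such a helper could look like. The function name and the exact rules are assumptions based on the SPSS variable-name restrictions linked above, not pyreadstat API, and it does not deduplicate names that collide after sanitizing:

```python
import re

def get_legal_column_names(columns):
    """Best-effort sanitizer for SPSS variable names (a sketch, not pyreadstat API).

    SPSS variable names must start with a letter (or @, #, $), contain no
    spaces or most punctuation, not end with a period, and be at most
    64 bytes long.
    """
    legal = []
    for name in columns:
        # replace illegal characters with underscores
        name = re.sub(r'[^A-Za-z0-9@#$_.]', '_', str(name))
        # must not start with a digit or other illegal first character
        if not re.match(r'[A-Za-z@#$]', name):
            name = 'v_' + name
        # must not end with a period
        name = name.rstrip('.')
        # truncate to SPSS's 64-byte limit (approximated here as 64 characters)
        legal.append(name[:64])
    return legal
```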