IndexError: tuple index out of range when creating PySpark DataFrame

Question:

I want to create test data in a PySpark DataFrame, but I always get the same "tuple index out of range" error. I do not get this error when reading a CSV. I would appreciate any thoughts on why I'm getting this error.

The first thing I tried was to create a pandas DataFrame and convert it to a PySpark DataFrame:

columns = ["id","col_"]
data = [("1", "blue"), ("2", "green"), 
        ("3", "purple"), ("4", "red"), 
        ("5", "yellow")]

df = pd.DataFrame(data=data, columns=columns)

sparkdf = spark.createDataFrame(df)
sparkdf.show()

Output:

PicklingError: Could not serialize object: IndexError: tuple index out of range

I get the same error if I try to create the DataFrame from an RDD, per the SparkByExamples.com instructions:

rdd = spark.sparkContext.parallelize(data)
sparkdf = spark.createDataFrame(rdd).toDF(*columns)
sparkdf.show()

I also tried the following and got the same error:

import pyspark.pandas as ps
df1 = ps.from_pandas(df)

Here is the full error when running the above code:

IndexError                                Traceback (most recent call last)
File c:\Users\jonat\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\serializers.py:458, in CloudPickleSerializer.dumps(self, obj)
    457 try:
--> 458     return cloudpickle.dumps(obj, pickle_protocol)
    459 except pickle.PickleError:

File c:\Users\jonat\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\cloudpickle\cloudpickle_fast.py:73, in dumps(obj, protocol, buffer_callback)
     70 cp = CloudPickler(
     71     file, protocol=protocol, buffer_callback=buffer_callback
     72 )
---> 73 cp.dump(obj)
     74 return file.getvalue()

File c:\Users\jonat\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\cloudpickle\cloudpickle_fast.py:602, in CloudPickler.dump(self, obj)
    601 try:
--> 602     return Pickler.dump(self, obj)
    603 except RuntimeError as e:

File c:\Users\jonat\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\cloudpickle\cloudpickle_fast.py:692, in CloudPickler.reducer_override(self, obj)
    691 elif isinstance(obj, types.FunctionType):
--> 692     return self._function_reduce(obj)
    693 else:
    694     # fallback to save_global, including the Pickler's
    695     # dispatch_table

File c:\Users\jonat\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\cloudpickle\cloudpickle_fast.py:565, in CloudPickler._function_reduce(self, obj)
    564 else:
--> 565     return self._dynamic_function_reduce(obj)

File c:\Users\jonat\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\cloudpickle\cloudpickle_fast.py:546, in CloudPickler._dynamic_function_reduce(self, func)
    545 newargs = self._function_getnewargs(func)
--> 546 state = _function_getstate(func)
    547 return (types.FunctionType, newargs, state, None, None,
    548         _function_setstate)

File c:\Users\jonat\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\cloudpickle\cloudpickle_fast.py:157, in _function_getstate(func)
    146 slotstate = {
    147     "__name__": func.__name__,
    148     "__qualname__": func.__qualname__,
   (...)
    154     "__closure__": func.__closure__,
    155 }
--> 157 f_globals_ref = _extract_code_globals(func.__code__)
    158 f_globals = {k: func.__globals__[k] for k in f_globals_ref if k in
    159              func.__globals__}

File c:\Users\jonat\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\cloudpickle\cloudpickle.py:334, in _extract_code_globals(co)
    331 # We use a dict with None values instead of a set to get a
    332 # deterministic order (assuming Python 3.6+) and avoid introducing
    333 # non-deterministic pickle bytes as a results.
--> 334 out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
    336 # Declaring a function inside another one using the "def ..."
    337 # syntax generates a constant code object corresponding to the one
    338 # of the nested function's As the nested function may itself need
    339 # global variables, we need to introspect its code, extract its
    340 # globals, (look for code object in it's co_consts attribute..) and
    341 # add the result to code_globals

File c:\Users\jonat\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\cloudpickle\cloudpickle.py:334, in <dictcomp>(.0)
    331 # We use a dict with None values instead of a set to get a
    332 # deterministic order (assuming Python 3.6+) and avoid introducing
    333 # non-deterministic pickle bytes as a results.
--> 334 out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
    336 # Declaring a function inside another one using the "def ..."
    337 # syntax generates a constant code object corresponding to the one
    338 # of the nested function's As the nested function may itself need
    339 # global variables, we need to introspect its code, extract its
    340 # globals, (look for code object in it's co_consts attribute..) and
    341 # add the result to code_globals

IndexError: tuple index out of range

During handling of the above exception, another exception occurred:

PicklingError                             Traceback (most recent call last)
Cell In [67], line 2
      1 rdd = spark.sparkContext.parallelize(data)
----> 2 df1 = ps.from_pandas(df)
      3 sparkdf = spark.createDataFrame(rdd).toDF(*columns)
      4 #Create a dictionary from each row in col_

File c:\Users\jonat\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\pandas\namespace.py:153, in from_pandas(pobj)
    151     return Series(pobj)
    152 elif isinstance(pobj, pd.DataFrame):
--> 153     return DataFrame(pobj)
    154 elif isinstance(pobj, pd.Index):
    155     return DataFrame(pd.DataFrame(index=pobj)).index

File c:\Users\jonat\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\pandas\frame.py:450, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    448     else:
    449         pdf = pd.DataFrame(data=data, index=index, columns=columns, dtype=dtype, copy=copy)
--> 450     internal = InternalFrame.from_pandas(pdf)
    452 object.__setattr__(self, "_internal_frame", internal)
...
    466     msg = "Could not serialize object: %s: %s" % (e.__class__.__name__, emsg)
    467 print_exec(sys.stderr)
--> 468 raise pickle.PicklingError(msg)

PicklingError: Could not serialize object: IndexError: tuple index out of range
Asked By: Jonathan P


Answers:

After doing some reading, I checked https://pyreadiness.org/3.11 and it looks like the latest version of Python (3.11) is not yet supported by PySpark. I was able to resolve this problem by downgrading to Python 3.9.
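
For anyone who runs into this: the traceback bottoms out in the copy of cloudpickle bundled with PySpark, which could not handle the bytecode changes introduced in Python 3.11 until PySpark added 3.11 support (in the 3.4 release line). That also lines up with reading a CSV working fine, since that path does not need cloudpickle to ship local Python objects to the executors. Below is a minimal sketch: first a sanity check of the version combination, then the standard PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON environment variables as an alternative to downgrading system-wide. The interpreter path is a hypothetical example; substitute your own Python 3.9 install.

import os
import sys

import pyspark

# PySpark releases before 3.4 do not support Python 3.11, which is what
# triggers the cloudpickle IndexError above.
print("Python :", sys.version.split()[0])
print("PySpark:", pyspark.__version__)

# Point PySpark at a supported interpreter *before* the SparkSession is
# created. The path below is a hypothetical example.
os.environ["PYSPARK_PYTHON"] = r"C:\Python39\python.exe"
os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:\Python39\python.exe"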

Answered By: Jonathan P