PySpark: Create subset dataframe with rows from another based on a column condition

Question:

So, I have been trying to create two subsets of a dataframe based on a condition involving another dataframe. To put this into perspective, let’s say I have three dataframes:

  • Ratings: contains BOTH series’ and movies’ ratings
| ID | Rating |
| -- | ------ |
| 1  | 7      |
| 2  | 9      |
| 3  | 5      |
| 4  | 10     |
| 5  | 2      |
| 6  | 9      |
  • Movies: only contains movies
| ID | Name      |
| -- | --------- |
| 2  | John Wick |
| 4  | Titanic   |
| 5  | Sharknado |
  • Series: only contains series
| ID | Name         |
| -- | ------------ |
| 1  | Breaking Bad |
| 3  | Friends      |
| 6  | The Office   |
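
For anyone who wants to reproduce this, the three toy dataframes above can be built with spark.createDataFrame (a minimal sketch, assuming an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

ratings = spark.createDataFrame(
    [(1, 7), (2, 9), (3, 5), (4, 10), (5, 2), (6, 9)], ["ID", "Rating"])
movies = spark.createDataFrame(
    [(2, "John Wick"), (4, "Titanic"), (5, "Sharknado")], ["ID", "Name"])
series = spark.createDataFrame(
    [(1, "Breaking Bad"), (3, "Friends"), (6, "The Office")], ["ID", "Name"])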

How can I divide Ratings into Ratings_movies (only contains movie ratings) and Ratings_series (only contains series ratings)? I have already tried filters, and even merging the dataframes so that I could then drop the rows where the name is null, but to no avail.

In my case, ratsfull is the "Ratings" dataframe:

  • Filter attempt
ratsfull_movies = ratsfull.filter(ratsfull.anime_id.isin(movies.ID))

Results in:

AnalysisException                         Traceback (most recent call last)
Input In [29], in <cell line: 1>()
----> 1 ratsfull_movies = ratsfull.filter(ratsfull.anime_id.isin(movies.ID))

File ~\Documents\spark\spark-3.0.1-bin-hadoop2.7\python\pyspark\sql\dataframe.py:1461, in DataFrame.filter(self, condition)
   1459     jdf = self._jdf.filter(condition)
   1460 elif isinstance(condition, Column):
-> 1461     jdf = self._jdf.filter(condition._jc)
   1462 else:
   1463     raise TypeError("condition should be string or Column")

File ~\Documents\spark\spark-3.0.1-bin-hadoop2.7\python\py4j\java_gateway.py:1304, in JavaMember.__call__(self, *args)
   1298 command = proto.CALL_COMMAND_NAME +\
   1299     self.command_header +\
   1300     args_command +\
   1301     proto.END_COMMAND_PART
   1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
   1305     answer, self.gateway_client, self.target_id, self.name)
   1307 for temp_arg in temp_args:
   1308     temp_arg._detach()

File ~\Documents\spark\spark-3.0.1-bin-hadoop2.7\python\pyspark\sql\utils.py:134, in capture_sql_exception.<locals>.deco(*a, **kw)
    130 converted = convert_exception(e.java_exception)
    131 if not isinstance(converted, UnknownException):
    132     # Hide where the exception came from that shows a non-Pythonic
    133     # JVM exception message.
--> 134     raise_from(converted)
    135 else:
    136     raise

File <string>:3, in raise_from(e)

AnalysisException: Resolved attribute(s) ID#302 missing from user_id#1021,anime_id#1025,rating#1029 in operator !Filter anime_id#1025 IN (ID#302).;;
!Filter anime_id#1025 IN (ID#302)
+- Project [user_id#1021, anime_id#1025, cast(rating#914 as int) AS rating#1029]
   +- Project [user_id#1021, cast(anime_id#913 as int) AS anime_id#1025, rating#914]
      +- Project [cast(user_id#912 as int) AS user_id#1021, anime_id#913, rating#914]
         +- Union
            :- Relation[user_id#912,anime_id#913,rating#914] csv
            +- Relation[_c0#958,_c1#959,_c2#960] csv
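
Side note on this error: Column.isin expects plain Python values, not a Column from another dataframe, which is why Spark cannot resolve ID#302 against ratsfull's columns. A sketch of a filter that would work, assuming movies is small enough to collect to the driver:

# Collect the movie IDs to the driver so that isin receives literal values
movie_ids = [row.ID for row in movies.select("ID").collect()]
ratsfull_movies = ratsfull.filter(ratsfull.anime_id.isin(movie_ids))
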
  • Merge attempt
ratsfull_movies = ratsfull.merge(movies, how = "outer", left_on = "anime_id", right_on = "ID")
ratsfull_movies.show()

Results in:

AttributeError                            Traceback (most recent call last)
Input In [30], in <cell line: 1>()
----> 1 ratsfull_movies = ratsfull.merge(movies, how = "outer", left_on = "anime_id", right_on = "ID")

File ~\Documents\spark\spark-3.0.1-bin-hadoop2.7\python\pyspark\sql\dataframe.py:1400, in DataFrame.__getattr__(self, name)
   1394 """Returns the :class:`Column` denoted by ``name``.
   1395 
   1396 >>> df.select(df.age).collect()
   1397 [Row(age=2), Row(age=5)]
   1398 """
   1399 if name not in self.columns:
-> 1400     raise AttributeError(
   1401         "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
   1402 jc = self._jdf.apply(name)
   1403 return Column(jc)

AttributeError: 'DataFrame' object has no attribute 'merge'
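
Note: PySpark dataframes do not have pandas' merge method; the equivalent operation is join. A minimal sketch of the intended outer join, keeping the column names above:

# join is PySpark's counterpart to pandas merge(how="outer")
ratsfull_movies = ratsfull.join(movies, ratsfull.anime_id == movies.ID, "outer")
ratsfull_movies.show()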

The desired output is two dataframes:

  • Ratings_movies (only contains movie ratings)
| ID | Rating |
| -- | ------ |
| 2  | 9      |
| 4  | 10     |
| 5  | 2      |
  • Ratings_series (only contains series ratings)
| ID | Rating |
| -- | ------ |
| 1  | 7      |
| 3  | 5      |
| 6  | 9      |

Any idea/help is appreciated!

Asked By: AWDn0n


Answers:

So, after trying multiple things, a friend came up with the idea of using an inner join. The logic behind this code is:

  • Isolate the IDs from the movies dataframe
  • Do an inner join, which automatically keeps only the matching rows
  • Drop the ID column from the first step, left over from the join
movies_id = movies.select("ID")
rats_movies = ratsfull.join(movies_id, ratsfull.anime_id == movies_id.ID, "inner")
rats_movies = rats_movies.drop(rats_movies.ID)
rats_movies.show()

Works wonders and it is not much code. Leaving it here in case anyone else finds it useful.
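
A variant worth mentioning (not part of the original answer): Spark's left_semi join keeps only the rows of the left dataframe that have a match on the right, without adding any columns, so the drop step is not needed, and left_anti gives the series subset the same way:

# left_semi keeps ratings whose anime_id appears in movies; no columns are added
ratings_movies = ratsfull.join(movies, ratsfull.anime_id == movies.ID, "left_semi")

# left_anti keeps ratings with no match in movies, i.e. the series ratings
ratings_series = ratsfull.join(movies, ratsfull.anime_id == movies.ID, "left_anti")

ratings_movies.show()
ratings_series.show()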

Answered By: AWDn0n