Spark join throws 'function' object has no attribute '_get_object_id' error. How could I fix it?

Question:

I am making a query in Spark in Databricks, and I have a problema when I am trying to make a join between two dataframes. The two dataframes that I have are the next ones:

  • “names_df” which has 2 columns: “ID”, “title” that refer to the id and the title of films.

    +-------+-----------------------------+
    |ID     |title                        |
    +-------+-----------------------------+
    |1      |Toy Story                    |
    |2      |Jumanji                      |
    |3      |Grumpier Old Men             |
    +-------+-----------------------------+
    
  • “info” which has 3 columns: “movieId”, “count”, “average” that refer to the id of the film, the number of ranks that it has, and the average of those ratings.

    +-------+-----+------------------+
    |movieId|count|average           |
    +-------+-----+------------------+
    |1831   |7463 |2.5785207021305103|
    |431    |8946 |3.695059244355019 |
    |631    |2193 |2.7273141814865483|
    +-------+-----+------------------+
    

This “info” dataframe was created this way:

info =  ratings_df.groupBy('movieId').agg(F.count(ratings_df.rating).alias("count"), F.avg(ratings_df.rating).alias("average"))

Where “ratings_df” is another dataframe that contains 3 columns: “userId”, “movieId” and “rating”, that refer to the id of the user that voted, the id of the film that the user voted to, and the rating for that film:

+-------+-------+-------------+
|userId |movieId|rating       |
+-------+-------+-------------+
|1      |2      |3.5          |
|1      |29     |3.5          |
|1      |32     |3.5          |
+-------+-------+-------------+

I am trying to make a join between these two dataframes to get another one with those columns: “movieId”, “title”, “count”, “average”:

+-------+-----------------------------+-----+-------+
|average|title                        |count|movieId|
+-------+-----------------------------+-----+-------+
|5.0    |Ella Lola, a la Trilby (1898)|1    |94431  |
|5.0    |Serving Life (2011)          |1    |129034 |
|5.0    |Diplomatic Immunity (2009? ) |1    |107434 |
+-------+-----------------------------+-----+-------+

So the operation I did was the next one:

movie_names_df = info.join(movies_df, info.movieId == movies_df.ID, "inner").select(movies_df.title, info.average, info.movieId, info.count).show()

The problem is that I get the next error message:

AttributeError: 'function' object has no attribute '_get_object_id'

And I know that this error occurs because it consider that info.count is a function, and not an attribute, as I defined previously.

So, how could I make that join correctly to get what I want?

Thank you so much!

Asked By: jartymcfly

||

Answers:

Adding comment as answer since it solved the problem. count is somewhat of a protected keyword in DataFrame API, so naming columns count is dangerous. In your case you could circumvent the error by not using the dot notation, but bracket based column access, e.g.

info["count"]
Answered By: LiMuBei

Try to get info.count as a function call info.count().

movie_names_df = info.join(movies_df, info.movieId == movies_df.ID, "inner").select(movies_df.title, info.average, info.movieId, info.count()).show()
Answered By: Samyuktha