SQL how to prevent duplicates in STRING_AGG when joining multiple tables

Question:

Here’s the relational model for the database that I’m going to talk about (a show can have multiple actors and multiple genres): Relational model
(sorry, don’t have enough reputation to insert an image)

What I would like to achieve is to get every detail about a specific show + the genres that it has + the actors that played in the show and their IDs.

I tried:

SELECT shows.*,
       STRING_AGG(g.name, ', ')                AS genres,
       STRING_AGG(CAST(a.id AS VARCHAR), ', ') AS actor_ids,
       STRING_AGG(a.name, ', ')                AS actors
FROM shows
         LEFT JOIN show_genres sg ON shows.id = sg.show_id
         LEFT JOIN genres g ON g.id = sg.genre_id
         LEFT JOIN show_characters sc ON shows.id = sc.show_id
         LEFT JOIN actors a ON sc.actor_id = a.id
WHERE shows.id = 1390
GROUP BY shows.id

The result:

genres actor_ids actors
Action, Action, Action, Action, Action, Action, Action, Action, Action, Action, Action, Action, Action, Action, Action, Action, Action, Action, Action, Action, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy 436511, 436512, 412933, 9739, 414839, 436539, 436541, 11439, 419217, 415442, 413734, 432873, 420752, 421890, 436592, 448107, 458178, 150561, 436590, 412041, 436511, 436512, 412933, 9739, 414839, 436539, 436541, 11439, 419217, 415442, 413734, 432873, 420752, 421890, 436592, 448107, 458178, 150561, 436590, 412041, 436511, 436512, 412933, 9739, 414839, 436539, 436541, 11439, 419217, 415442, 413734, 432873, 420752, 421890, 436592, 448107, 458178, 150561, 436590, 412041, 436511, 436512, 412933, 9739, 414839, 436539, 436541, 11439, 419217, 415442, 413734, 432873, 420752, 421890, 436592, 448107, 458178, 150561, 436590, 412041 Emilia Clarke, Kit Harington, Peter Dinklage, Lena Headey, Nikolaj Coster-Waldau, Sophie Turner, Maisie Williams, Iain Glen, Alfie Allen, Liam Cunningham, John Bradley, Conleth Hill, Aidan Gillen, Gwendoline Christie, Isaac Hempstead-Wright, Kristofer Hivju, Nathalie Emmanuel, Jacob Anderson, Jerome Flynn, Rory McCann, Emilia Clarke, Kit Harington, Peter Dinklage, Lena Headey, Nikolaj Coster-Waldau, Sophie Turner, Maisie Williams, Iain Glen, Alfie Allen, Liam Cunningham, John Bradley, Conleth Hill, Aidan Gillen, Gwendoline Christie, Isaac Hempstead-Wright, Kristofer Hivju, Nathalie Emmanuel, Jacob Anderson, Jerome Flynn, Rory McCann, Emilia Clarke, Kit Harington, Peter Dinklage, Lena Headey, Nikolaj Coster-Waldau, Sophie Turner, Maisie Williams, Iain Glen, Alfie Allen, Liam Cunningham, John Bradley, Conleth Hill, Aidan Gillen, Gwendoline Christie, Isaac Hempstead-Wright, Kristofer Hivju, Nathalie Emmanuel, Jacob Anderson, Jerome Flynn, Rory McCann, Emilia Clarke, Kit Harington, Peter Dinklage, Lena Headey, Nikolaj Coster-Waldau, Sophie Turner, Maisie Williams, Iain Glen, Alfie Allen, Liam Cunningham, John Bradley, Conleth Hill, Aidan Gillen, Gwendoline Christie, Isaac Hempstead-Wright, Kristofer Hivju, Nathalie Emmanuel, Jacob Anderson, Jerome Flynn, Rory McCann

Note: only showing the problematic columns

The problem: After joining the "show_characters" and the "actors" table, there are a lot of duplicate data in the string aggregations.

Question: How would I solve this? Without having duplicate data.

My workaround:

@connection_handler
def get_show_by_id(cursor: 'RealDictCursor', id: int) -> 'RealDictRow':
    """

    Args:
        cursor: a cursor which returns dictionaries (use @connection.connection_handler decorator)
        id: number of items shows on a single page

    Returns:
        All show details in a RealDictRow + genres(as a concatenated string) + actors in a list
        containing the actors ids and names in a RealDictRow

    """
    query = """
    SELECT shows.*,STRING_AGG(g.name,',') AS genres
    FROM shows
    LEFT JOIN show_genres sg ON shows.id = sg.show_id
    LEFT JOIN genres g ON sg.genre_id = g.id
    WHERE shows.id = %s
    GROUP BY shows.id
    """
    val = (id,)
    cursor.execute(query, val)
    shows = cursor.fetchone()

    query = """
    SELECT a.id,a.name
    FROM actors a
    RIGHT JOIN show_characters sc ON a.id = sc.actor_id
    RIGHT JOIN shows s ON sc.show_id = s.id
    WHERE s.id = %s
    GROUP BY a.id
    """
    val = (id,)
    cursor.execute(query, val)
    actors = cursor.fetchall()
    shows["actors"] = actors
    return shows

Note for workaround: Used two separate queries to achieve the result I wanted, but I’d like to know how would I be able to do this with one SQL query.

Side note: I’d be really happy to receive any feedback about my code in general, or suggestion about how people usually handle when they get back multiple data from an SQL query, do they just split it? Or is it better to write a separate query, like I did, so I get the data in a separate list right away? What about my docstring? I just started trying to use them, not sure if the way I’m doing it is good or not.

Asked By: Kevin Németh

||

Answers:

One approach in this sort of situation is to STRING_AGG each of the constituent tables (or sets of tables) first; then, LEFT JOIN those contrived tables onto the main table. This sidesteps the problem of multiplication that can occur during consecutive LEFT JOINs.

In your case, try something like this:

SELECT
       shows.*,
       show_genre_names.genre_names,
       show_actors.actor_ids,
       show_actors.actor_names
FROM
       shows
       LEFT JOIN 
              ( -- one row per show_id
              SELECT
                     sg.show_id,
                     STRING_AGG(g.name, ', ') AS genre_names
              FROM
                     show_genres sg
                     JOIN genres g ON g.id = sg.genre_id
              GROUP BY
                     sg.show_id
              ) show_genre_names
              ON shows.id = show_genre_names.show_id 
       LEFT JOIN
              ( -- one row per show_id
              SELECT
                     sc.show_id,
                     STRING_AGG(a.id, ', ') AS actor_ids,
                     STRING_AGG(a.name, ', ') AS actor_names
              FROM
                     show_characters sc
                     JOIN actors a ON a.id = sc.actor_id
              GROUP BY
                     sc.show_id
              ) show_actors
              ON shows.id = show_actors.show_id
WHERE
       shows.id = 1390
;

You can solve this in other ways, too, but understanding this technique will be helpful in your SQL journey.

Answered By: John K.

Meanwhile I’ve been able to come up with a solution myself, just to have an easier time managing the data returned later, I convert the returned array of arrays into a list of dictionaries.

    @connection_handler
    def get_show_by_id(cursor: 'RealDictCursor', id: int) -> 'RealDictRow':
        """
    
        Args:
            cursor: a cursor which returns dictionaries (use @connection.connection_handler decorator)
            id: number of items shows on a single page
    
        Returns:
            All show details in a RealDictRow + genres(as a concatenated string) + actors in a list
            containing the actors ids and names in a dictionary
    
        """
    
        # ONE QUERY SOLUTION:
        # concatenating actors names and ids together to prevent issue when actors share the same name
        # and worked on the same show
        query = """
        SELECT
           shows.*,
           STRING_AGG(DISTINCT g.name, ',')                AS genres,
           ARRAY_AGG(DISTINCT ARRAY[a.id::VARCHAR, a.name]) AS actors_ids_with_names
        FROM shows
                 LEFT JOIN show_genres sg ON shows.id = sg.show_id
                 LEFT JOIN genres g ON g.id = sg.genre_id
                 LEFT JOIN show_characters sc ON shows.id = sc.show_id
                 LEFT JOIN actors a ON sc.actor_id = a.id
        WHERE shows.id=%s
        GROUP BY shows.id
        """
        val = (id,)
        cursor.execute(query, val)
        show = cursor.fetchone()
        actor = [{"id":actor_id_with_name[0], "name": actor_id_with_name[1] } for actor_id_with_name in show["actors_ids_with_names"]]
        show.pop("actors_ids_with_names")
        show["actors"] = actor
        return show
Answered By: Kevin Németh