Identify the borders and column contours of a table that has no visible outline within an image
Question:
I have a set of images, each containing a table. In some images the table is already aligned and its borders are drawn, so it is not hard to identify the main table using Canny edge detection. In other images, however, the table has no borders at all, so I am trying to identify the table in the image and plot the contours of its borders as well as its columns.
I am using OpenCV version 3.4, and the approach I’m generally taking is as follows:
- dilate the grayscale image to identify the text spots
- apply the cv2.findContours function to get the text’s bounding boxes
- cluster the bounding boxes in case smaller tables were identified instead of the main table
- try to draw the contours in the hope of identifying the borders of the table
This approach seems to work to a certain extent but the drawn contours are not at all accurate.
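The clustering step in the list above can be illustrated in isolation. The following is only a minimal sketch, not the code I am actually running: `group_rows` and its `tol` parameter are hypothetical names, and it assumes boxes are the `(x, y, w, h)` tuples returned by `cv2.boundingRect`. It groups boxes into rows when their top edges lie within `tol` pixels of the first box in the row.

```python
def group_rows(boxes, tol=10):
    """Group (x, y, w, h) boxes into rows.

    `tol` is an assumed pixel tolerance: a box joins the current row when
    its top edge is within `tol` pixels of that row's first box.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):  # top to bottom
        if rows and abs(box[1] - rows[-1][0][1]) <= tol:
            rows[-1].append(box)
        else:
            rows.append([box])  # start a new row
    # Order the cells of each row from left to right
    return [sorted(row, key=lambda b: b[0]) for row in rows]


if __name__ == "__main__":
    sample = [(100, 12, 30, 10), (10, 10, 30, 10),
              (10, 50, 30, 10), (100, 48, 30, 10)]
    for row in group_rows(sample):
        print(row)
```

Comparing against the first box of the row (rather than the previous box) keeps a slowly drifting sequence of boxes from being chained into one very tall row.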
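import cv2

# gray_matrix: the dilated, single-channel image
# output_image: a BGR copy of the input to draw on
# OpenCV 3.x returns three values here (OpenCV 4.x returns only contours and hierarchy)
img, contours, hierarchy = cv2.findContours(gray_matrix, cv2.RETR_LIST,
                                            cv2.CHAIN_APPROX_SIMPLE)

# Get bounding boxes around any text
boxes = []
for contour in contours:
    box = cv2.boundingRect(contour)  # (x, y, w, h)
    boxes.append(box)

# Cluster the bounding boxes by their positions
cell_threshold = 10  # bucket size in pixels
rows = {}
cols = {}
for box in boxes:
    (x, y, w, h) = box
    col_key = x // cell_threshold
    row_key = y // cell_threshold
    cols.setdefault(col_key, []).append(box)
    rows.setdefault(row_key, []).append(box)

# Filter out the clusters (rows) that have fewer than 4 cells
table_cells = list(filter(lambda r: len(r) >= 4, rows.values()))
# Sort the cells of each row by x coordinate
table_cells = [list(sorted(tb)) for tb in table_cells]
# Sort the rows by y coordinate
table_cells = list(sorted(table_cells, key=lambda r: r[0][1]))

# Attempt to identify columns
# Right edge: taken from the row whose last cell is the widest
max_last_col_width_row = max(table_cells, key=lambda b: b[-1][2])
max_x = max_last_col_width_row[-1][0] + max_last_col_width_row[-1][2]
# Bottom edge (assumed): the bottom of the tallest cell in the last row
max_last_row_height_box = max(table_cells[-1], key=lambda b: b[3])
max_y = max_last_row_height_box[1] + max_last_row_height_box[3]

hor_lines = []
ver_lines = []
# One horizontal line along the top of each row
for row in table_cells:
    x = row[0][0]
    y = row[0][1]
    hor_lines.append((x, y, max_x, y))
# One vertical line at the left edge of each cell in the first row
for box in table_cells[0]:
    x = box[0]
    y = box[1]
    ver_lines.append((x, y, x, max_y))
# Close the table on the right and at the bottom
(x, y, w, h) = table_cells[0][-1]
ver_lines.append((max_x, y, max_x, max_y))
(x, y, w, h) = table_cells[0][0]
hor_lines.append((x, max_y, max_x, max_y))

for line in hor_lines + ver_lines:
    [x1, y1, x2, y2] = line
    cv2.line(output_image, (x1, y1), (x2, y2), (0, 0, 255), 1)
cv2.imshow('Proper Table Borders', output_image)
cv2.waitKey(0)  # keep the window open until a key is pressed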
I am trying to achieve something like the image below.
In short, how can I find the invisible borders of a table-structure in an image as well as identify the x coordinates of the identified table’s columns?
I know the above code is far from optimal for producing the required outcome, but I am still learning OpenCV, so I have been trying various approaches and have not yet reached the desired result.
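To make the column part of the question concrete, here is a rough sketch of one possible direction, not a verified solution: merge the horizontal extents of all text bounding boxes and treat the gaps between the merged extents as candidate column separators. `column_separators` and `min_gap` are hypothetical names, and the boxes are assumed to be `(x, y, w, h)` tuples as returned by `cv2.boundingRect`. It would only work when the columns are separated by whitespace that no text box crosses.

```python
def column_separators(boxes, min_gap=5):
    """Return x positions in the middle of the gaps between merged box extents.

    `min_gap` is an assumed minimum gap width (in pixels) for a gap to
    count as a column separator.
    """
    spans = sorted((x, x + w) for x, _, w, _ in boxes)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous extent: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(left[1] + right[0]) // 2
            for left, right in zip(merged, merged[1:])
            if right[0] - left[1] >= min_gap]


if __name__ == "__main__":
    sample = [(10, 10, 30, 10), (12, 50, 40, 10), (100, 12, 30, 10),
              (105, 48, 20, 10), (200, 10, 20, 10)]
    print(column_separators(sample))
```

A single wide box spanning several columns (for example a title line) would bridge the gaps, so such boxes would have to be filtered out before calling the function.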
Answers: