Removing Duplicate Domain URLs From the Text File Using Bash
Question:
Text file:
https://www.google.com/1/
https://www.google.com/2/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/
https://www.bing.com/3/
Expected Output:
https://www.google.com/1/
https://www.bing.com
What I Tried
awk -F'/' '!a[$3]++' $file;
Output
https://www.google.com/1/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/
I have already tried various approaches and none of them work as expected. I just want to keep one URL per domain from the list.
How can I do this with a Bash script or with Python?
PS: I want to filter and save the full URLs from the list, not only the root domains.
Answers:
With awk and / as the field separator (the third field is the host, so !seen[$3]++ prints a line only the first time its host is seen):
awk -F '/' '!seen[$3]++' file
If your file contains Windows line breaks (carriage returns), the trailing \r ends up in the host field of lines such as https://www.google.com, so the same domain produces two different keys; in that case I suggest:
dos2unix < file | awk -F '/' '!seen[$3]++'
Output:
https://www.google.com/1/
https://www.bing.com
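If dos2unix is not available, a roughly equivalent sketch (assuming the only Windows artifact is a trailing carriage return on each line) strips the \r inside awk before the fields are used; modifying $0 makes awk re-split the fields:

awk -F '/' '{ sub(/\r$/, "") } !seen[$3]++' file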
Python solution using one of the Itertools Recipes (unique_everseen) together with urllib.parse.urlparse. Let the content of file.txt be
https://www.google.com/1/
https://www.google.com/2/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/
https://www.bing.com/3/
then
from itertools import filterfalse
from urllib.parse import urlparse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBcCAD', str.lower) --> A B c D
    seen = set()
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen.add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen.add(k)
                yield element

def get_netloc(url):
    return urlparse(url).netloc

with open("file.txt", "r") as fin:
    with open("file_uniq.txt", "w") as fout:
        for line in unique_everseen(fin, key=get_netloc):
            fout.write(line)
This creates the file file_uniq.txt with the following content:
https://www.google.com/1/
https://www.bing.com
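If the unique_everseen recipe feels heavy for this task, the same idea can be written with a plain set keyed on the netloc. This is only a sketch of an alternative, reusing the file names from above:

from urllib.parse import urlparse

seen = set()
with open("file.txt") as fin, open("file_uniq.txt", "w") as fout:
    for line in fin:
        netloc = urlparse(line.strip()).netloc  # e.g. "www.google.com"
        if netloc and netloc not in seen:       # keep only the first URL per domain
            seen.add(netloc)
            fout.write(line)                    # write the original full URL

Because line.strip() removes trailing newlines and carriage returns before parsing, this variant also copes with Windows line endings.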
Text file
https://www.google.com/1/
https://www.google.com/2/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/
https://www.bing.com/3/
Expected Output:
https://www.google.com/1/
https://www.bing.com
What I Tried
awk -F'/' '!a[$3]++' $file;
Output
https://www.google.com/1/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/
I already tried various codes and none of them work as expected. I just want to pick only one unique domain URL per domain from the list.
Please tell me how I can do it by using the Bash script or Python.
PS: I want to filter and save full URLs from the list and not only the root domain.
With awk
and /
as field separator:
awk -F '/' '!seen[$3]++' file
If your file contains Windows line breaks (carriage returns) then I suggest:
dos2unix < file | awk -F '/' '!seen[$3]++'
Output:
https://www.google.com/1/ https://www.bing.com
Python solution using one of Itertools Recipes and urllib.parse.urlparse
, let file.txt
content be
https://www.google.com/1/
https://www.google.com/2/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/
https://www.bing.com/3/
then
from itertools import filterfalse
from urllib.parse import urlparse
def unique_everseen(iterable, key=None):
"List unique elements, preserving order. Remember all elements ever seen."
# unique_everseen('AAAABBBCCDAABBB') --> A B C D
# unique_everseen('ABBcCAD', str.lower) --> A B c D
seen = set()
if key is None:
for element in filterfalse(seen.__contains__, iterable):
seen.add(element)
yield element
else:
for element in iterable:
k = key(element)
if k not in seen:
seen.add(k)
yield element
def get_netloc(url):
return urlparse(url).netloc
with open("file.txt","r") as fin:
with open("file_uniq.txt","w") as fout:
for line in unique_everseen(fin,key=get_netloc):
fout.write(line)
creates file file_uniq.txt
with following content
https://www.google.com/1/
https://www.bing.com