Removing Duplicate Domain URLs From the Text File Using Bash
Question:
Text file:
https://www.google.com/1/
https://www.google.com/2/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/
https://www.bing.com/3/
Expected Output:
https://www.google.com/1/
https://www.bing.com
What I Tried
awk -F'/' '!a[$3]++' $file;
Output
https://www.google.com/1/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/
I have already tried various approaches and none of them work as expected. I just want to keep one URL per domain from the list.
How can I do this with a Bash script or with Python?
PS: I want to filter and save the full URLs from the list, not only the root domains.
Answers:
With awk and / as the field separator (the third field is the host, so !seen[$3]++ prints a line only the first time its host is seen):
awk -F '/' '!seen[$3]++' file
If your file contains Windows line breaks (carriage returns), the trailing \r ends up in the host field of lines such as https://www.google.com, so the same domain produces two different keys; in that case I suggest:
dos2unix < file | awk -F '/' '!seen[$3]++'
Output:
https://www.google.com/1/
https://www.bing.com
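If dos2unix is not available, a roughly equivalent sketch (assuming the only Windows artifact is a trailing carriage return on each line) strips the \r inside awk before the fields are used; modifying $0 makes awk re-split the fields:

awk -F '/' '{ sub(/\r$/, "") } !seen[$3]++' file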
Python solution using one of the Itertools Recipes (unique_everseen) together with urllib.parse.urlparse. Let the content of file.txt be
https://www.google.com/1/
https://www.google.com/2/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/
https://www.bing.com/3/
then
from itertools import filterfalse
from urllib.parse import urlparse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBcCAD', str.lower) --> A B c D
    seen = set()
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen.add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen.add(k)
                yield element

def get_netloc(url):
    return urlparse(url).netloc

with open("file.txt", "r") as fin:
    with open("file_uniq.txt", "w") as fout:
        for line in unique_everseen(fin, key=get_netloc):
            fout.write(line)
This creates the file file_uniq.txt with the following content:
https://www.google.com/1/
https://www.bing.com
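If the unique_everseen recipe feels heavy for this task, the same idea can be written with a plain set keyed on the netloc. This is only a sketch of an alternative, reusing the file names from above:

from urllib.parse import urlparse

seen = set()
with open("file.txt") as fin, open("file_uniq.txt", "w") as fout:
    for line in fin:
        netloc = urlparse(line.strip()).netloc  # e.g. "www.google.com"
        if netloc and netloc not in seen:       # keep only the first URL per domain
            seen.add(netloc)
            fout.write(line)                    # write the original full URL

Because line.strip() removes trailing newlines and carriage returns before parsing, this variant also copes with Windows line endings.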
Text file
https://www.google.com/1/
https://www.google.com/2/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/
https://www.bing.com/3/
Expected Output:
https://www.google.com/1/
https://www.bing.com
What I Tried
awk -F'/' '!a[$3]++' $file;
Output
https://www.google.com/1/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/
I already tried various codes and none of them work as expected. I just want to pick only one unique domain URL per domain from the list.
Please tell me how I can do it by using the Bash script or Python.
PS: I want to filter and save full URLs from the list and not only the root domain.
With awk
and /
as field separator:
awk -F '/' '!seen[$3]++' file
If your file contains Windows line breaks (carriage returns) then I suggest:
dos2unix < file | awk -F '/' '!seen[$3]++'
Output:
https://www.google.com/1/ https://www.bing.com
Python solution using one of Itertools Recipes and urllib.parse.urlparse
, let file.txt
content be
https://www.google.com/1/
https://www.google.com/2/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/
https://www.bing.com/3/
then
from itertools import filterfalse
from urllib.parse import urlparse
def unique_everseen(iterable, key=None):
"List unique elements, preserving order. Remember all elements ever seen."
# unique_everseen('AAAABBBCCDAABBB') --> A B C D
# unique_everseen('ABBcCAD', str.lower) --> A B c D
seen = set()
if key is None:
for element in filterfalse(seen.__contains__, iterable):
seen.add(element)
yield element
else:
for element in iterable:
k = key(element)
if k not in seen:
seen.add(k)
yield element
def get_netloc(url):
return urlparse(url).netloc
with open("file.txt","r") as fin:
with open("file_uniq.txt","w") as fout:
for line in unique_everseen(fin,key=get_netloc):
fout.write(line)
creates file file_uniq.txt
with following content
https://www.google.com/1/
https://www.bing.com