Find USA phone numbers in python script
Question:
The following Python script lets me scrape email addresses from a given file using regular expressions.
How could I extend it so that it also picks up phone numbers? Say, either the 7-digit or the 10-digit form (with area code), while also accounting for parentheses?
My current script is below:
import os
import re

# filename variables
filename = 'file.txt'
newfilename = 'result.txt'

# read the file
if os.path.exists(filename):
    with open(filename, 'r') as data:
        bulkemails = data.read()
else:
    print("File not found.")
    raise SystemExit

# regex to match email addresses
r = re.compile(r'(\b[\w.]+@+[\w.]+.+[\w.]\b)')
results = r.findall(bulkemails)
emails = ""
for x in results:
    emails += str(x) + "\n"

# function to write file
def writefile():
    f = open(newfilename, 'w')
    f.write(emails)
    f.close()
    print("File written.")
Regex for phone numbers:
(\d{3}[-.\s]\d{3}[-.\s]\d{4}|\(\d{3}\)\s*\d{3}[-.\s]\d{4}|\d{3}[-.\s]\d{4})
Another regex for phone numbers:
(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?
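As a rough sketch of how the first phone regex could sit next to the email regex in the script, the snippet below runs it with `findall` on a string. The helper name `find_phones` and the sample text are made up for illustration; in the script, the `bulkemails` string would be passed in instead.

```python
import re

# First phone pattern from the question: 10 digits with separators,
# a parenthesised area code, or a bare 7-digit number.
PHONE_RE = re.compile(
    r'(\d{3}[-.\s]\d{3}[-.\s]\d{4}'
    r'|\(\d{3}\)\s*\d{3}[-.\s]\d{4}'
    r'|\d{3}[-.\s]\d{4})'
)

def find_phones(text):
    """Return every phone-like substring found in text (hypothetical helper)."""
    # The pattern has one capturing group around everything,
    # so findall returns plain strings.
    return PHONE_RE.findall(text)

sample = "Call 555-123-4567 or (555) 987-6543; front desk is 555-0199."
phones = find_phones(sample)
print(phones)  # ['555-123-4567', '(555) 987-6543', '555-0199']
```

The results can be joined with newlines and written out the same way the script already handles `emails`.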
Answers:
If you are interested in learning regex, you could take a stab at writing it yourself. It’s not quite as hard as it’s made out to be. Sites like RegexPal let you enter some test data, then write and test a regular expression against that data. Using RegexPal, try adding some phone numbers in the various formats you expect to find them (with brackets, area codes, etc.), grab a regex cheatsheet and see how far you can get. If nothing else, it will help you read other people’s expressions.
Edit:
Here is a modified version of your Regex, which should also match 7 and 10-digit phone numbers that lack any hyphens, spaces or dots. I added question marks after the character classes (the []s), which makes anything within them optional. I tested it in RegexPal, but as I’m still learning Regex, I’m not sure that it’s perfect. Give it a try.
(\d{3}[-.\s]??\d{3}[-.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-.\s]??\d{4}|\d{3}[-.\s]??\d{4})
It matched the following values in RegexPal:
000-000-0000
000 000 0000
000.000.0000
(000)000-0000
(000)000 0000
(000)000.0000
(000) 000-0000
(000) 000 0000
(000) 000.0000
000-0000
000 0000
000.0000
0000000
0000000000
(000)0000000
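The RegexPal check can also be reproduced in Python. This is only a sketch: it uses `fullmatch`, so each sample must be consumed completely by the modified pattern, and the names `PATTERN`, `SAMPLES` and `unmatched` are illustrative.

```python
import re

# Modified pattern from this answer (separators made optional with "??").
PATTERN = re.compile(
    r'(\d{3}[-.\s]??\d{3}[-.\s]??\d{4}'
    r'|\(\d{3}\)\s*\d{3}[-.\s]??\d{4}'
    r'|\d{3}[-.\s]??\d{4})'
)

SAMPLES = [
    "000-000-0000", "000 000 0000", "000.000.0000",
    "(000)000-0000", "(000)000 0000", "(000)000.0000",
    "(000) 000-0000", "(000) 000 0000", "(000) 000.0000",
    "000-0000", "000 0000", "000.0000",
    "0000000", "0000000000", "(000)0000000",
]

# Collect any sample the pattern fails to match in full.
unmatched = [s for s in SAMPLES if PATTERN.fullmatch(s) is None]
print(unmatched)  # []
```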
This is the process of building a phone number scraping regex.
First, we need to match an area code (3 digits), a trunk (3 digits), and an extension (4 digits):
reg = re.compile(r"\d{3}\d{3}\d{4}")
Now, we want to capture the matched phone number, so we add parentheses around the part that we’re interested in capturing (all of it):
reg = re.compile(r"(\d{3}\d{3}\d{4})")
The area code, trunk, and extension might be separated by up to 3 characters that are not digits (such as when spaces are used along with the hyphen/dot delimiter):
reg = re.compile(r"(\d{3}\D{0,3}\d{3}\D{0,3}\d{4})")
Now, the phone number might actually start with a ( character (if the area code is enclosed in parentheses):
reg = re.compile(r"(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?")
Now that whole phone number is likely embedded in a bunch of other text:
reg = re.compile(r".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?")
Now, that other text might include newlines:
reg = re.compile(r".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)
Enjoy!
I personally stop here, but if you really want to be sure that only spaces, hyphens, and dots are used as delimiters then you could try the following (untested):
reg = re.compile(r".*?(\(?\d{3}\)? ?[.-]? ?\d{3} ?[.-]? ?\d{4}).*?", re.S)
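Since the last pattern is marked untested, here is a small sketch that tries it on made-up text. It also compares the pattern with and without the surrounding `.*?`: when `findall` is used, the scan over the surrounding text happens anyway, so on this sample both forms return the same captures. The names `STRICT` and `WRAPPED` and the sample numbers are assumptions for the example.

```python
import re

# Stricter pattern from the answer, without the surrounding ".*?".
STRICT = re.compile(r"(\(?\d{3}\)? ?[.-]? ?\d{3} ?[.-]? ?\d{4})")
# The pattern as written in the answer, kept for comparison.
WRAPPED = re.compile(
    r".*?(\(?\d{3}\)? ?[.-]? ?\d{3} ?[.-]? ?\d{4}).*?", re.S
)

text = "Office: (202) 555-0143\nCell: 202.555.0199, fax 202 - 555 - 0111"

found = STRICT.findall(text)
print(found)
# ['(202) 555-0143', '202.555.0199', '202 - 555 - 0111']

# findall returns the captured group, so both forms agree here.
print(WRAPPED.findall(text) == found)  # True
```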
I think this regex is very simple for parsing phone numbers:
re.findall(r"[(][\d]{3}[)][ ]?[\d]{3}-[\d]{4}", lines)
For Spanish phone numbers I use this with considerable success:
re.findall(r'[697]\d{1,2}.\d{2,3}.\d{2,3}.\d{0,2}', text)
You can check http://regex.inginf.units.it/. Given some training data and target matches, it constructs an appropriate regex for you. It is not always perfect (check the F-score). Let’s try it with 15 examples:
re.findall(r"\w\d \w\w \w\w \w\w \w\d|(?<=[^\d][^_][^_] )[^_]\d[^ ]\d[^ ][^ ]+|(?<= [^<]\w\w \w\w[^:]\w[^_][^ ][^,][^_] )(?: *[^<]\d+)+",
"""Lorem ipsum © 04-42-00-00-00 dolor 1901 sit amet, consectetur +33 (0)4 42 00 00 00 adipisicing elit. 2016 Sapiente dicta fugit fugiat hic 04 42 00 00 00 aliquam itaque 04.42.00.00.00 facere, 13205 number: 100 000 000 00013 soluta. 4 Totam id dolores!""")
returns ['04 42 00 00 00', '04.42.00.00.00', '04-42-00-00-00', '50498,']
Add more examples to gain precision.
Since nobody has posted this regex yet, I will. This is what I use to find phone numbers. It matches all the regular phone number formats you see in the United States. I did not need it to match international numbers, so I didn’t adjust the regex for that purpose.
phone_number_regex_pattern = r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}"
Use this pattern if you also want simple phone numbers with no characters in between to match. An example of this would be: “4441234567”.
phone_number_regex_pattern = r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"
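To make the difference between the two patterns concrete, this sketch runs both against a few made-up candidates with `fullmatch`. `STRICT` and `LOOSE` are illustrative names for the first and second pattern.

```python
import re

# First pattern: a separator is required between the digit groups.
STRICT = re.compile(r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}")
# Second pattern: the separators are optional.
LOOSE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

candidates = ["444-123-4567", "(444) 123-4567", "(444)123-4567", "4441234567"]

strict_hits = [c for c in candidates if STRICT.fullmatch(c)]
loose_hits = [c for c in candidates if LOOSE.fullmatch(c)]

print(strict_hits)  # ['444-123-4567', '(444) 123-4567']
print(loose_hits)   # all four candidates
```

Both patterns expect ten digits, so a bare 7-digit number such as 123-4567 is matched by neither.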
Below is an extension of the answers above. This regex can also detect a country code:
((?:\+\d{2}[-.\s]??|\d{4}[-.\s]??)?(?:\d{3}[-.\s]??\d{3}[-.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-.\s]??\d{4}|\d{3}[-.\s]??\d{4}))
It can detect the samples below:
000-000-0000
000 000 0000
000.000.0000
(000)000-0000
(000)000 0000
(000)000.0000
(000) 000-0000
(000) 000 0000
(000) 000.0000
000-0000
000 0000
000.0000
0000000
0000000000
(000)0000000
# Detect phone numbers with country code
+00 000 000 0000
+00.000.000.0000
+00-000-000-0000
+000000000000
0000 0000000000
0000-000-000-0000
00000000000000
+00 (000)000 0000
0000 (000)000-0000
0000(000)000-0000
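A minimal sketch of using this pattern with `findall`, assuming made-up sample text; `COUNTRY_RE` is an illustrative name. Because the whole expression is wrapped in one capturing group, `findall` returns each number as a single string, prefix included.

```python
import re

COUNTRY_RE = re.compile(
    r'((?:\+\d{2}[-.\s]??|\d{4}[-.\s]??)?'   # optional +NN or NNNN prefix
    r'(?:\d{3}[-.\s]??\d{3}[-.\s]??\d{4}'     # 10-digit number
    r'|\(\d{3}\)\s*\d{3}[-.\s]??\d{4}'        # (area code) plus 7 digits
    r'|\d{3}[-.\s]??\d{4}))'                  # bare 7-digit number
)

text = "intl +12 345 678 9012, local 345-6789"
numbers = COUNTRY_RE.findall(text)
print(numbers)  # ['+12 345 678 9012', '345-6789']
```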
Updated as of 03.05.2022:
I fixed some issues in the phone number detection regex above; you can find it in the link below. Extend the regex to include more country codes.
import re

# Search for a phone number using a regex in Python.
# Form the regex according to the output you expect.
# With this you can get a single mobile number.
phoneRegex = re.compile(r"\d\d\d-\d\d\d-\d\d\d\d")
Mobile = phoneRegex.search("my number is 123-456-6789")
print(Mobile.group())
Output: 123-456-6789
phoneRegex1 = re.compile(r"(\d\d\d-)?\d\d\d-\d\d\d\d")
Mobile1 = phoneRegex1.search("my number is 123-456-6789")
print(Mobile1.group())
Output: 123-456-6789
Mobile1 = phoneRegex1.search("my number is 456-6789")
print(Mobile1.group())
Output: 456-6789
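`search` stops at the first match. To collect every number in a text, `findall` is one option, with a caveat: when the pattern contains a capturing group, `findall` returns the group's text instead of the whole match. The sketch below therefore swaps the optional area-code group for a non-capturing one, `(?:...)`; the sample text and variable names are made up.

```python
import re

# Same shape as phoneRegex1, but with a non-capturing group so that
# findall returns whole matches.
phone_regex = re.compile(r"(?:\d{3}-)?\d{3}-\d{4}")

text = "office 123-456-6789, home 456-6789"

first = phone_regex.search(text).group()   # first match only
all_numbers = phone_regex.findall(text)    # every match

print(first)        # 123-456-6789
print(all_numbers)  # ['123-456-6789', '456-6789']
```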
While these are simple solutions they are all incorrect for North America. The problem lies in the fact that area-code and exchange numbers cannot start with a zero or a one.
r"(\(?[2-9]\d{2}\)?[ -])?[2-9]\d{2}-\d{4}"
would be the correct way to parse a 7 or 10-digit phone number.
(202) 555-4111
(202)-555-4111
202-555-4111
555-4111
will all parse correctly.
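A short sketch of the rule in action, using `fullmatch` so the whole candidate has to fit the pattern. The helper `is_nanp_number` and the invalid examples are assumptions added for illustration.

```python
import re

NANP_RE = re.compile(r"(\(?[2-9]\d{2}\)?[ -])?[2-9]\d{2}-\d{4}")

def is_nanp_number(candidate):
    """True if the whole string fits the pattern (hypothetical helper)."""
    return NANP_RE.fullmatch(candidate) is not None

valid = ["(202) 555-4111", "(202)-555-4111", "202-555-4111", "555-4111"]
invalid = ["123-555-4111", "202-155-4111", "055-4111"]

print([is_nanp_number(n) for n in valid])    # [True, True, True, True]
print([is_nanp_number(n) for n in invalid])  # [False, False, False]
```

Note that `search` is more forgiving than `fullmatch`: on "123-555-4111" it skips the invalid area code and still returns the trailing "555-4111", so anchoring or `fullmatch` is worth considering when validating rather than scraping.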
Use this code to find numbers like "416-676-4560":
doc = browser.page_source
phones = re.findall(r'[\d]{3}-[\d]{3}-[\d]{4}', doc)