How do I split a string to extract only uppercase string or uppercase followed by float?
Question:
I am using Selenium with Python to scrape some file information. I would like to extract only the file type and the version number, if available, e.g. GML 3.1.1. I’m looking for a split function to do so. My current response is a list that looks like this:
ESRI Shapefile, (50.7 kB)
GML 3.1.1, (124.9 kB)
Google Earth KML 2.1, (126.5 kB)
MapInfo MIF, (53.5 kB)
The script section is as follows:
for file in files:
    file_format = file.text
    print(file_format)
I’m looking for something like strip() that checks whether the word before the comma is uppercase, or uppercase followed by a float. The following is the output I’m looking for:
ESRI
GML 3.1.1
KML 2.1
MIF
Answers:
Using a regex that finds words of all uppercase letters followed optionally by a space and digits / dots would work here:
s='''ESRI Shapefile, (50.7 kB)
GML 3.1.1, (124.9 kB)
Google Earth KML 2.1, (126.5 kB)
MapInfo MIF, (53.5 kB)'''
import re
re.findall(r'\b[A-Z]+\b(?:\s[\d.]+)?', s)
['ESRI', 'GML 3.1.1', 'KML 2.1', 'MIF']
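If you would rather handle each entry inside your existing loop, here is a minimal sketch that applies the same pattern to one string at a time. The names FORMAT_RE and extract_format are made up for illustration, and the list of strings stands in for the file.text values from Selenium. It only looks at the text before the first comma and assumes the first all-uppercase word there is the one you want.

```python
import re

# Same pattern as above: an all-uppercase word, optionally followed by
# whitespace and a run of digits/dots (the version number).
FORMAT_RE = re.compile(r'\b[A-Z]+\b(?:\s[\d.]+)?')

def extract_format(text):
    """Return the uppercase format name (plus version, if any), or None."""
    # Keep only the part before the first comma, dropping the "(50.7 kB)" size.
    head = text.split(',', 1)[0]
    match = FORMAT_RE.search(head)
    return match.group(0) if match else None

# Stand-in for [file.text for file in files] from the Selenium script.
lines = [
    'ESRI Shapefile, (50.7 kB)',
    'GML 3.1.1, (124.9 kB)',
    'Google Earth KML 2.1, (126.5 kB)',
    'MapInfo MIF, (53.5 kB)',
]

formats = [extract_format(line) for line in lines]
print(formats)
```

In your loop this would become print(extract_format(file.text)). Entries with no all-uppercase word before the comma give None, so you may want to skip those.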