How to regex extract CAR MAKE from URL in pandas df column

Question:

I am trying to extract from URL str "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte…"
the entire Make name, i.e. "Mercedes-Benz"
BUT my pattern only returns the first letter, i.e. "M"

Please help me come up with the correct pattern to use on pandas df.

Thank you

CODE:
URLS_by_City['Make'] = URLS_by_City['Page'].str.extract('.+([A-Z])w+(?=[/])+', expand=True) Clean_Make = URLS_by_City.dropna(subset=["Make"]) Clean_Make # WENT FROM 5K rows --> to 2688 rows

    Page    City    Pageviews   Unique Pageviews    Avg. Time on Page   Entrances   Bounce Rate % Exit  **Make**
71  /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte...   San Jose    310 149 00:00:27    149 2.00%   47.74%  **B**
103 /used/Audi/2015-Audi-SQ5-286f67180a0e09a872992...   Menlo Park  250 87  00:02:36    82  0.00%   32.40%  **A**
158 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte...   San Francisco   202 98  00:00:18    98  2.04%   48.02%  **B**
165 /used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cf...   San Francisco   194 93  00:00:42    44  2.22%   29.38%  **A**
168 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte...   (not set)   192 91  00:00:11    91  2.20%   47.40%  **B**
... ... ... ... ... ... ... ... ... ...
4995    /used/Subaru/2019-Subaru-Crosstrek-5717b3040a0...   Union City  10  3   00:02:02    0   0.00%   30.00%  **S**
4996    /used/Tesla/2017-Tesla-Model+S-15605a190a0e087...   San Jose    10  5   00:01:29    5   0.00%   50.00%  **T**
4997    /used/Tesla/2018-Tesla-Model+3-0f3ea14d0a0e09a...   Las Vegas   10  4   00:00:09    2   0.00%   40.00%  **T**
4998    /used/Tesla/2018-Tesla-Model+3-0f3ea14d0a0e09a...   Austin  10  4   00:03:29    2   0.00%   40.00%  **T**
4999    /used/Tesla/2018-Tesla-Model+3-5f29cdc70a0e09a...   Orinda  10  4   00:04:00    1   0.00%   0.00%   **T**

TRIED:
`example_url = "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1"
pattern = ".+([a-zA-Z0-9()])w+(?=[/])+"

wanted_make = URLS_by_City['Page'].str.extract(pattern)
wanted_make

`
0
0 r
1 r
2 NaN
3 NaN
4 r
… …
4995 r
4996 l
4997 l
4998 l
4999 l

It worked in regex online tool.

but unfortunately not in my jupyter notebook

EXAMPLE PATTERNS – I bolded what should match:
/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1
/used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cff998e0f96e.htm
/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1
/used/Audi/2021-Audi-RS+5-b92922bd0a0e09a91b4e6e9a29f63e8f.htm
/used/LEXUS/2018-LEXUS-GS+350-dffb145e0a0e09716bd5de4955662450.htm
/used/Porsche/2014-Porsche-Boxster-0423401a0a0e09a9358a179195e076a9.htm
/used/Audi/2014-Audi-A6-1792929d0a0e09b11bc7e218a1fa7563.htm
/used/Honda/2018-Honda-Civic-8e664dd50a0e0a9a43aacb6d1ab64d28.htm
/new-inventory/index.htm?normalFuelType=Hybrid&normalFuelType=Electric
/used-inventory/index.htm
/new-inventory/index.htm
/new-inventory/index.htm?normalFuelType=Hybrid&normalFuelType=Electric
/

Answers:

I would use:

URLS_by_City["Make"] = URLS_by_City["Page"].str.extract(r'([^/]+)/d{4}b')

This targets the URL path segment immediately before the portion with the year. You could also try this version:

URLS_by_City["Make"] = URLS_by_City["Page"].str.extract(r'/[^/]+/([^/]+)')
Answered By: Tim Biegeleisen

I have tried completing your requirement in Jupyter Notebook.

PFB the code and screenshots:

  1. I have created a dummy pandas dataframe(data_df), below is a screenshot of the same

enter image description here

  1. I have created a pattern based on the pattern of the string to be extracted

    pattern = "^/used/(.*)/(?=[20][0-9{2}])"

  2. Used the patten to extract required data from the URLs and saved it in another column in the same dataframe

    data_df[‘Car Maker’] = data_df[‘urls’].str.extract(pattern)

  3. Below is a screenshot of the output

enter image description here

I hope this is helpful..

Answered By: Sachin Nayak

Below code will give you the model & VIN values:

pattern2 = '^/used/[a-zA-Z-]*/([0-9]{4}[a-zA-Z0-9-+]*)-[a-z0-9]*.htm'
pattern3 = '^/used/[a-zA-Z-]*/[0-9]{4}[a-zA-Z0-9-+]*-([a-z0-9]*).htm'

data_df['Model'] = data_df['urls'].str.extract(pattern2)
data_df['VIN'] = data_df['urls'].str.extract(pattern3)

Here is a screenshot of the output:

enter image description here

Answered By: Sachin Nayak
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.