Understanding Regex Patterns for URL Manipulation
Introduction
In this article, we’ll explore how to manipulate URLs using regular expressions (regex) in Python. We’ll focus on the basics of regex patterns and apply them to extract domain information from URLs.
What is a Regular Expression?
A regular expression (regex) is a pattern used to match character combinations in strings. Regex patterns are used extensively in text processing, data validation, and extraction tasks.
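As a quick illustration of the idea, the sketch below applies one deliberately simple pattern with three functions from Python's re module. The sample sentence and the pattern are assumptions made up for this example; the pattern is far too loose to serve as a real domain validator.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Visit example.org or hello.tv today"

# One or more word characters, a literal dot, then 2 to 5 lowercase letters.
pattern = re.compile(r"\w+\.[a-z]{2,5}")

first = pattern.search(text)      # first match anywhere in the string
all_hits = pattern.findall(text)  # every non-overlapping match, as strings
at_start = pattern.match(text)    # only succeeds if the match begins at index 0

print(first.group())  # example.org
print(all_hits)       # ['example.org', 'hello.tv']
print(at_start)       # None, because the text starts with "Visit "
```

The difference between search (match anywhere) and match (match at the start only) matters later, when the same two functions are applied to URL strings.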
The Problem: Manipulating URLs with Regex
The task is to normalize URLs with regex patterns: add a scheme (http or https) where it is missing and, in some cases, extend the path. The input is a pandas Series of messy data that mixes full URLs, bare domains, and placeholder strings such as 'None' that are not URLs at all.
import re
import pandas as pd
import numpy as np
# Example URL data
example = pd.Series(['None', 'http://fakeurl.com/example/fakeurl',
                     'https://www.qwer.com/example/qwer', 'None',
                     'test.com/example/test', 'None', '123135123',
                     'nourlhere', 'lol', 'hello.tv', 'nolink',
                     'ihavenowebsite.com'])
Solving the Problem: Refactored Code
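The refactored code first strips any existing scheme and www. prefix with plain string replacement, then uses a regex to decide which entries look like a domain, and finally uses urllib.parse to inspect the parsed result.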
import re
import urllib.parse
import pandas as pd
import numpy as np
# Example URL data
example = pd.Series(['None', 'http://fakeurl.com/example/fakeurl',
                     'https://www.qwer.com/example/qwer', 'None',
                     'test.com/example/test', 'None', '123135123',
                     'nourlhere', 'lol', 'hello.tv', 'nolink',
                     'ihavenowebsite.com'])
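# Regex patterns
# re1: a run of allowed characters, a literal dot, then 2 to 5 lowercase letters (a TLD-like suffix)
re1 = r'([-a-zA-Z0-9\u0080-\u024F@:%._\+~#=]{1,256})\.([a-z]{2,5})([/.]*)'
# re3: a literal "www." followed by word characters
re3 = r'www\.([\w]*)'
def modurl(s):
    # Append "<domain>.<tld>" only when the URL has a host and its path is exactly "/example"
    u = urllib.parse.urlparse(s)
    if u.netloc == "" or u.path != "/example":
        return s
    labels = u.netloc.split('.')
    return f"{s}/{labels[-2]}.{labels[-1]}"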
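# URL manipulation
example = (example
    # Strip any existing scheme and "www." so every entry starts from the same form
    .map(lambda x: x.replace('https://www.', ''))
    .map(lambda x: x.replace('www.', ''))
    .map(lambda x: x.replace('https://', ''))
    .map(lambda x: x.replace('http://', ''))
    # Prefix entries that look like a domain; leave everything else untouched
    .map(lambda x: "http://www." + x if re.search(re1, x) else x)
    # re.match only matches at the start of the string, so entries already prefixed with "http://" pass through unchanged
    .map(lambda x: f"{x}/{urllib.parse.urlparse(x).netloc.split('.')[-2]}.{urllib.parse.urlparse(x).netloc.split('.')[-1]}" if re.match(re3, x) else x)
    .map(modurl)
)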
# Print the result
print(example.to_string())
Explanation
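The refactored code uses urllib.parse to break URLs into their constituent parts, such as the scheme (protocol), network location (domain), and path. The regex patterns decide which entries look like URLs in the first place; the rewriting itself is done with string replacement and string formatting.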
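To make the role of urllib.parse concrete, here is a small sketch of what urlparse returns for an entry with a scheme and for the same entry without one. The two input strings are taken from the sample data (the first with the prefix the pipeline adds); the variable names are chosen only for this example.

```python
from urllib.parse import urlparse

with_scheme = urlparse("http://www.test.com/example/test")
without_scheme = urlparse("test.com/example/test")

# With a scheme, the host lands in netloc and the rest in path.
print(with_scheme.scheme)   # http
print(with_scheme.netloc)   # www.test.com
print(with_scheme.path)     # /example/test

# Without a scheme, urlparse treats the whole string as a path,
# so netloc is empty.
print(repr(without_scheme.netloc))  # ''
print(without_scheme.path)          # test.com/example/test
```

This is why the pipeline adds the http:// prefix before calling modurl: an entry without a scheme has an empty netloc, and modurl returns such entries unchanged.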
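Regex Pattern 1 (re1): Matches a run of 1 to 256 allowed characters, followed by a literal dot and 2 to 5 lowercase letters (a TLD-like suffix), optionally followed by slashes or dots. Used with re.search, it acts as a loose test for whether an entry contains something that looks like a domain.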
([-a-zA-Z0-9\u0080-\u024F@:%._\+~#=]{1,256})\.([a-z]{2,5})([/.]*)
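The sketch below runs this pattern over a few entries from the sample data to show which ones it accepts. The entry 'report.txt' is not in the original data; it is added here as an assumption to show that the pattern is a loose heuristic and also accepts file names.

```python
import re

# Same pattern as re1 in the article.
re1 = r'([-a-zA-Z0-9\u0080-\u024F@:%._\+~#=]{1,256})\.([a-z]{2,5})([/.]*)'

samples = ['None', 'test.com/example/test', '123135123', 'hello.tv', 'nolink']
looks_like_url = {s: bool(re.search(re1, s)) for s in samples}
print(looks_like_url)

# The three capture groups for one matching entry.
groups = re.search(re1, 'test.com/example/test').groups()
print(groups)  # ('test', 'com', '/')

# Hypothetical entry: a file name passes the same test.
false_positive = bool(re.search(re1, 'report.txt'))
print(false_positive)  # True
```

Only the entries containing a dot followed by lowercase letters pass, which is how the pipeline separates 'hello.tv' from placeholders such as 'None' or 'nolink'.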
Regex Pattern 2 (re3): Matches a literal www. followed by zero or more word characters, and captures the characters that follow the prefix.
www\.([\w]*)
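How this pattern behaves depends on which re function applies it. The sketch below compares re.match and re.search on a prefixed URL and on a bare host name; the input strings are illustrative and the variable names are assumptions for this example.

```python
import re

# Same pattern as re3 in the article.
re3 = r'www\.([\w]*)'

url = "http://www.test.com/example/test"

anchored = re.match(re3, url)     # must match at index 0, where "http" is
unanchored = re.search(re3, url)  # may match anywhere in the string
bare = re.match(re3, "www.test.com")

print(anchored)             # None
print(unanchored.group(1))  # test
print(bare.group(1))        # test
```

Because the pipeline calls re.match on strings that already begin with http://, this check does not fire for any entry in the sample data.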
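The modurl function does not use the regex patterns; it relies on urllib.parse.urlparse. If the parsed URL has no network location, or its path is anything other than exactly /example, it returns the URL unchanged. Otherwise it appends the last two labels of the host name, producing {original_url}/{domain}.{tld}. The regex checks happen earlier, in the chained .map calls: re.search with re1 decides whether to prefix an entry with http://www., and re.match with re3 looks for entries that start with www. at the beginning of the string. In the sample data no path equals /example exactly, so modurl leaves every entry as it is.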
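For readers who prefer to see the whole flow in one place, here is a minimal sketch of a single function that mirrors the pipeline's behavior on the sample data. The name normalize and the constant DOMAIN_RE are hypothetical, introduced only for this example, and unusual inputs may be handled differently than by the chained version.

```python
import re
from urllib.parse import urlparse

# Same pattern as re1 in the article, compiled once.
DOMAIN_RE = re.compile(
    r'([-a-zA-Z0-9\u0080-\u024F@:%._\+~#=]{1,256})\.([a-z]{2,5})([/.]*)'
)

def normalize(entry: str) -> str:
    """Hypothetical helper mirroring the chained .map calls for one entry."""
    # Strip any existing scheme and "www." in the same order as the pipeline.
    for prefix in ('https://www.', 'www.', 'https://', 'http://'):
        entry = entry.replace(prefix, '')
    # Entries that do not look like a domain are returned unchanged.
    if not DOMAIN_RE.search(entry):
        return entry
    url = 'http://www.' + entry
    parts = urlparse(url)
    # Mirror modurl: only a path of exactly "/example" gets the domain appended.
    if parts.path == '/example':
        labels = parts.netloc.split('.')
        url = f"{url}/{labels[-2]}.{labels[-1]}"
    return url

print(normalize('https://www.qwer.com/example/qwer'))  # http://www.qwer.com/example/qwer
print(normalize('hello.tv'))                           # http://www.hello.tv
print(normalize('None'))                               # None
print(normalize('test.com/example'))                   # http://www.test.com/example/test.com
```

The last call uses an input that is not in the sample data; it is there to show the one case in which the domain is appended to the path.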
Conclusion
In this article, we explored how to manipulate URLs using regex patterns in Python. We used a regex to detect which entries in messy data contain a domain, added a scheme to those entries, and left non-URL entries untouched. The pipeline rewrites every recognized entry to start with http://www., so URLs that originally used https come out as http. The refactored code uses urllib.parse to inspect the parts of each URL rather than parsing them by hand.
Last modified on 2023-12-01