UnicodeDecodeError for a Python script to extract URLs from a batch of JSON files
Question:
I’m trying to make a script for extracting URLs from a batch of JSON files but I’m getting this error I can’t figure out how to resolve:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 7837: character maps to <undefined>
It might be related to encoding but I’m not sure which one is needed and how to set it.
import json
import os
import re
# Directory containing JSON files
json_dir = r'C:/Users/manis/Downloads/cH/URLextract/Italy'
# Regular expression pattern for URLs
url_pattern = r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+'
# Loop through all JSON files in the directory
for filename in os.listdir(json_dir):
    if filename.endswith('.json'):
        with open(os.path.join(json_dir, filename), 'r') as f:
            # Load JSON data
            data = json.load(f)
        # Convert data to string
        data_str = json.dumps(data)
        # Extract URLs using regular expression pattern
        urls = re.findall(url_pattern, data_str)
        # Print URLs
        print(urls)
Answers:
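The 'charmap' codec in the error message most likely means Python decoded the file with the Windows default code page instead of the encoding the file was actually saved in. That happens because open() was called without an encoding argument, so it fell back to the system's locale encoding. You can specify the encoding explicitly when you open the file, using the encoding parameter.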
For example, you can try changing this line:
with open(os.path.join(json_dir, filename), 'r') as f:
to:
with open(os.path.join(json_dir, filename), 'r', encoding='utf-8') as f:
This specifies that the file should be read using the UTF-8 encoding, which is a common encoding for JSON files. You may need to use a different encoding if the files are encoded differently.
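Putting this together, here is a minimal sketch of the whole script with the encoding passed explicitly. The function names extract_urls and extract_urls_from_dir are made up for illustration, and it assumes every file really is UTF-8. The URL pattern is the one from the question, so a match stops at the first character outside its character set, such as a slash.

```python
import json
import re
from pathlib import Path

# Pattern from the question, with the backslashes for \w and \d in place
URL_PATTERN = re.compile(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+')


def extract_urls(path, encoding="utf-8"):
    """Return all matches of URL_PATTERN found in one JSON file."""
    # Passing encoding explicitly avoids the platform default code page
    with open(path, "r", encoding=encoding) as f:
        data = json.load(f)
    # Serialise back to a string so the regex can scan every value
    return URL_PATTERN.findall(json.dumps(data))


def extract_urls_from_dir(json_dir, encoding="utf-8"):
    """Map each .json filename in json_dir to the list of URLs found in it."""
    results = {}
    for path in sorted(Path(json_dir).glob("*.json")):
        results[path.name] = extract_urls(path, encoding=encoding)
    return results
```

Calling extract_urls_from_dir(json_dir) with the directory from the question would return a dictionary keyed by filename, which you can print or write out as needed.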
If you’re still having trouble, you can try catching the UnicodeDecodeError exception and printing out the filename that caused the error, like this:
with open(os.path.join(json_dir, filename), 'r', encoding='utf-8') as f:
    try:
        data = json.load(f)
    except UnicodeDecodeError:
        print(f"Error reading file: {filename}")
        continue  # inside the for loop: skip this file so old or undefined data is not used
This will print the filename of any file that causes a UnicodeDecodeError when reading.
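If some files turn out not to be UTF-8, one possible approach is to read the raw bytes and try a short list of candidate encodings in order. This is only a sketch: the helper name load_json_with_fallback and the candidate list are my own assumptions, not something the question confirms, and a legacy code page can decode bytes "successfully" into the wrong characters, so check the output.

```python
import json


def load_json_with_fallback(path, encodings=("utf-8-sig", "cp1252")):
    """Try each candidate encoding in turn; return (data, encoding_used).

    The default candidates are illustrative guesses. "utf-8-sig" reads
    UTF-8 with or without a byte order mark.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            # Decode first, then parse, so either step can reject a candidate
            return json.loads(raw.decode(enc)), enc
        except (UnicodeDecodeError, json.JSONDecodeError):
            continue
    raise ValueError(f"Could not decode {path} with any of {encodings}")
```

The second item of the returned tuple tells you which encoding worked for each file, which makes it easier to spot the files that need attention.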