UnicodeDecodeError for a Python script to extract URLs from a batch of JSON files

Question:

I’m trying to make a script for extracting URLs from a batch of JSON files but I’m getting this error I can’t figure out how to resolve:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 7837: character maps to <undefined>

It might be related to encoding but I’m not sure which one is needed and how to set it.

import json
import os
import re

# Directory containing JSON files
json_dir = r'C:/Users/manis/Downloads/cH/URLextract/Italy'

# Regular expression pattern for URLs
url_pattern = r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+'

# Loop through all JSON files in the directory
for filename in os.listdir(json_dir):
    if filename.endswith('.json'):
        with open(os.path.join(json_dir, filename), 'r') as f:
            # Load JSON data
            data = json.load(f)

            # Convert data to string
            data_str = json.dumps(data)

            # Extract URLs using regular expression pattern
            urls = re.findall(url_pattern, data_str)

            # Print URLs
            print(urls)
Asked By: Vlad G


Answers:

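The 'charmap' codec in the error message means Python decoded the file with the platform's default encoding (on Windows, a legacy ANSI code page), because open() was called without an encoding argument. Byte 0x98 has no character assigned in that code page, so decoding fails. You can fix this by specifying the encoding explicitly with the encoding parameter when you open the file.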

For example, you can try changing this line:

with open(os.path.join(json_dir, filename), 'r') as f:

to:

with open(os.path.join(json_dir, filename), 'r', encoding='utf-8') as f:

This reads the file as UTF-8, which is the standard encoding for JSON. If the files were saved in a different encoding, pass that encoding's name instead.
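If the encoding of the files is not known, one option is to try a few candidates in turn. The following is a minimal sketch rather than a definitive approach: the helper name load_json_with_fallback and the default candidate list are assumptions chosen for illustration. Keep in mind that a wrong encoding can still decode without raising an error and silently produce garbled text, so the result is worth checking.

```python
import json
from pathlib import Path


def load_json_with_fallback(path, encodings=("utf-8-sig", "cp1252")):
    """Parse a JSON file, trying each candidate encoding in order.

    Hypothetical helper: the candidate list is an assumption, adjust it
    to the encodings you actually expect. "utf-8-sig" reads plain UTF-8
    and also skips a leading byte order mark if one is present.
    Returns a (data, encoding_used) tuple.
    """
    raw = Path(path).read_bytes()
    for enc in encodings:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            # This candidate cannot decode the bytes; try the next one.
            continue
        return json.loads(text), enc
    raise ValueError(f"None of {encodings} could decode {path}")
```

The second element of the returned tuple shows which encoding succeeded, which can help decide what to pass to open() for the rest of the batch.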

If you’re still having trouble, you can try catching the UnicodeDecodeError exception and printing out the filename that caused the error, like this:

with open(os.path.join(json_dir, filename), 'r', encoding='utf-8') as f:
    try:
        data = json.load(f)
    except UnicodeDecodeError:
        print(f"Error reading file: {filename}")

This will print the filename of any file that causes a UnicodeDecodeError when reading.
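Putting both suggestions together, the loop from the question could look roughly like the sketch below. The function name extract_urls and its return shape are assumptions made for illustration. It skips files that fail to decode instead of stopping, and records their names so they can be inspected afterwards.

```python
import json
import os
import re

# URL pattern from the question; it matches the scheme and host part only.
URL_PATTERN = re.compile(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+')


def extract_urls(json_dir, encoding="utf-8"):
    """Return (urls_by_file, failed) for all .json files in json_dir.

    Hypothetical helper: urls_by_file maps each filename to the list of
    URLs found in it, and failed lists files that could not be decoded
    with the given encoding.
    """
    urls_by_file = {}
    failed = []
    for filename in sorted(os.listdir(json_dir)):
        if not filename.endswith(".json"):
            continue
        path = os.path.join(json_dir, filename)
        try:
            with open(path, "r", encoding=encoding) as f:
                data = json.load(f)
        except UnicodeDecodeError:
            # Remember the file and move on to the next one.
            failed.append(filename)
            continue
        # Serialize back to a string so the regex can scan every value.
        urls_by_file[filename] = URL_PATTERN.findall(json.dumps(data))
    return urls_by_file, failed
```

Calling extract_urls(json_dir) with the directory from the question would return the URLs per file along with the names of any files that need a different encoding.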

Answered By: Anthony