Hook into the built-in Python f-string format machinery

Question:

Summary

I really LOVE f-strings. They’re bloody awesome syntax.

For a while now I’ve had an idea for a little library (described below*) to harness them further. A quick example of what I would like it to do:

>>> import simpleformatter as sf
>>> def format_camel_case(string):
...     """camel cases a sentence"""
...     return ''.join(s.capitalize() for s in string.split())
...
>>> @sf.formattable(camcase=format_camel_case)
... class MyStr(str): ...
...
>>> f'{MyStr("lime cordial delicious"):camcase}'
'LimeCordialDelicious'

It would be immensely useful (both for a simplified API and for extending usage to instances of built-in classes) to find a way to hook into the built-in Python formatting machinery, which would allow custom format specifications on built-ins:

>>> f'{"lime cordial delicious":camcase}'
'LimeCordialDelicious'

In other words, I’d like to override the built-in format function (which is used by the f-string syntax) or, alternatively, extend the built-in __format__ methods of existing standard library classes, so that I could write stuff like this:

for x, y, z in complicated_generator:
    eat_string(f"x: {x:custom_spec1}, y: {y:custom_spec2}, z: {z:custom_spec3}")

I have accomplished this by creating subclasses with their own __format__ methods, but of course this will not work for built-in classes.
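
A minimal sketch of what that subclass approach might look like. The `formattable` decorator here is a hypothetical stand-in for the `sf.formattable` idea above, not an existing library API; it falls back to the inherited `__format__` for any spec it does not know:

```python
def formattable(**formatters):
    """Hypothetical class decorator: maps format specs to formatting functions."""
    def decorate(cls):
        original_format = cls.__format__

        def __format__(self, format_spec):
            handler = formatters.get(format_spec)
            if handler is not None:
                return handler(self)
            # Unknown specs fall through to the normal behavior.
            return original_format(self, format_spec)

        cls.__format__ = __format__
        return cls
    return decorate


def format_camel_case(string):
    """camel cases a sentence"""
    return "".join(s.capitalize() for s in string.split())


@formattable(camcase=format_camel_case)
class MyStr(str):
    pass


print(f'{MyStr("lime cordial delicious"):camcase}')  # LimeCordialDelicious
print(f'{MyStr("lime"):>6}')  # standard specs still work
```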

I could get close to it using the string.Formatter api:

my_formatter = MyFormatter()  # custom string.Formatter instance

format_str = "x: {x:custom_spec1}, y: {y:custom_spec2}, z: {z:custom_spec3}"

for x, y, z in complicated_generator:
    eat_string(my_formatter.format(format_str, **locals()))

I find this to be a tad clunky, and definitely not readable compared to the f-string api.
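
For reference, a sketch of what such a custom formatter could look like. `SpecFormatter` and its `handlers` mapping are names made up for this illustration; the only standard library hook relied on is `string.Formatter.format_field`, which is called once per replacement field:

```python
import string


class SpecFormatter(string.Formatter):
    """Hypothetical formatter that dispatches on custom format specs."""

    def __init__(self, **handlers):
        super().__init__()
        self.handlers = handlers

    def format_field(self, value, format_spec):
        handler = self.handlers.get(format_spec)
        if handler is not None:
            return handler(value)
        # Anything we don't recognise is formatted the usual way.
        return super().format_field(value, format_spec)


def format_camel_case(text):
    return "".join(s.capitalize() for s in text.split())


my_formatter = SpecFormatter(camcase=format_camel_case)
result = my_formatter.format(
    "{0:camcase} costs {1:.2f}", "lime cordial delicious", 1.5
)
print(result)  # LimeCordialDelicious costs 1.50
```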

Another thing that could be done is overriding builtins.format:

>>> import builtins
>>> builtins.format = lambda *args, **kwargs: 'womp womp'
>>> format(1,"foo")
'womp womp'

…but this doesn’t work for f-strings:

>>> f"{1:foo}"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Invalid format specifier
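
As far as I can tell, this is because f-strings never look up the name format at all; the value is formatted through its type's __format__ method. The sketch below (all names are illustrative) shows why the other obvious route is closed as well: built-in types refuse attribute assignment, whereas the same patch works on a subclass:

```python
def always_womp(self, format_spec):
    return "womp womp"


# Built-in types are immutable, so this assignment is rejected.
try:
    int.__format__ = always_womp
except TypeError as exc:
    patch_error = exc
else:
    patch_error = None


class MyInt(int):
    pass


# A subclass is an ordinary Python class and can be patched freely.
MyInt.__format__ = always_womp
patched = f"{MyInt(1):foo}"

print(type(patch_error).__name__)
print(patched)
```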

Details

Currently my API looks something like this (somewhat simplified):

import simpleformatter as sf
@sf.formatter("this_specification")
def this_formatting_function(some_obj):
    return "this formatted someobj!"

@sf.formatter("that_specification")
def that_formatting_function(some_obj):
    return "that formatted someobj!"

@sf.formattable
class SomeClass: ...

After which you can write code like this:

some_obj = SomeClass()
f"{some_obj:this_specification}"
f"{some_obj:that_specification}"

I would like the api to be more like the below:

@sf.formatter("this_specification")
def this_formatting_function(some_obj):
    return "this formatted someobj!"

@sf.formatter("that_specification")
def that_formatting_function(some_obj):
    return "that formatted someobj!"

class SomeClass: ...  # no class decorator needed

…and allow use of custom format specs on built-in classes:

x = 1  # built-in type instance
f"{x:this_specification}"
f"{x:that_specification}"

But in order to do these things, we have to burrow our way into the built-in format() function. How can I hook into that juicy f-string goodness?

* NOTE: I’ll probably never actually get around to implementing this library! But I do think it’s a neat idea and invite anyone who wants to, to steal it from me :).

Answers:

Overview

You can, but only if you write evil code that probably should never end up in production software. So let’s get started!

I’m not going to integrate it into your library, but I will show you how to hook into the behavior of f-strings. This is roughly how it’ll work:

  1. Write a function that manipulates the bytecode instructions of code objects to replace FORMAT_VALUE instructions with calls to a hook function;
  2. Customize the import mechanism to make sure that the bytecode of every module and package (except standard library modules and site-packages) is modified with that function.

You can get the full source at https://github.com/mivdnber/formathack, but everything is explained below.

Disclaimer

This solution isn’t great, because

  1. There’s no guarantee at all that this won’t break totally unrelated code;
  2. There’s no guarantee that the bytecode manipulations described here will continue working in newer Python versions: opcodes change between releases, and the ROT_TWO, ROT_THREE and CALL_FUNCTION instructions used below were removed in CPython 3.11. It definitely won’t work in alternative Python implementations that don’t compile to CPython-compatible bytecode. PyPy could work in theory, but the solution described here doesn’t because the bytecode package isn’t 100% compatible.

However, it is a solution, and bytecode manipulation has been used successfully in popular packages like PonyORM. Just keep in mind that it’s hacky, complicated and probably maintenance heavy.

Part 1: Bytecode manipulation

Python code is not executed directly, but is first compiled to a simpler, intermediate, stack-based language that isn’t human-readable, called Python bytecode (it’s what’s inside *.pyc files). To get an idea of what that bytecode looks like, you can use the standard library dis module to inspect the bytecode of a simple function:

def invalid_format(x):
    return f"{x:foo}"

Calling this function will cause an exception, but we’ll "fix" that soon.

>>> invalid_format("bar")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in invalid_format
ValueError: Invalid format specifier

To inspect the bytecode, fire up a Python console and call dis.dis:

>>> import dis
>>> dis.dis(invalid_format)
  2           0 LOAD_FAST                0 (x)
              2 LOAD_CONST               1 ('foo')
              4 FORMAT_VALUE             4 (with format)
              6 RETURN_VALUE

I’ve annotated the output below to explain what’s happening:

# line 2      # Put the value of function parameter x on the stack
  2           0 LOAD_FAST                0 (x)
              # Put the format spec on the stack as a string
              2 LOAD_CONST               1 ('foo')
              # Pop both values from the stack and perform the actual formatting
              # This puts the formatted string on the stack
              4 FORMAT_VALUE             4 (with format)
              # pop the result from the stack and return it
              6 RETURN_VALUE
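
If you want to check which formatting instruction your own interpreter emits before relying on any of this, a small sketch along these lines may help. It only assumes that the relevant opcode name starts with FORMAT; the exact name depends on the CPython version (FORMAT_VALUE on the versions shown here, and, as far as I know, FORMAT_WITH_SPEC from 3.13 on):

```python
import dis


def invalid_format(x):
    return f"{x:foo}"


# Collect the names of all formatting-related instructions in the function.
format_opnames = [
    instruction.opname
    for instruction in dis.get_instructions(invalid_format)
    if instruction.opname.startswith("FORMAT")
]
print(format_opnames)
```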

The idea here is to replace the FORMAT_VALUE instruction with a call to a hook function that allows us to implement whatever behavior we want. Let’s implement it like this for now:

def formathack_hook__(value, format_spec=None):
    """
    Gets called whenever a value is formatted. Right now it's a silly implementation,
    but it can be expanded with all sorts of nasty hacks.
    """
    return f"{value} formatted with {format_spec}"

To replace the instruction, I used the bytecode package, which provides surprisingly nice abstractions for doing horrible things.

import types

from bytecode import Bytecode, Instr


def formathack_rewrite_bytecode__(code):
    """
    Modifies a code object to override the behavior of the FORMAT_VALUE
    instructions used by f-strings.
    """
    decompiled = Bytecode.from_code(code)
    modified_instructions = []
    for instruction in decompiled:
        name = getattr(instruction, 'name', None)
        if name == 'FORMAT_VALUE':
            # 0x04 means that a format spec is present
            if instruction.arg & 0x04 == 0x04:
                callback_arg_count = 2
            else:
                callback_arg_count = 1
            modified_instructions.extend([
                # Load in the callback
                Instr("LOAD_GLOBAL", "formathack_hook__"),
                # Shuffle around the top of the stack to put the arguments on top
                # of the function global
                Instr("ROT_THREE" if callback_arg_count == 2 else "ROT_TWO"),
                # Call the callback function instead of executing FORMAT_VALUE
                Instr("CALL_FUNCTION", callback_arg_count)
            ])
        # Kind of nasty: we want to recursively alter the code of functions.
        elif name == 'LOAD_CONST' and isinstance(instruction.arg, types.CodeType):
            modified_instructions.extend([
                Instr("LOAD_CONST", formathack_rewrite_bytecode__(instruction.arg), lineno=instruction.lineno)
            ])
        else:
            modified_instructions.append(instruction)
    modified_bytecode = Bytecode(modified_instructions)
    # For functions, copy over argument definitions
    modified_bytecode.argnames = decompiled.argnames
    modified_bytecode.argcount = decompiled.argcount
    modified_bytecode.name = decompiled.name
    return modified_bytecode.to_code()

We can now make the invalid_format function we defined earlier work:

>>> invalid_format.__code__ = formathack_rewrite_bytecode__(invalid_format.__code__)
>>> invalid_format("bar")
'bar formatted with foo'

Success! Manually cursing code objects with tainted bytecode in itself won’t damn our souls to an eternity of suffering though; for that, we should manipulate all code automatically.
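
As an aside, the same substitution can also be sketched one level up, on the syntax tree instead of the bytecode. This is a different technique from the one used in this answer, not part of formathack, and the `FStringHookTransformer` name is made up for the illustration. It uses only the standard library ast module, so it does not depend on specific opcodes, but it needs the source code rather than a compiled code object:

```python
import ast


def formathack_hook__(value, format_spec=None):
    # Same placeholder hook as above.
    return f"{value} formatted with {format_spec}"


class FStringHookTransformer(ast.NodeTransformer):
    """Hypothetical helper: wraps each f-string replacement field in a hook call."""

    def visit_FormattedValue(self, node):
        # Rewrite nested fields (e.g. inside the format spec) first.
        self.generic_visit(node)
        if node.conversion != -1:
            # Fields with !r, !s or !a conversions are left alone in this sketch.
            return node
        args = [node.value]
        if node.format_spec is not None:
            # The format spec is itself a JoinedStr expression, so it can be
            # passed to the hook as an ordinary argument.
            args.append(node.format_spec)
        call = ast.Call(
            func=ast.Name(id="formathack_hook__", ctx=ast.Load()),
            args=args,
            keywords=[],
        )
        return ast.FormattedValue(value=call, conversion=-1, format_spec=None)


source = 'def invalid_format(x):\n    return f"{x:foo}"\n'
tree = FStringHookTransformer().visit(ast.parse(source))
ast.fix_missing_locations(tree)

namespace = {"formathack_hook__": formathack_hook__}
exec(compile(tree, "<formathack-demo>", "exec"), namespace)
print(namespace["invalid_format"]("bar"))  # bar formatted with foo
```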

Part 2: Hooking into the import process

To make the new f-string behavior work everywhere, and not just in manually patched functions, we can customize the Python module import process with a custom module finder and loader using the functionality provided by the standard library importlib module:

import importlib.machinery


class _FormatHackLoader(importlib.machinery.SourceFileLoader):
    """
    A module loader that modifies the code of the modules it imports to override
    the behavior of f-strings. Nasty stuff.
    """
    @classmethod
    def find_spec(cls, name, path, target=None):
        # Start out with a spec from a default finder
        spec = importlib.machinery.PathFinder.find_spec(
            fullname=name,
            # Only apply to modules and packages in the current directory
            # This prevents standard library modules or site-packages
            # from being patched.
            path=[""],
            target=target
        )
        if spec is None:
            return None
        
        # Modify the loader in the spec to this loader
        spec.loader = cls(name, spec.origin)
        return spec

    def get_code(self, fullname):
        # This is called by exec_module to get the code of the module
        # to execute it.
        code = super().get_code(fullname)
        # Rewrite the code to modify the f-string formatting opcodes
        rewritten_code = formathack_rewrite_bytecode__(code)
        return rewritten_code

    def exec_module(self, module):
        # We introduce the callback that hooks into the f-string formatting
        # process in every imported module
        module.__dict__["formathack_hook__"] = formathack_hook__
        return super().exec_module(module)
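
To see the exec_module part in isolation, here is a small self-contained sketch. It leaves out the bytecode rewriting and the finder entirely and only demonstrates that a SourceFileLoader subclass can put a name into a module's namespace before the module body runs. `InjectingLoader` and `demo_module` are names invented for this demo:

```python
import importlib.machinery
import importlib.util
import os
import tempfile


def formathack_hook__(value, format_spec=None):
    return f"{value} formatted with {format_spec}"


class InjectingLoader(importlib.machinery.SourceFileLoader):
    def exec_module(self, module):
        # Make the hook visible as a global inside the module being loaded.
        module.__dict__["formathack_hook__"] = formathack_hook__
        super().exec_module(module)


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "demo_module.py")
    with open(path, "w") as handle:
        # The module uses the hook without importing or defining it.
        handle.write('RESULT = formathack_hook__("bar", "foo")\n')

    loader = InjectingLoader("demo_module", path)
    spec = importlib.util.spec_from_file_location(
        "demo_module", path, loader=loader
    )
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

print(module.RESULT)  # bar formatted with foo
```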

To make sure the Python interpreter uses this loader to import all files, we have to add it to sys.meta_path:

import inspect
import sys


def install():
    # If the _FormatHackLoader is not registered as a finder,
    # do it now!
    if sys.meta_path[0] is not _FormatHackLoader:
        sys.meta_path.insert(0, _FormatHackLoader)
        # Tricky part: we want to be able to use our custom f-string behavior
        # in the main module where install was called. That module was loaded
        # with a standard loader though, so that's impossible without additional
        # dirty hacks.
        # Here, we execute the module _again_, this time with _FormatHackLoader
        module_globals = inspect.currentframe().f_back.f_globals
        module_name = module_globals["__name__"]
        module_file = module_globals["__file__"]
        loader = _FormatHackLoader(module_name, module_file)
        loader.load_module(module_name)
        # This is actually pretty important. If we don't exit here, the main module
        # will continue after the formathack.install() call, causing it to run twice!
        sys.exit(0)

If we put it all together in a formathack module (see https://github.com/mivdnber/formathack for an integrated, working example), we can now use it like this:

# In your main Python module, install formathack ASAP
import formathack
formathack.install()

# From now on, f-string behavior will be overridden!

foo = "foo"
print(f"{foo:bar}")
# -> "foo formatted with bar"

So that’s that! You can expand on this to make the hook function more intelligent and useful (e.g. by registering functions that handle certain format specifiers).
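
A rough sketch of what such a smarter hook might look like, assuming a simple module-level registry. The `formatter` decorator and `_SPEC_HANDLERS` dict are made up for this example (loosely modelled on the `sf.formatter` API from the question); unregistered specs are passed on to the built-in format(), so normal f-strings keep working:

```python
_SPEC_HANDLERS = {}


def formatter(spec):
    """Hypothetical decorator: register a function for a custom format spec."""
    def register(func):
        _SPEC_HANDLERS[spec] = func
        return func
    return register


def formathack_hook__(value, format_spec=None):
    if format_spec is None:
        return format(value)
    handler = _SPEC_HANDLERS.get(format_spec)
    if handler is not None:
        return handler(value)
    # Not one of ours: defer to the regular formatting machinery.
    return format(value, format_spec)


@formatter("camcase")
def format_camel_case(value):
    return "".join(word.capitalize() for word in str(value).split())


print(formathack_hook__("lime cordial delicious", "camcase"))  # LimeCordialDelicious
print(formathack_hook__(3.14159, ".2f"))  # 3.14
```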

Answered By: Michilus