Get Python function source excluding the docstring?
Question:
You might want the docstring not to affect a hash of the function, for example as in joblib's Memory.
Is there a good way of stripping the docstring? inspect.getsource and inspect.getdoc kind of fight each other: the docstring is “cleaned” by inspect.getdoc, so it no longer matches the raw text that inspect.getsource returns.
Answers:
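One approach is to delete the docstring from the source using a regex:
import re
# \s* skips the whitespace before the docstring; re.DOTALL lets .*? span several lines.
nodoc = re.sub(r":\s*'''.*?'''", ":", source, flags=re.DOTALL)
nodoc = re.sub(r':\s*""".*?"""', ":", nodoc, flags=re.DOTALL)
This currently works for functions and classes only; maybe someone can find a pattern for modules too.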
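To tie this back to the hashing use case, here is a minimal sketch that applies the same idea to a source string and hashes the result. The helper name source_hash_ignoring_docstrings is made up for illustration, and the patterns are a variant that uses \s* and re.DOTALL so that indented and multi-line docstrings also match. A regex like this can still be fooled by other triple-quoted strings that follow a colon (a dict value, for instance), so treat it as a rough tool.

```python
import hashlib
import re

# A colon, optional whitespace, then a triple-quoted string.
# re.DOTALL lets ".*?" cross line breaks inside the docstring.
_TRIPLE_SINGLE = re.compile(r":\s*'''.*?'''", re.DOTALL)
_TRIPLE_DOUBLE = re.compile(r':\s*""".*?"""', re.DOTALL)


def source_hash_ignoring_docstrings(source: str) -> str:
    """Hash source text after removing docstrings (hypothetical helper)."""
    # Put the colon back so the def/class header stays intact.
    nodoc = _TRIPLE_SINGLE.sub(":", source)
    nodoc = _TRIPLE_DOUBLE.sub(":", nodoc)
    return hashlib.sha256(nodoc.encode("utf-8")).hexdigest()


with_doc = 'def f(x):\n    """Doc."""\n    return x\n'
other_doc = "def f(x):\n    '''Another\n    doc.'''\n    return x\n"
print(source_hash_ignoring_docstrings(with_doc)
      == source_hash_ignoring_docstrings(other_doc))  # True
```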
If you just want to hash the body of a function, regardless of the docstring, you can use the function's __code__ attribute.
It gives access to a code object whose bytecode (co_code) is not affected by the docstring.
Unfortunately, using this, you will not be able to get a readable version of the source.
def foo():
    """Prints 'foo'"""
    print('foo')

print(foo.__doc__)  # Prints 'foo'
print(foo.__code__.co_code)  # b't\x00d\x01\x83\x01\x01\x00d\x02S\x00'
foo.__doc__ += 'pouet'
print(foo.__doc__)  # Prints 'foo'pouet
print(foo.__code__.co_code)  # b't\x00d\x01\x83\x01\x01\x00d\x02S\x00'
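As a rough sketch of how this could feed a hash, the snippet below digests co_code with hashlib. The helper name bytecode_hash is hypothetical. Keep in mind that co_code only holds the instructions; constants and names live elsewhere in the code object (co_consts, co_names), so a real cache key would probably need more than this.

```python
import hashlib


def bytecode_hash(func):
    """Return a SHA-256 hex digest of func's raw bytecode (hypothetical helper)."""
    return hashlib.sha256(func.__code__.co_code).hexdigest()


def foo():
    """Prints 'foo'"""
    print('foo')


before = bytecode_hash(foo)
foo.__doc__ = "A completely different docstring"
after = bytecode_hash(foo)
print(before == after)  # True: changing __doc__ does not touch the bytecode
```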
There is a simple solution:
def fun(a, b):
    '''hahah'''
    return a + b

# we simply delete the docstring
fun.__doc__ = ''
help(fun)  # help() prints by itself; wrapping it in print() would also print None
This code yields:
Help on function fun in module __main__:

fun(a, b)
In case anyone is still looking for a solution for this, this is how I managed to build it:
from ast import Constant, Expr, FunctionDef, Module, parse
from inspect import getsource
from textwrap import dedent
from types import FunctionType
from typing import cast

def get_source_without_docstring(obj: FunctionType) -> str:
    # Get cleanly indented source code of the function
    obj_source = dedent(getsource(obj))
    # Parse the source code into an Abstract Syntax Tree.
    # The root of this tree is a Module node.
    module: Module = parse(obj_source)
    # The first child of the Module node is the FunctionDef node that represents
    # the function definition. The cast only informs the type checker.
    function_def = cast(FunctionDef, module.body[0])
    # The first statement of a function could be a docstring, which in the AST
    # is represented as an Expr node wrapping a string Constant.
    first_stmt = function_def.body[0]
    # Only a bare string literal counts as a docstring; any other expression
    # statement (for example a function call) must be kept.
    if (
        isinstance(first_stmt, Expr)
        and isinstance(first_stmt.value, Constant)
        and isinstance(first_stmt.value.value, str)
    ):
        # Split the original source code by lines
        code_lines: list[str] = obj_source.splitlines()
        # Delete the lines corresponding to the docstring from the list.
        # Note: We are using 0-based list index, but the line numbers in the
        # parsed AST nodes are 1-based. So, we need to subtract 1 from the
        # 'lineno' property of the node.
        del code_lines[first_stmt.lineno - 1 : first_stmt.end_lineno]
        # Join the remaining lines back into a single string
        obj_source = "\n".join(code_lines)
    # Return the source code of the function without its docstring
    return obj_source
Note: code by myself, comments by OpenAI’s GPT
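If the exact original formatting does not matter, a variation is to drop the docstring node from the tree and regenerate the source with ast.unparse (available from Python 3.9). This is only a sketch under a few assumptions: strip_docstrings is a made-up name, it takes a source string rather than a function object, and the regenerated code loses comments and the original layout.

```python
import ast

# Node types that can carry a docstring.
_DOC_OWNERS = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)


def strip_docstrings(source: str) -> str:
    """Return source regenerated from its AST with docstrings removed."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, _DOC_OWNERS):
            continue
        if ast.get_docstring(node, clean=False) is None:
            continue
        # The docstring is always the first statement of the body.
        node.body = node.body[1:]
        if not node.body and not isinstance(node, ast.Module):
            # A def or class needs at least one statement to stay valid.
            node.body.append(ast.Pass())
    return ast.unparse(tree)


src = 'def add(a, b):\n    """Return a + b."""\n    return a + b\n'
print(strip_docstrings(src))
# def add(a, b):
#     return a + b
```

Because the output is regenerated from the tree, two sources that differ only in their docstrings should come out identical, which makes the result convenient to hash.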