Why do os.getppid() and multiprocessing.parent_process().pid give different results when using multiprocessing in Airflow 2.x?

Question:

I found that using multiprocessing inside an Airflow task raises an assertion error. I solved that error (this discussion and this discussion), but I was curious about how processes actually work inside an Airflow job, so I ran the following code:

import concurrent.futures
import multiprocessing
import os
import time


def process_function(i):
    # multiprocessing.parent_process() may return None, so guard before reading .pid / .daemon
    parent = multiprocessing.parent_process()
    parent_process_pid = parent.pid if parent is not None else None
    parent_process_daemon = parent.daemon if parent is not None else None
    current_process_pid = multiprocessing.current_process().pid
    is_daemon = multiprocessing.current_process().daemon
    result = (
        f"{i}th task : "
        + "parent_process : "
        + str(parent_process_pid)
        + " is daemon : "
        + str(parent_process_daemon)
        + " current_process : "
        + str(current_process_pid)
        + " is daemon : "
        + str(is_daemon)
    )
    time.sleep(3)
    return result


def mp(run_n: int):
    print("start checking multiprocessing pid task")
    print("[1] check pid using os module")
    print(f"parent process : {os.getppid()} current process : {os.getpid()}")
    print("[2] check pid using multiprocessing module")
    mp_parent = multiprocessing.parent_process()
    print(
        f"parent process : {mp_parent.pid if mp_parent is not None else None}"
        f" is daemon? : {mp_parent.daemon if mp_parent is not None else None}"
        f" process : {multiprocessing.current_process().pid}"
        f" is daemon? : {multiprocessing.current_process().daemon}"
    )

    results = []
    print("start job")
    with concurrent.futures.ProcessPoolExecutor() as process_executor:
        for pp_res in process_executor.map(process_function, range(run_n)):
            results.append(pp_res)
    print("job done")
    for c in results:
        print(c)
...

with models.DAG(
    dag_id="daemon_test",
    description="daemon_test",
    schedule_interval="0 * * * *",
    default_args=default_args,
    catchup=False,
) as dag:
    test_job = PythonOperator(
        task_id="test_job",
        python_callable=mp,
        op_kwargs={
            "run_n": 5,
        },
    )

and the result:

{standard_task_runner.py:52} INFO - Started process 67070 to run task
...
start checking multiprocessing pid task
[1] check pid using os module
parent process : 67069 current process : 67070

[2] check pid using multiprocessing module
parent process : None is daemon? : None process : 67070 is daemon? : True
...
0th task : parent_process : 67070 is daemon : False current_process : 67071 is daemon : True
1th task : parent_process : 67070 is daemon : False current_process : 67072 is daemon : True
2th task : parent_process : 67070 is daemon : False current_process : 67073 is daemon : True
3th task : parent_process : 67070 is daemon : False current_process : 67074 is daemon : True
4th task : parent_process : 67070 is daemon : False current_process : 67071 is daemon : True

Even though the PID values can change, I don't see how a None value should ever come out. Does anyone know the reason for this?

Asked By: London_-_


Answers:

You’re right; a PID (or a parent PID) can’t be None.

multiprocessing.parent_process() simply returns the value of the internal module-level global multiprocessing.process._parent_process in the process that calls it.

That global is only set here, in a subprocess, when that process was spawned as a multiprocessing child.
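
As a minimal standalone sketch (not Airflow code; the show helper and its labels are just illustrative), you can see that the global is unset in the main process but populated in a multiprocessing child:

import multiprocessing
import os


def show(label):
    # parent_process() is backed by multiprocessing.process._parent_process,
    # which is populated only in processes started by multiprocessing itself.
    pp = multiprocessing.parent_process()
    print(f"{label}: parent_process()={pp!r}, os.getppid()={os.getppid()}")


if __name__ == "__main__":
    show("main process")  # parent_process() -> None
    child = multiprocessing.Process(target=show, args=("mp child",))
    child.start()         # in the child: parent_process() -> <_ParentProcess ...>
    child.join()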

os.getppid(), on the other hand, just calls the OS function to get the parent PID, be it a multiprocessing parent or something else.
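
That is essentially what happens in your Airflow task: the task process is typically created by the task runner with a plain fork, not through multiprocessing, so parent_process() stays None while os.getppid() still reports the real OS parent (67069 in your log). A hedged, Unix-only sketch that reproduces the same effect with a bare os.fork():

import multiprocessing
import os

# A child created by os.fork() is a real OS child, but not a multiprocessing
# child: multiprocessing.process._parent_process is never set in it.
pid = os.fork()
if pid == 0:
    print(
        "forked child:",
        "parent_process() =", multiprocessing.parent_process(),  # None
        "os.getppid() =", os.getppid(),                          # the parent's PID
    )
    os._exit(0)
else:
    os.waitpid(pid, 0)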

Answered By: AKX