On the float_precision argument to pandas.read_csv

Question:

The documentation for the argument in this post’s title says:

float_precision : string, default None

Specifies which converter the C engine should use for floating-point values. The options are None for the ordinary converter, high for the high-precision converter, and round_trip for the round-trip converter.

I’d like to learn more about the three algorithms mentioned, preferably without having to dig into the source code [1].


Q: Do these algorithms have names I can Google for to learn exactly what they do and how they differ?


(Also, one side question: what exactly is "the C engine" in this context? Is that a pandas-specific thing, or a Python-wide thing? None of the above?)


[1] Not being familiar with the code base in question, I expect it would take me a long time just to locate the relevant source code. But even assuming I manage to find it, my experience with this sort of algorithm is that the implementations are so highly optimized, and at such a low level, that without some high-level description it is really difficult, at least for me, to follow what’s going on.

Asked By: kjo

||

Answers:

You asked about the actual algorithms – the closest I can find is:
https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/parsers.pyx#L492

This is taken from a related answer, kudos to MaxU (Understanding pandas.read_csv() float parsing)

Ordinary: double_converter_nogil = xstrtod
High: double_converter_nogil = precise_xstrtod
Round-Trip: double_converter_withgil = round_trip

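From here, you’re in C-land. You also asked what "the C engine" is: it is a pandas-specific thing, not a Python-wide one. read_csv picks its parser through the engine argument, and the C engine is the fast parser whose critical code paths are written in Cython and C. float_precision only affects that engine.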
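
To see whether the converters actually disagree on your data, a small comparison like the one below may help. It is only a sketch: the sample CSV text is made up, and it sticks to the option values quoted in the question (None, "high", "round_trip"). Which converter None selects depends on the pandas version.

```python
import io

import pandas as pd

# Made-up sample data: one float column with a short, a long, and a large value.
csv_text = "x\n0.1\n2.718281828459045235\n1e23\n"

results = {}
for option in (None, "high", "round_trip"):
    frame = pd.read_csv(
        io.StringIO(csv_text),
        engine="c",              # float_precision applies to the C engine
        float_precision=option,
    )
    results[option] = frame["x"].tolist()

# Python's own float() as the reference conversion.
reference = [float(line) for line in csv_text.splitlines()[1:]]

for option, values in results.items():
    print(option, values, "matches float():", values == reference)
```

The round_trip row should agree with float() on every value; the other rows may or may not, depending on the input and the pandas version.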

Answered By: MisterJT

These options represent three different approaches to converting characters to a float. The difference is mostly in the precision. While the question did not ask for the code, the code defines the algorithm and is informative.

The legacy option uses the following algorithm (which is closely related to this code: https://github.com/WarrenWeckesser/textreader/blob/master/src/xstrtod.c):

https://github.com/pandas-dev/pandas/blob/573e7eaffd801ee5bd1f7685697b51eef5b8ed85/pandas/_libs/src/parser/tokenizer.c#L1478
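
For readers who would rather not read the C, here is a rough Python sketch of the general approach as I read the linked code: fold every digit into a double as it arrives, then scale by a power of ten built by repeated squaring. The name naive_strtod is my own, and the sketch leaves out everything else the real function handles (custom decimal and thousands separators, error reporting, overflow checks), so treat it as an illustration rather than a port.

```python
DIGITS = "0123456789"


def naive_strtod(text: str) -> float:
    """Parse [sign]digits[.digits][e[sign]digits] the simple way (sketch only)."""
    pos = 0
    sign = 1.0
    if text[pos] in "+-":
        sign = -1.0 if text[pos] == "-" else 1.0
        pos += 1

    number = 0.0   # running value, kept in a double from the first digit on
    exponent = 0   # power of ten still to be applied

    # Integer part: every digit is folded straight into the double.
    while pos < len(text) and text[pos] in DIGITS:
        number = number * 10.0 + int(text[pos])
        pos += 1

    # Fractional part: same accumulation, remembering how far to shift back.
    if pos < len(text) and text[pos] == ".":
        pos += 1
        while pos < len(text) and text[pos] in DIGITS:
            number = number * 10.0 + int(text[pos])
            exponent -= 1
            pos += 1

    # Optional explicit exponent such as e23 or E-5.
    if pos < len(text) and text[pos] in "eE":
        exponent += int(text[pos + 1:])

    # Scale by 10**exponent using repeated squaring; each step can round.
    p10 = 10.0
    n = abs(exponent)
    while n:
        if n & 1:
            number = number / p10 if exponent < 0 else number * p10
        n >>= 1
        p10 *= p10
    return sign * number


samples = ["1.5", "0.1", "3.141592653589793", "2.718281828459045235", "6.02214076e23"]
for s in samples:
    approx, exact = naive_strtod(s), float(s)
    print(f"{s:>22}  naive={approx!r}  float()={exact!r}  equal={approx == exact}")
```

Every multiplication and division in the scaling loop can introduce a rounding error, so long inputs may land a few units in the last place away from what float() returns.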

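The high option, which as far as I can tell became the default in pandas 1.2 (the release that added the legacy name for the older converter), uses the following: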

https://github.com/pandas-dev/pandas/blob/573e7eaffd801ee5bd1f7685697b51eef5b8ed85/pandas/_libs/src/parser/tokenizer.c#L1615
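
As I read the linked function, the two main ideas are to cap the number of significant digits that get accumulated (17 are enough to identify a double) and to scale with a precomputed table of powers of ten instead of repeated squaring. The Python sketch below illustrates those two ideas only. The names precise_strtod_sketch, MAX_DIGITS and POW10 are mine, and details such as the handling of leading zeros are simplifications, not a faithful port.

```python
import math

DIGITS = "0123456789"
MAX_DIGITS = 17  # significant digits kept; later digits are dropped
# Table of powers of ten, each one the nearest double to 10**k.
POW10 = [float(10 ** k) for k in range(309)]


def precise_strtod_sketch(text: str) -> float:
    """Parse [sign]digits[.digits][e[sign]digits] with one scaling step (sketch)."""
    pos, sign = 0, 1.0
    if text[pos] in "+-":
        sign = -1.0 if text[pos] == "-" else 1.0
        pos += 1

    number, exponent, num_digits = 0.0, 0, 0
    seen_point = False
    while pos < len(text) and (text[pos] in DIGITS or (text[pos] == "." and not seen_point)):
        ch = text[pos]
        pos += 1
        if ch == ".":
            seen_point = True
        elif num_digits < MAX_DIGITS:
            number = number * 10.0 + int(ch)
            if number != 0.0:
                num_digits += 1      # leading zeros do not use up the budget
            if seen_point:
                exponent -= 1
        elif not seen_point:
            exponent += 1            # a dropped integer digit still shifts the scale

    if pos < len(text) and text[pos] in "eE":
        exponent += int(text[pos + 1:])

    # Apply the power of ten in one step (two for very small results).
    if number == 0.0:
        return sign * 0.0
    if exponent > 308:
        number = math.inf
    elif exponent > 0:
        number *= POW10[exponent]
    elif exponent < -616:
        number = 0.0                 # far below the smallest subnormal
    elif exponent < -308:
        number /= POW10[-308 - exponent]
        number /= POW10[308]
    elif exponent < 0:
        number /= POW10[-exponent]
    return sign * number


for s in ["0.1", "3.14159", "123456.789", "2.718281828459045235", "1e-320"]:
    print(f"{s:>22}  sketch={precise_strtod_sketch(s)!r}  float()={float(s)!r}")
```

When the digits fit exactly in a double and the power of ten is itself exact (up to 10**22), the single division or multiplication is correctly rounded, which is why short inputs come out identical to float(). Longer inputs can still be off in the last place.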

The round_trip option uses Python’s own PyOS_string_to_double, which is by all measures the most complicated of the three. This approach guarantees consistency with the other places where Python interprets strings as floats, but because it can set Python exceptions it must hold the GIL.

https://github.com/pandas-dev/pandas/blob/573e7eaffd801ee5bd1f7685697b51eef5b8ed85/pandas/_libs/src/parser/tokenizer.c#L1828

The core of PyOS_string_to_double is the private function _Py_dg_strtod, which is closely based on David M. Gay’s dtoa.c (http://www.netlib.org/fp/dtoa.c):

https://github.com/python/cpython/blob/054328f0dd3e9ee5a8a026dfcfa606baf7e9f052/Python/dtoa.c#L1439
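
Since CPython's built-in float() goes through the same PyOS_string_to_double routine, the two properties that matter here can be checked from pure Python. The sketch below assumes Python 3.9 or later (for math.nextafter); the helper name is_correctly_rounded and the sample strings are my own.

```python
import math
import random
from fractions import Fraction


def is_correctly_rounded(text: str) -> bool:
    """True if float(text) is at least as close to the exact decimal value
    as either neighbouring double (finite, moderate-sized inputs only)."""
    exact = Fraction(text)                 # exact rational value of the decimal string
    value = float(text)
    error = abs(Fraction(value) - exact)
    below = math.nextafter(value, -math.inf)
    above = math.nextafter(value, math.inf)
    return all(error <= abs(Fraction(n) - exact) for n in (below, above))


# Property 1: writing a float with repr() and parsing it back gives the same float.
rng = random.Random(0)
samples = [rng.uniform(-1e6, 1e6) for _ in range(1000)]
round_trips = all(float(repr(x)) == x for x in samples)

# Property 2: parsing picks the double nearest to the decimal value.
texts = ["0.1", "123456.789", "2.718281828459045235", "1e23"]
correctly_rounded = all(is_correctly_rounded(t) for t in texts)

print("repr/float round trip holds:", round_trips)
print("float() correctly rounded on samples:", correctly_rounded)
```

The first property is what gives the round_trip option its name: a float written out with repr() and read back with this converter comes back bit for bit identical.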

Answered By: flexatone