Why do the same hash functions in Python and Rust produce different results for the same string?

Question:

TL;DR:

With the same parameters, both hash functions produce the same results, provided that a few preconditions are met.

I am building a system that has parts in Rust and Python. I need a hashing library that produces the same values for the same input on both ends. I thought that Python and Rust both use SipHash 1-3, so I tried to use that.

Python:

>>> import ctypes
>>> from sys import getsizeof
>>> ctypes.c_size_t(hash(b'abcd')).value
14608482441665817778
>>> getsizeof(ctypes.c_size_t(hash(b'abcd')).value)
36
>>> type(b'abcd')
<class 'bytes'>

Rust:

use hashers::{builtin::DefaultHasher};
use std::hash::{Hash, Hasher};

pub fn hash_str(s: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    s.hash(&mut hasher);
    hasher.finish()
}

pub fn hash_bytes(b: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    b.hash(&mut hasher);
    hasher.finish()
}

#[test]
fn test_hash_str() {
    let s1: &str = "abcd";
    let h1: u64 = hash_str(s1);

    assert_eq!(h1, 13543138095457285553);
}
#[test]
fn test_hash_bytes() {
    let b1: &[u8] = "abcd".as_bytes();
    let h1: u64 = hash_bytes(b1);

    assert_eq!(h1, 18334232741324577590);
}

Unfortunately, I am not able to produce the same values on both ends. Is there a way to get the same values somehow?

UPDATE:

After checking Python’s implementation, I found a detail that I originally missed: Python uses a random salt for every run. This means that the result I got from the Python function could not be the same as the Rust version.

This can be disabled with PYTHONHASHSEED=0 python …
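
One way to see the effect of the seed is to compute the hash in fresh interpreter processes. This is only a sketch: the helper name hash_in_fresh_interpreter is made up for illustration, and it assumes the script is allowed to spawn subprocesses.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(seed):
    """Return hash(b'abcd') as computed by a newly started Python process.

    seed is the value to put in PYTHONHASHSEED, or None to leave
    hash randomization switched on.
    """
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)  # start from a clean slate
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    result = subprocess.run(
        [sys.executable, "-c", "print(hash(b'abcd'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    # Fixed seed: both runs print the same number.
    print(hash_in_fresh_interpreter(0), hash_in_fresh_interpreter(0))
    # No seed: the two numbers will almost certainly differ.
    print(hash_in_fresh_interpreter(None), hash_in_fresh_interpreter(None))
```

A fixed seed only makes Python agree with itself across runs; it says nothing about agreeing with another language's hasher.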

However, this still does not make Python produce the same values as the Rust version. I have tried custom SipHash implementations on both ends. The results are consistent on both ends:

Both siphasher::sip::SipHasher13 and DefaultHasher produce the same outputs. The result for a String is the same as for a &str, but different for the .as_bytes() version.

    #[test]
    fn test_hash_string() {
        let s1: String = "abcd".to_string();
        let h1: u64 = hash_string(s1);

        assert_eq!(h1, 13543138095457285553);
    }

    #[test]
    fn test_hash_str() {
        let s1: &str = "abcd";
        let h1: u64 = hash_str(s1);

        assert_eq!(h1, 13543138095457285553);
    }
    #[test]
    fn test_hash_bytes() {
        let b1: &[u8] = "abcd".as_bytes();
        let h1: u64 = hash_bytes(b1);

        assert_eq!(h1, 18334232741324577590);
    }

On Python side after disabling the randomization:

    sh = SipHash(c=1, d=3)
    h = sh.auth(0, "abcd")
    assert h == 16416137402921954953
Asked By: Istvan


Answers:

  • Don’t use the internal hasher for external purposes. It is not meant to be predictable or compatible; it is simply meant to be used for internal hashing. Rust even mentions this in its docs:

    The internal algorithm is not specified, and so it and its hashes
    should not be relied upon over releases.

  • Don’t use the .hash() method of Rust’s types. It’s also not meant for external hashes; it serializes the data in an unspecified, internal way before feeding it to the hasher. Use the hasher’s .write method directly to feed it binary data.

That said, the solution is to use a specific hashing library for your purpose of compatibility, not the internal one.

In Rust, this is probably siphasher if you want SipHash-1-3. I’m unsure about Python, though, as I haven’t used it in a while.

Here’s some example code for Rust:

use siphasher::sip::SipHasher13;

use std::hash::Hasher;

pub fn hash_str(s: &str) -> u64 {
    hash_bytes(s.as_bytes())
}

pub fn hash_bytes(b: &[u8]) -> u64 {
    let mut hasher = SipHasher13::new();
    hasher.write(b);
    hasher.finish()
}

#[test]
fn test_hash_str() {
    let s1: &str = "abcd";
    let h1: u64 = hash_str(s1);

    assert_eq!(h1, 16416137402921954953);
}

#[test]
fn test_hash_bytes() {
    let b1: &[u8] = "abcd".as_bytes();
    let h1: u64 = hash_bytes(b1);

    assert_eq!(h1, 16416137402921954953);
}

Note that, while I really don’t recommend it, the same approach also works with Rust’s internal hasher:

use std::hash::Hasher;

fn main() {
    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    hasher.write("abcd".as_bytes());
    println!("{}", hasher.finish());
}
16416137402921954953
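
For the Python side, SipHash is small enough to sketch in pure Python when no library is at hand. The code below is an illustration, not a vetted implementation: the function name siphash and its parameters are made up here, and the default zero key is assumed to correspond to SipHasher13::new() in the siphasher crate.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def _rotl(x, b):
    """Rotate a 64-bit value left by b bits."""
    return ((x << b) | (x >> (64 - b))) & MASK64


def siphash(data, k0=0, k1=0, c=1, d=3):
    """SipHash-c-d of data with the 128-bit key (k0, k1); 64-bit result."""
    v0 = k0 ^ 0x736F6D6570736575
    v1 = k1 ^ 0x646F72616E646F6D
    v2 = k0 ^ 0x6C7967656E657261
    v3 = k1 ^ 0x7465646279746573

    def rounds(n):
        nonlocal v0, v1, v2, v3
        for _ in range(n):
            v0 = (v0 + v1) & MASK64
            v1 = _rotl(v1, 13) ^ v0
            v0 = _rotl(v0, 32)
            v2 = (v2 + v3) & MASK64
            v3 = _rotl(v3, 16) ^ v2
            v0 = (v0 + v3) & MASK64
            v3 = _rotl(v3, 21) ^ v0
            v2 = (v2 + v1) & MASK64
            v1 = _rotl(v1, 17) ^ v2
            v2 = _rotl(v2, 32)

    # Split into little-endian 64-bit words; the last word carries the
    # leftover bytes plus the message length (mod 256) in its top byte.
    body_end = len(data) - (len(data) % 8)
    words = [int.from_bytes(data[i:i + 8], "little") for i in range(0, body_end, 8)]
    words.append(int.from_bytes(data[body_end:], "little") | ((len(data) & 0xFF) << 56))

    for m in words:
        v3 ^= m
        rounds(c)  # compression rounds
        v0 ^= m

    v2 ^= 0xFF
    rounds(d)  # finalization rounds
    return v0 ^ v1 ^ v2 ^ v3


if __name__ == "__main__":
    print(siphash(b"abcd"))
```

If the sketch is faithful, siphash(b"abcd") should give the same 16416137402921954953 that the Rust tests above assert. Passing c=2, d=4 gives SipHash-2-4, which can be checked against the published reference test vectors.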

Background

So why do s.hash() and s.as_bytes().hash() behave weirdly?

Let’s write a simple debug hasher:

use std::hash::{Hash, Hasher};

struct DebugHasher;

impl Hasher for DebugHasher {
    fn finish(&self) -> u64 {
        0
    }

    fn write(&mut self, bytes: &[u8]) {
        println!("   write: {:?}", bytes);
    }
}

fn main() {
    let s = "abcd";
    println!("--- s ---");
    s.hash(&mut DebugHasher);
    println!("--- s.as_bytes() ---");
    s.as_bytes().hash(&mut DebugHasher);
}
--- s ---
   write: [97, 98, 99, 100]
   write: [255]
--- s.as_bytes() ---
   write: [4, 0, 0, 0, 0, 0, 0, 0]
   write: [97, 98, 99, 100]

Now we have our answer:

  • s seems to append 0xff.
    This can also be seen in its source code:

    fn write_str(&mut self, s: &str) {
        self.write(s.as_bytes());
        self.write_u8(0xff);
    }
    
  • s.as_bytes() seems to attach extra bytes at the front. Its source code shows that this is the length of the slice, written as a usize (8 bytes on a 64-bit target):
    #[stable(feature = "rust1", since = "1.0.0")]
    #[rustc_const_unstable(feature = "const_hash", issue = "104061")]
    impl<T: ~const Hash> const Hash for [T] {
        #[inline]
        fn hash<H: ~const Hasher>(&self, state: &mut H) {
            state.write_length_prefix(self.len());
            Hash::hash_slice(self, state)
        }
    }
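
The two byte streams printed by the debug hasher can be rebuilt on the Python side, which makes the mismatch easy to inspect. This is a sketch under stated assumptions: the function names are invented for illustration, the layout is an unspecified Rust implementation detail that may change between releases, and the length prefix assumes a 64-bit little-endian target.

```python
def rust_str_hash_input(s):
    """Bytes that Rust's str::hash feeds to the hasher (as observed above)."""
    # UTF-8 bytes of the string, followed by a single 0xff terminator.
    return s.encode("utf-8") + b"\xff"


def rust_byte_slice_hash_input(b, usize_bytes=8):
    """Bytes that Rust's <[u8]>::hash feeds to the hasher (as observed above)."""
    # Length prefix as a little-endian usize, followed by the raw bytes.
    return len(b).to_bytes(usize_bytes, "little") + b


if __name__ == "__main__":
    print(list(rust_str_hash_input("abcd")))
    # [97, 98, 99, 100, 255]
    print(list(rust_byte_slice_hash_input(b"abcd")))
    # [4, 0, 0, 0, 0, 0, 0, 0, 97, 98, 99, 100]
```

Neither stream equals the plain b"abcd" that hasher.write() receives, which is why all three hashes differ.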
    
Answered By: Finomnis