Some performance benchmarks of Rust and Python, and basic usage of PyO3

Preface

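Something I have been thinking about lately: the speed of Python. I am a Python fan, and one of the biggest complaints people have about Python is that it is slow; some people even refuse to try it because it is simply too slow. What I want to say here is that Python's slowness does not come from its standard library. If you are fluent with the standard library and the built-in methods, performance in some areas is close to C. Python's slowness shows in complex business logic, and in that case we can extract the computation-heavy parts and handle them in one place.

Play to each tool's strengths. Premature optimization is the root of all evil, so for now, use Python to build your product prototype quickly!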

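Test 1: counting a given character in a string of n characters

Here I tested three variants: a Rust release build, the same logic packaged as a .so with PyO3, and pure Python.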

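In the Rust release build I tried for_each/for, match/if and filter().count(). Their performance is about the same, roughly 60 ms each (if is slightly faster than match). Interestingly, an empty loop over the same string (the decimal digits of the numbers 0 to 9,999,999 concatenated) already takes about 40 ms. My guess is that Rust spends this time decoding the String's underlying Vec<u8> (UTF-8) into char values.
An example comparing char lookups in Rust: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=4bfec0bf097055662dbe8037eae34694 The variants get faster from top to bottom.
If the whole string consists of ASCII characters, comparing with bytes().fold is the fastest, followed by iterating over bytes() with an if/match check...
A parallel computing library worth sharing, rayon: https://rustcc.cn/article?id=181e0a73-6742-42a9-b7a1-1c00bef436c2 The last example uses it; it distributes the work according to the number of CPU cores.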
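
Why a byte-wise scan can skip the decoding step can also be seen from the Python side. The following is a minimal sketch, assuming the needle is a single ASCII character; `count_ascii_byte` is a hypothetical helper name and not part of the benchmark. In UTF-8, every byte of a multi-byte character is 0x80 or above, so an ASCII byte can be counted directly in the encoded data.

```python
def count_ascii_byte(text: str, needle: str) -> int:
    """Count an ASCII character by scanning the UTF-8 bytes of `text`."""
    if len(needle) != 1 or not needle.isascii():
        raise ValueError("needle must be a single ASCII character")
    # Bytes of multi-byte UTF-8 sequences are all >= 0x80, so they can
    # never be mistaken for an ASCII byte such as b"0".
    return text.encode("utf-8").count(needle.encode("ascii"))


sample = "a0é0日本0"
# The character count and the byte count differ for non-ASCII text.
print(len(sample), len(sample.encode("utf-8")))
# The byte-wise count still agrees with the character-wise count.
print(count_ascii_byte(sample, "0") == sample.count("0"))
```

`str.isascii` needs Python 3.7 or later. The same reasoning suggests the bytes() variants stay correct for an ASCII needle even when the text is not pure ASCII.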

Rust test

use rayon::prelude::*;

fn main() {
   println!("Hello, world!");

   let mut s = "".to_string();
   for i in 0..1000_0000 {
       s += &(*i.to_string());
   }


   let _s = s.clone();
   let mut count = 0;
   let t = std::time::Instant::now();
   for i in _s.chars(){
       match i {
           '0' => count += 1,
           _ => (),
       }
   }
   println!("{}, match: {:?}", count, t.elapsed());


   let _s = s.clone();
   let mut count = 0;
   let t = std::time::Instant::now();
   for i in _s.chars(){
       if i =='0' {
           count += 1;
       }
   }
   println!("{}, if: {:?}", count, t.elapsed());


   let _s = s.clone();
   let t = std::time::Instant::now();
   let count = _s.chars().filter(|x| -> bool {
       x == &'0'
   }).count();
   println!("{}, chars().filter().count(): {:?}",count, t.elapsed());


   let _s = s.clone();
   let mut count = 0;
   let t = std::time::Instant::now();
   _s.chars().for_each(|x|{
       match x {
           '0' => count += 1,
           _ => (),
       }
   });
   println!("{}, for_each: {:?}",count, t.elapsed());


   let _s = s.clone();
   let count = 0; // nothing is counted in the empty loop
   let t = std::time::Instant::now();
   for _c in _s.chars(){
   }
   println!("{}, empty loop: {:?}",count, t.elapsed());


   let _s = s.clone();
   let t = std::time::Instant::now();
   let count = _s.chars().fold(0,|count, b| if b == '0' { count + 1 } else { count });
   println!("{}, chars().fold() : {:?}",count, t.elapsed());


   let _s = s.clone();
   let t = std::time::Instant::now();
   let count = _s.bytes().fold(0,|count, b| if b == b'0' { count + 1 } else { count });
   println!("{}, bytes().fold() : {:?}",count, t.elapsed());


   let _s = s.clone();
   let t = std::time::Instant::now();
   let count = _s.par_bytes().fold(|| 0,|cnt, b| if b == b'0' { cnt + 1 } else { cnt }).sum::<u32>();
   println!("{:?}, par_bytes().fold().sum() : {:?}",count, t.elapsed());
}

Output

Hello, world!
5888890, match: 66.096802ms
5888890, if: 60.917747ms
5888890, chars().filter().count(): 74.986586ms
5888890, for_each: 71.493254ms
0, empty loop: 48.239113ms
5888890, chars().fold() : 80.2318ms
5888890, bytes().fold() : 14.920846ms
5888890, par_bytes().fold().sum() : 4.275235ms

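PyO3 test

When the character comparison is done in Rust and packaged as a .so, it is about twice as slow as Python's native count method (partly because the data has to be copied from Python memory into Rust). With rayon on the Rust side, it is roughly twice as fast as count in the run below. One caveat with rayon is that it saturates all CPU cores, whereas the count method uses only a single core.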
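
The split-and-combine structure behind rayon's fold().sum() can be sketched in plain Python. This is an illustration under stated assumptions, not a reimplementation of rayon: `chunked_count` is a hypothetical name, the data is cut into one chunk per worker, and the needle is restricted to a single byte so that no match can straddle a chunk boundary. It shows the shape of the computation only and is not meant as a speed-up with CPython threads.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def chunked_count(data: bytes, needle: bytes, workers: int = 0) -> int:
    """Count a single byte by splitting `data` into one chunk per worker."""
    if len(needle) != 1:
        raise ValueError("needle must be exactly one byte")
    workers = workers or os.cpu_count() or 1
    # Ceiling division so that every byte lands in exactly one chunk.
    size = max(1, -(-len(data) // workers))
    # Note: slicing copies each chunk, which a real implementation would avoid.
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # "fold": each worker counts inside its own chunk.
        partial_counts = pool.map(lambda chunk: chunk.count(needle), chunks)
        # "sum": combine the per-chunk results.
        return sum(partial_counts)


digits = "".join(str(i) for i in range(100_000)).encode("ascii")
print(chunked_count(digits, b"0") == digits.count(b"0"))
```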

Rust side

use pyo3::prelude::*;
use pyo3::wrap_pyfunction;
use pyo3::types::*;
use rayon::prelude::*;

// Define a module
#[pymodule]
fn base_type(_py: Python, m: &PyModule) -> PyResult<()> {
    // Propagate registration errors with `?` instead of panicking via unwrap()
    m.add_function(wrap_pyfunction!(sum_ab_1, m)?)?;
    m.add_function(wrap_pyfunction!(sum_ab_2, m)?)?;
    m.add_function(wrap_pyfunction!(count_1, m)?)?;
    m.add_function(wrap_pyfunction!(count_2, m)?)?;
    Ok(())
}


// Add two numbers (using the i32 type)
#[pyfunction]
pub fn sum_ab_1(a:i32, b:i32) -> i32 {
    a + b
}

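// Add two numbers (received as the Python types implemented by PyO3)
#[pyfunction]
pub fn sum_ab_2(a:&PyLong, b:&PyAny) -> PyResult<i32> {
    let a = a.extract::<i32>()?;  // extract with an explicit type parameter
    let b:i32 = b.extract()?;  // for PyAny, the target type comes from the variable annotation

    Ok(a + b)
}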

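// Count the character '0' in a string (single-threaded)
#[pyfunction]
pub fn count_1(a:&PyString, _b:&PyString) -> PyResult<i32> {
    // `_b` is unused: the needle is hard-coded to '0' in this benchmark
    let mut c = 0;
    // extract() copies the Python string into a Rust String
    let s: String = a.extract()?;
    for i in s.chars() {
        if i == '0' {
            c += 1
        };
    };
    Ok(c)
}

// Count the character '0' in a string (parallel, using rayon)
#[pyfunction]
pub fn count_2(a:&PyString, _b:&PyString) -> PyResult<i32> {
    // `_b` is unused here as well
    let s: String = a.extract()?;
    let count: i32 = s.par_chars().fold(|| 0, |cnt, i| {
        if i == '0' {
            cnt + 1
        } else {
            cnt
        }
    }, ).sum();
    Ok(count)
}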



#[cfg(test)]
mod test {

    #[test]
    fn test_find() {
        println!("Ok")
        // let mut s = "".to_string();
        // for i in 0..10000000 {
        //     s += &(*i.to_string());
        // }
        // let mut c = 0;
        // let a = "".to_string();
        // let t = std::time::Instant::now();
        // for i in s.chars(){
        //     match i {
        //         '0' => c += 1,
        //         _ => (),
        //     }
        // }
        // println!("{:?}", t.elapsed());
    }
}

Python side

import base_type
import time

# #  Number test
# st = time.perf_counter()
# print(base_type.sum_ab_2(1,2))
# print(time.perf_counter() - st)
#
#
#
# st = time.perf_counter()
# print(base_type.sum_ab_1(1,2))
# print(time.perf_counter() - st)


# String character-count test
a = ""
for i in range(1000_0000):
    a += str(i)

st = time.perf_counter()
print("count: {}, {:.10f}".format(a.count("0"),time.perf_counter() - st))


st = time.perf_counter()
c = 0
for ch in a:
    if ch == "0":
        c += 1
print("for: {}, {:.10f}".format(c, time.perf_counter() - st))


st = time.perf_counter()
print("chars: {}, {:.10f}".format(base_type.count_1(a,"0"), time.perf_counter() - st))

st = time.perf_counter()
print("par_chars: {}, {:.10f}".format(base_type.count_2(a,"0"), time.perf_counter() - st))

Output

count: 5888890, 0.04730614899999974
for: 5888890, 3.882953487
chars: 5888890, 0.09862988599999944
par_chars: 5888890, 0.023387593000000706
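
Each timing above comes from a single perf_counter reading, which can include noise from the rest of the system. A hedged sketch of a steadier measurement with the standard timeit module follows. It uses a smaller input so that it finishes quickly, and `loop_count` is a hypothetical name for the explicit loop; the absolute numbers will differ from the ones above.

```python
import timeit

text = "".join(str(i) for i in range(100_000))


def loop_count(s: str, needle: str) -> int:
    """Count `needle` with an explicit Python-level loop."""
    total = 0
    for ch in s:
        if ch == needle:
            total += 1
    return total


# Take the best of several runs to reduce the influence of system noise.
builtin_best = min(timeit.repeat(lambda: text.count("0"), number=5, repeat=3))
loop_best = min(timeit.repeat(lambda: loop_count(text, "0"), number=5, repeat=3))

print("str.count: {:.6f}s for 5 calls".format(builtin_best))
print("for loop:  {:.6f}s for 5 calls".format(loop_best))
```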

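Test 2: swapping the keys and values of the hashmaps in a list

Here I swap keys and values in a vector made of 10 million hashmaps. Surprisingly, the Rust release build needs about 4 seconds while Python needs only about 5 seconds, so the gap is small. Maybe my Rust code is not optimal; if you know a better implementation, please let me know.

I also compared the time taken by BTreeMap and HashMap. In my measurements HashMap is a bit faster on both sides.

One thing I learned here: into_iter is more than twice as fast as iter. ~~If you want to keep ownership of the original and also produce new data, iter has to copy or clone.~~ The source code involved is unsafe and I did not read it closely. (I have not fully figured this out: iter yields references (&) to the items and into_iter yields the owned items, so iter itself does not clone anything and only works with references. Why the difference is so large remains to be verified. A likely explanation is that building a HashMap<i32, String> from borrowed (&String, &i32) pairs requires cloning every String key, which allocates, while into_iter moves the existing String into the new map.)

If the data is not needed afterwards, prefer into_iter.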

Rust test

    use std::collections::{BTreeMap, HashMap};

    pub fn reverse_dict_String_BTreeMap(li: Vec<BTreeMap<String, i32>>) -> Vec<BTreeMap<i32,String>> {
        li.into_iter().map(|x| {
            let mut new_hashmap: BTreeMap<i32, String> = BTreeMap::new();
            for (k, v) in x.into_iter() {
                new_hashmap.insert(v, k);
            };
            new_hashmap
        }).collect()
    }

    pub fn reverse_dict_String_HashMap(li: Vec<HashMap<String, i32>>) -> Vec<HashMap<i32,String>> {
        li.into_iter().map(|x| {
            let mut new_hashmap: HashMap<i32, String> = HashMap::new();
            for (k, v) in x.into_iter() {
                new_hashmap.insert(v, k);
            };
            new_hashmap
        }).collect()
    }

    let mut v:Vec<BTreeMap<String,i32>> = Vec::with_capacity(1000_0000);
    for i in 0..1000_0000{
        let mut h:BTreeMap<String,i32> = BTreeMap::new();
        h.insert(i.to_string(), i);
        v.push(h)
    }
    let t = std::time::Instant::now();
    reverse_dict_String_BTreeMap(v);
    println!("BTreeMap 1000w: {:?}", t.elapsed());
    let mut v:Vec<BTreeMap<String,i32>> = Vec::with_capacity(10_0000);
    for i in 0..10_0000{
        let mut h:BTreeMap<String,i32> = BTreeMap::new();
        h.insert(i.to_string(), i);
        v.push(h)
    }
    let t = std::time::Instant::now();
    reverse_dict_String_BTreeMap(v);
    println!("BTreeMap 10w: {:?}", t.elapsed());



    let mut v:Vec<HashMap<String,i32>> = Vec::with_capacity(1000_0000);
    for i in 0..1000_0000{
        let mut h:HashMap<String,i32> = HashMap::new();
        h.insert(i.to_string(), i);
        v.push(h)
    }
    let t = std::time::Instant::now();
    reverse_dict_String_HashMap(v);
    println!("HashMap 1000w: {:?}", t.elapsed());
    let mut v:Vec<HashMap<String,i32>> = Vec::with_capacity(10_0000);
    for i in 0..10_0000{
        let mut h:HashMap<String,i32> = HashMap::new();
        h.insert(i.to_string(), i);
        v.push(h)
    }
    let t = std::time::Instant::now();
    reverse_dict_String_HashMap(v);
    println!("HashMap 10w: {:?}", t.elapsed());

Output (1000w means 10 million maps, 10w means 100 thousand maps)

BTreeMap 1000w: 6.109998562s
BTreeMap 10w: 76.668964ms
HashMap 1000w: 4.25503134s
HashMap 10w: 47.265982ms

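PyO3 test

Here I tried two map types, BTreeMap and HashMap, packaged into a .so and called from Python. As above, HashMap is also the fastest when called from Python. On the Rust side I tested both &str and String as the received key type, and &str is clearly faster.

Clearly, packaging this as a dynamic library for Python does not make it faster; it is actually slower. Part of the reason is that passing the data from Python to Rust spends most of the time on copying memory.

An operation like this is not worth optimizing separately.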

Rust side

// Swap the keys and values of every map in the vector
#[pyfunction]
pub fn reverse_dict_String_BTreeMap(li: Vec<BTreeMap<String, i32>>) -> Vec<BTreeMap<i32,String>> {
    li.into_iter().map(|x| {
        let mut new_hashmap: BTreeMap<i32, String> = BTreeMap::new();
        for (k, v) in x.into_iter() {
            new_hashmap.insert(v, k);
        };
        new_hashmap
    }).collect()
}
#[pyfunction]
pub fn reverse_dict_String_HashMap(li: Vec<HashMap<String, i32>>) -> Vec<HashMap<i32,String>> {
    li.into_iter().map(|x| {
        let mut new_hashmap: HashMap<i32, String> = HashMap::new();
        for (k, v) in x.into_iter() {
            new_hashmap.insert(v, k);
        };
        new_hashmap
    }).collect()
}
#[pyfunction]
pub fn reverse_dict_str_HashMap(li: Vec<HashMap<&str, i32>>) -> Vec<HashMap<i32,&str>> {
    li.into_iter().map(|x| {
        let mut new_hashmap: HashMap<i32, &str> = HashMap::new();
        for (k, v) in x.into_iter() {
            new_hashmap.insert(v, k);
        };
        new_hashmap
    }).collect()
}

Python side

# 3. Take list[dict1, dict2, dict3, ...], swap k and v, and return a new list of dicts
li = []
for i in range(1000_0000):
    li.append({str(i):i})

def python_reverse_dict_1(li):
    new_li = []
    for d in li:
        [(k,v)] = d.items()
        new_li.append({v:k})
    return new_li


st = time.perf_counter()
base_type.reverse_dict_String_BTreeMap(li)
print("reverse_dict_String_BTreeMap:  {:.10f}".format(time.perf_counter() - st))

st = time.perf_counter()
base_type.reverse_dict_String_HashMap(li)
print("reverse_dict_String_HashMap:  {:.10f}".format(time.perf_counter() - st))


st = time.perf_counter()
base_type.reverse_dict_str_HashMap(li)
print("reverse_dict_str_HashMap:  {:.10f}".format(time.perf_counter() - st))


st = time.perf_counter()
python_reverse_dict_1(li)
print("python_reverse_dict_1:  {:.10f}".format(time.perf_counter() - st))

Output

reverse_dict_String_BTreeMap:  16.7254950910
reverse_dict_String_HashMap:  13.5615541420
reverse_dict_str_HashMap:  10.8162092780
python_reverse_dict_1:  5.3796175480
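
The pure-Python baseline above stays entirely inside the interpreter's own dict objects, so nothing has to be converted or copied across a language boundary. A minimal alternative sketch using comprehensions is shown here; `reverse_dicts` is a hypothetical name, it was not part of the benchmark, and it has not been timed against python_reverse_dict_1. Unlike the baseline, it also accepts dicts with more than one entry.

```python
def reverse_dicts(dicts):
    """Return a new list where every dict has its keys and values swapped."""
    # Works for dicts of any size; values must be hashable to become keys.
    return [{value: key for key, value in d.items()} for d in dicts]


sample = [{str(i): i} for i in range(5)]
print(reverse_dicts(sample))
```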

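Using PyO3

Mapping between Rust parameter types and Python types

When accepting function arguments, you can use Rust's own types or the Python types implemented by PyO3.
The Python types implemented by PyO3 all implement the FromPyObject trait, so extract can be used to pull a given type out of a Python object.
If the type is not known in advance, &PyAny can be used on the Rust side to receive the value.
Stricter type checks can also be expressed on the Rust side: declaring Vec<i32>, for example, makes the packaged function accept a list of integers.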

Python type	Rust type	Python type implemented by PyO3
`object`	-	`&PyAny`
`str`	`String`, `Cow<str>`, `&str`	`&PyUnicode`
`bytes`	`Vec<u8>`, `&[u8]`	`&PyBytes`
`bool`	`bool`	`&PyBool`
`int`	Any integer type (`i32`, `u32`, `usize`, etc)	`&PyLong`
`float`	`f32`, `f64`	`&PyFloat`
`complex`	`num_complex::Complex`	`&PyComplex`
`list[T]`	`Vec<T>`	`&PyList`
`dict[K, V]`	`HashMap<K, V>`, `BTreeMap<K, V>`, `hashbrown::HashMap<K, V>`	`&PyDict`
`tuple[T, U]`	`(T, U)`, `Vec<T>`	`&PyTuple`
`set[T]`	`HashSet<T>`, `BTreeSet<T>`, `hashbrown::HashSet<T>`	`&PySet`
`frozenset[T]`	`HashSet<T>`, `BTreeSet<T>`, `hashbrown::HashSet<T>`	`&PyFrozenSet`
`bytearray`	`Vec<u8>`	`&PyByteArray`
`slice`	-	`&PySlice`
`type`	-	`&PyType`
`module`	-	`&PyModule`
`datetime.datetime`	-	`&PyDateTime`
`datetime.date`	-	`&PyDate`
`datetime.time`	-	`&PyTime`
`datetime.tzinfo`	-	`&PyTzInfo`
`datetime.timedelta`	-	`&PyDelta`
`typing.Optional[T]`	`Option<T>`	-
`typing.Sequence[T]`	`Vec<T>`	`&PySequence`
`typing.Iterator[Any]`	-	`&PyIterator`
`typing.Union[...]`	See `#[derive(FromPyObject)]`	-
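
Seen from the Python side, a parameter declared as `Vec<i32>` behaves roughly like the pure-Python stand-in below. This is an emulation for illustration only: no Rust code is involved, `as_i32_list` is a hypothetical name, and it assumes that PyO3 raises TypeError for elements that are not integers and OverflowError for values outside the i32 range. The exact messages of a real extension will differ.

```python
I32_MIN, I32_MAX = -2**31, 2**31 - 1


def as_i32_list(values):
    """Pure-Python stand-in for extracting a Python list into `Vec<i32>`."""
    result = []
    for item in values:
        # Emulates the failed extraction of a non-integer element.
        if not isinstance(item, int):
            raise TypeError("expected int, got {}".format(type(item).__name__))
        # Emulates the range check implied by the i32 type.
        if not I32_MIN <= item <= I32_MAX:
            raise OverflowError("value does not fit in i32")
        result.append(item)
    return result


print(as_i32_list([1, 2, 3]))
for bad in (["1", 2], [2**40]):
    try:
        as_i32_list(bad)
    except (TypeError, OverflowError) as exc:
        print(type(exc).__name__, exc)
```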

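Mapping between Rust return types and Python types

If a function can fail, it should return `PyResult<T>` or `Result<T, E>`, where `E` implements `From<E> for PyErr`. If the `Err` variant is returned, a Python exception is raised.

Rust type	Resulting Python type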
`String`	`str`
`&str`	`str`
`bool`	`bool`
Any integer type (`i32`, `u32`, `usize`, etc)	`int`
`f32`, `f64`	`float`
`Option<T>`	`Optional[T]`
`(T, U)`	`Tuple[T, U]`
`Vec<T>`	`List[T]`
`HashMap<K, V>`	`Dict[K, V]`
`BTreeMap<K, V>`	`Dict[K, V]`
`HashSet<T>`	`Set[T]`
`BTreeSet<T>`	`Set[T]`
`&PyCell<T: PyClass>`	`T`
`PyRef<T: PyClass>`	`T`
`PyRefMut<T: PyClass>`	`T`

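Summary

Prefer the methods Python itself provides. They are implemented in C and are already close to the metal.
Minimize memory copies. If you need to call an external dynamic library, prepare the data outside and pass it to the library in one go.
If the data volume is small and the processing is complex, handle it in Python. Only when the data volume is large and the processing is complex should you consider a dynamic library for acceleration.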
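
The advice about passing data in one go can be sketched as follows. This is an illustration under assumptions: `native_count` is a hypothetical stand-in for a call into a dynamic library (here it simply wraps the built-in str.count), and the two strategies differ only in how often that boundary is crossed.

```python
import timeit


def native_count(text: str) -> int:
    """Stand-in for a call into a native library (here: built-in str.count)."""
    return text.count("0")


pieces = [str(i) for i in range(100_000)]


def call_per_item() -> int:
    # One boundary crossing per small piece of data.
    return sum(native_count(p) for p in pieces)


def call_once() -> int:
    # Prepare the data first, then cross the boundary a single time.
    return native_count("".join(pieces))


print(call_per_item() == call_once())
print("per item: {:.4f}s".format(min(timeit.repeat(call_per_item, number=3, repeat=3))))
print("batched:  {:.4f}s".format(min(timeit.repeat(call_once, number=3, repeat=3))))
```

With a real extension module the per-call cost also includes argument conversion, so the gap between the two strategies may be larger than in this stand-in.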