* fix: use sse instead of sse4
* fix: use dispatch
* fix: remove lzero

Co-authored-by: liuqiang <liuqiang.06@bytedance.com>