Slow php 8.0.26 performance

@bluesky please provide more details about the enabled PHP extensions and which specific tests from the benchmark script bring this variation (math, string, loops, if-else)

In 3.16 PHP is built with -O2 (changed from -Os), so I'm curious which flags are used on CentOS

The differences are in Math ( 7.505 sec vs 10.923 sec ) and String ( 18.200 sec vs 26.360 sec ); the others are on par. phpinfo() on CentOS does not give the compiler options used ( as it does on Alpine ) - the only thing I have there is gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3), which was used to compile it ( it comes from Remi's RPM repository ).

Enclosed are the php -m output for both systems: modules.php.centos.txt modules.php.alpine.txt

Here is an interesting site I found that has been running the same type of test for many versions of PHP - does not help with the issue here, but good info to have - link

Thank you. Math in PHP uses assembly, so it should not depend on the compiler (otoh 8 vs 12 could have an effect). Strings are not clear, but the difference could well be caused by musl vs glibc

I checked with edge 8.1 and 8.2 (using clang) and don't see much difference in the numbers, so external dependencies can probably have an effect too

Thanks for the quick reply and testing.

I have run a few more tests - on a newer system with Deb11, Rocky9, Ubuntu 22.04, Alpine 3.16 and a few other containers ( all containers are based on lxc code from http://images.linuxcontainers.org ). I installed php 8.0 and 8.1 and in all cases Alpine was slower by about 30-40% - on the glibc based systems the numbers were very close ( to each other and between 8.0 and 8.1 ), as we would expect running on the same hardware.

In the end unless a benchmark is very comprehensive, it is not a true measure of performance - I have seen benchmarks of Wordpress where PHP 8.x is making a good difference - what I have not seen is a comparison of Wordpress running on Debian, CentOS, Ubuntu or another based on glibc vs an Alpine based Wordpress install on the same hardware.

Thanks, great research! Synthetic tests are never fully accurate, but a delta this high is a signal that something is working wrong.

Real web apps have different bottlenecks because of database and network connections; it also depends somewhat on the app.

I'm going to dig into how to profile math and strings, as they look like the most viable candidates

I did some comparison of the string functions using the php docker images.

func	alpine	deb
addslashes	0.387	0.184
chunk_split	0.485	0.277
metaphone	2.108	1.098
strip_tags	0.683	0.555
md5	1.093	1.054
sha1	1.737	1.357
strtoupper	0.737	0.396
strtolower	0.157	0.158
strrev	0.315	0.318
strlen	0.058	0.059
soundex	0.370	0.250
ord	0.083	0.085

I also searched for how many times those functions are used in WordPress 6.1.1:

$ for i in addslashes chunk_split metaphone strip_tags md5 sha1 strtoupper strtolower strrev strlen soundex ord; do echo -e "$(ag -w "$i" | wc -l)\t$i"; done | sort -n
0	metaphone
1	soundex
6	chunk_split
16	addslashes
17	strrev
69	sha1
81	strip_tags
100	strtoupper
179	md5
241	ord
275	strtolower
855	strlen

So not all functions have degraded perf, but others are really ~40% slower :(

Using another script, https://onlinephp.io/benchmarks/script (attaching the source): detailed_benchmark.php

It was run on php81 (8.1.13) with php -ddisplay_errors=0 detailed_benchmark.php

The biggest difference is in string functions: str_replace, sprintf, strcmp, implode, long2ip, strftime, strtoupper (some take twice as long or more)

func	alpine	deb
for	0.01195 sec	0.00695 sec
while	0.00595 sec	0.00674 sec
if else	0.03080 sec	0.03358 sec
switch	0.02900 sec	0.03986 sec
Ternary	0.02549 sec	0.03728 sec
str_replace	0.81203 sec	0.31503 sec
preg_replace	0.00399 sec	0.00497 sec
preg_match	0.09798 sec	0.12295 sec
count	0.01026 sec	0.01257 sec
isset	0.02911 sec	0.03083 sec
time	0.04614 sec	0.04154 sec
strlen	0.01393 sec	0.01255 sec
sprintf	0.42592 sec	0.18956 sec
strcmp	0.06557 sec	0.03141 sec
trim	0.07540 sec	0.05617 sec
explode	0.00609 sec	0.00673 sec
implode	0.48798 sec	0.15918 sec
number_format	0.37718 sec	0.33568 sec
floor	0.02241 sec	0.02998 sec
strpos	0.04009 sec	0.03587 sec
substr	0.05818 sec	0.04806 sec
intval	0.03836 sec	0.04879 sec
(int)	0.03756 sec	0.05423 sec
is_array	0.01615 sec	0.02882 sec
is_numeric	0.04899 sec	0.06019 sec
is_int	0.01452 sec	0.02510 sec
is_string	0.01497 sec	0.01711 sec
ip2long	0.03642 sec	0.04983 sec
long2ip	0.76732 sec	0.31315 sec
date	0.00924 sec	0.00884 sec
strftime	0.07319 sec	0.00613 sec
strtotime	0.01697 sec	0.01168 sec
strtolower	0.07774 sec	0.07641 sec
strtoupper	0.16179 sec	0.08379 sec
md5	0.19699 sec	0.22774 sec
unset	0.03517 sec	0.03545 sec
list	0.05008 sec	0.05567 sec
urlencode	0.13189 sec	0.15965 sec
urldecode	0.23370 sec	0.27844 sec
addslashes	0.15848 sec	0.10953 sec
stripslashes	0.09783 sec	0.09020 sec
Total	4.893 sec	3.298 sec

fixed final numbers - Alpine is around 5s but deb around 3s
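
To see which rows drive the gap, the per-function ratio can be computed directly from the table. A minimal Python sketch using a handful of rows copied from the table above; the names `timings`, `totals` and `slowdown` are made up for illustration:

```python
# Timings in seconds as (alpine, deb), copied from the table above.
timings = {
    "str_replace": (0.81203, 0.31503),
    "sprintf": (0.42592, 0.18956),
    "strcmp": (0.06557, 0.03141),
    "implode": (0.48798, 0.15918),
    "long2ip": (0.76732, 0.31315),
    "strftime": (0.07319, 0.00613),
    "strtoupper": (0.16179, 0.08379),
    "md5": (0.19699, 0.22774),
}
# Overall totals from the last row of the table.
totals = (4.893, 3.298)


def slowdown(alpine, deb):
    """Return how many times longer Alpine takes than Debian."""
    return alpine / deb


# Sort function names from the largest slowdown to the smallest.
ranked = sorted(timings, key=lambda name: slowdown(*timings[name]), reverse=True)

for name in ranked:
    print(f"{name:12s} {slowdown(*timings[name]):5.2f}x")
print(f"{'total':12s} {slowdown(*totals):5.2f}x")
```

A ratio below 1 means Alpine was faster on that row.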

Good job in getting those numbers.

One suggestion I would have is to increase the test run times ( by at least a factor of 10 ). Even if you have a system with many cores and it can be dedicated, there are still things happening in the background - for example Intel has hyper-threading, which itself has overheads - I know it is a pain to wait for the numbers, but by having longer runs you will get values that are more representative, and have a better chance to factor out other variables. The simplest check is to run your test a number of times - given that the computations are deterministic, your runs should all take about the same time. That said I have not looked at the PHP microtime() function used to time the tests, so maybe this is not a factor.
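
The repeat-and-compare idea can be sketched like this. It is written in Python rather than PHP purely for illustration; `workload`, `time_runs` and the repeat count are assumptions and not part of the benchmark script discussed here.

```python
import statistics
import time


def workload(n=200_000):
    """Stand-in for one benchmark section (hypothetical)."""
    total = 0
    for i in range(n):
        total += len(str(i).upper())
    return total


def time_runs(func, repeats=5):
    """Call func several times and return each run's wall-clock duration."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


runs = time_runs(workload)
best = min(runs)
median = statistics.median(runs)
# Relative gap between the slowest and the fastest run.
spread = (max(runs) - best) / best

print(f"min {best:.4f}s  median {median:.4f}s  spread {spread:.1%}")
```

If the spread between runs is large, background noise is probably dominating and longer runs are worth the wait.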

I am happy that you have been able to confirm my findings - using your numbers it looks like Alpine is 48% slower for the overall test - it would be interesting to take @ncopa's counts, multiply them by the times from Alpine and Deb, compare the two, and then compare that to the requests/sec numbers - as I mentioned before there are many of those posted, but I have not seen one that compares OS to OS - in most cases it is PHP version to version for many different applications. Good luck.
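
One way to do the multiplication suggested here, as a rough Python sketch. The counts and times are copied from @ncopa's two tables earlier in the thread; note that the counts are occurrences in the WordPress source, not runtime call frequencies, so the result is only a crude weighting. The names `counts`, `times` and `weighted_total` are made up for illustration.

```python
# Occurrences of each function in the WordPress 6.1.1 source (from the ag search).
counts = {
    "metaphone": 0, "soundex": 1, "chunk_split": 6, "addslashes": 16,
    "strrev": 17, "sha1": 69, "strip_tags": 81, "strtoupper": 100,
    "md5": 179, "ord": 241, "strtolower": 275, "strlen": 855,
}

# Benchmark times as (alpine, deb) from the php docker image comparison.
times = {
    "addslashes": (0.387, 0.184), "chunk_split": (0.485, 0.277),
    "metaphone": (2.108, 1.098), "strip_tags": (0.683, 0.555),
    "md5": (1.093, 1.054), "sha1": (1.737, 1.357),
    "strtoupper": (0.737, 0.396), "strtolower": (0.157, 0.158),
    "strrev": (0.315, 0.318), "strlen": (0.058, 0.059),
    "soundex": (0.370, 0.250), "ord": (0.083, 0.085),
}


def weighted_total(index):
    """Sum count * time over all functions; index 0 is alpine, 1 is deb."""
    return sum(counts[name] * times[name][index] for name in counts)


alpine_total = weighted_total(0)
deb_total = weighted_total(1)
ratio = alpine_total / deb_total

print(f"alpine {alpine_total:.3f}  deb {deb_total:.3f}  ratio {ratio:.2f}")
```

Because the most frequently occurring functions (strlen, strtolower, ord, md5) are the ones where the two systems are nearly on par, this weighting shrinks the gap compared to the unweighted benchmark totals.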

I think we need to profile specific functions (sprintf, implode) to find out what is the cause

@andypost isn't musl also weak on Python execution? What is the exact reason for that? Could it be related? Afaik the compilation time of the kernel is also slow due to the lack of compiler optimizations in BusyBox. I've read that PHP's JIT got a lot of improvements in version 8, which could further contribute to these issues.

> Afaik the compilation time of the kernel is also slow due to the lack of compiler optimizations in BusyBox.

i'm not sure how those two are related at all. do you mean busybox awk being slow (noticeably for x86-only kernel builds)? gawk/mawk don't have that issue

Unfortunately, I don't know. I just learned about kernel compilation and Python being way slower from YouTube reviews. PHP has "improved" their JIT portion on v8, so it could be related, but who am I?

I'm not deep enough into this topic, but I want to make use of PHP. Maybe I will set up a benchmark before and after installing glibc and the coreutils to see if there are major improvements or if there are issues in general.

There's WIP to speed up execution by removing unnecessary libc calls https://github.com/php/php-src/pull/10501

(Disclaimer: I don't work for the PHPF and I'm not a maintainer, I only contribute as a hobby and I don't speak for the maintainers)

For the metaphone function I was able to optimize it because there were a lot of redundant libc calls. Removing this redundancy improved the performance for both glibc-based and musl-based systems, and makes the performance gap between musl and glibc smaller (although there will likely still be a small gap).

I looked at optimizing some of the other functions listed here. The string functions very often make calls to memcpy & memchr. Under glibc these are accelerated using SSE, AVX, or other specialised instructions which make them very fast. The musl implementation does not do that and is therefore quite a bit slower. I can clearly see on Alpine that for example for str_replace: 18% of the time is spent in memchr and 22% is spent in memcpy.

I just did a simple experiment by changing the memcpy in str_replace to an inline assembly routine using "rep movsb" (specialised instruction which is very fast on my CPU because my CPU is newer than Ivy Bridge). On Alpine 32-bit I got an average of 0.354s with my changes versus 0.591s without changes. A big speedup.

In theory PHP could provide certain optimized memory functions on some system configurations in order to achieve this kind of speedup. However, I think that implementing these accelerations in musl would be more beneficial, as more applications can benefit from this. Also note that while using "rep movsb" instead of musl's memcpy was a lot faster, using "rep movsb" instead of memset on my glibc-based distro was quite a bit slower. If we end up making these kinds of changes to PHP, we'd have to take the platform into consideration.

> implementing these (SIMD) accelerations into musl

this most likely wouldn't be implemented (going by feeling), though i sadly don't have a link to mailing thread handy for the rationale.

aside from that, there isn't something specifically bottlenecked here. applications don't exactly spend all their runtime in str_replace and so on, though across the board there is then still some percentile improvement when you add everything up.

which leads to the real issue: that there is no issue. this is not a "i ran php on alpine, and it got an expected runtime of until the heat death of the universe" or "it consumed all my cpu for 6 hours" or "php deadlocked doing xyz" or anything else. it's "i ran some completely synthetic benchmarks not representative of any real-world application, that call a function in a loop, and the functions are X% (for small values of X, nothing like a huge 10x factor) slower than glibc". that is to say, there is nothing to fix here, this is not a bug report or a feature request, or.. anything. none of the numbers are even particularly shocking or interesting, except perhaps str_replace itself indeed (or for).

if you want to "improve synthetic benchmark performance compared to glibc when calling musl functions", then you can report that and/or implement it in musl itself, on the mailing list: https://www.openwall.com/lists/musl/ . (or put in other words- there is nothing alpine as a distro can do here).

if you think there is something miscompiled (e.g. completely wrong configuration, etc) for php itself here in alpine (that causes some specific, actionable, issues), then open an issue for that, and perhaps it can be worked on.

closed

unassigned @andypost

@psykose I respect your decision to close this, because I also believe this is a fundamental issue on musl and not Alpine, however I think Alpine still can do something about it.

Apparently a system admin named Emerson Gomes spent some time analyzing the allocator performance of musl versus glibc on Alpine, which also shows terrible results: https://www.linkedin.com/pulse/testing-alternative-c-memory-allocators-pt-2-musl-mystery-gomes/

He then tested malloc-ng that will be part of musl eventually, which improves things, but only a little. But then he compiled and used mimalloc together with musl, and now his benchmarks perform better than with glibc.

I'd welcome seeing mimalloc as a part of the main Alpine repository, as it would give a lot of people an easy option to improve performance noticeably.

> He then tested malloc-ng that will be part of musl eventually

malloc-ng is musl's allocator since 1.2.1

> I'd welcome to see mimalloc as a part of the Alpine as it would give a lot of people easy options to improve performance noticeably.

you can LD_PRELOAD anything you want yourself, mimalloc is already packaged
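
A minimal sketch of launching a benchmark with a preloaded allocator, written in Python to wrap the process launch. The library path in the usage comment is a placeholder, not a verified Alpine path; check where the mimalloc package installs its shared object on your system. `preload_env` and `run_preloaded` are hypothetical helper names.

```python
import os
import subprocess


def preload_env(lib_path, base_env=None):
    """Return a copy of the environment with LD_PRELOAD set to lib_path."""
    env = dict(os.environ if base_env is None else base_env)
    env["LD_PRELOAD"] = lib_path
    return env


def run_preloaded(cmd, lib_path):
    """Run cmd with the given shared library preloaded by the dynamic loader."""
    return subprocess.run(cmd, env=preload_env(lib_path), check=True)


# Hypothetical usage; the library path is a placeholder:
# run_preloaded(["php", "detailed_benchmark.php"], "/path/to/libmimalloc.so")
```

Running the same script once with and once without the preload, on the same machine, gives a direct before/after comparison of the allocator alone.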

mentioned in merge request !47440 (merged)

mentioned in issue #15114
