igrunert's comments | Hacker News

WebKit on Windows has progressed a lot in the last ~5 years. The gap between the Windows port and the Linux WPE/GTK ports is shrinking over time.

Every JIT tier has been enabled for JSC on Windows[1], and libpas (the custom memory allocator) is enabled as well.

The Windows port has moved from Cairo to Skia, though it's currently using the CPU renderer AFAIK. There's ongoing work to enable the COORDINATED_GRAPHICS flag, which would let Windows benefit from Igalia's ongoing work on improving the render pipeline for the Linux ports. I go into more detail in my latest update [2], though the intended audience is really other WebKit contributors.

WebKit's CI (EWS) is running the layout tests on Windows; running more tests on Windows is mostly a matter of test pruning, bug fixes, and funding additional hardware.

There are a few things still disabled on the Windows port, some rough edges, and not a lot of production use (Bun and Playwright are the main users I'm aware of). The Windows port really needs more people (and companies) pushing it forward. Hopefully Kagi will contribute improvements to the Windows port upstream as they work on Orion for Windows.

[1] https://iangrunert.com/2024/10/07/every-jit-tier-enabled-jsc... [2] https://iangrunert.com/2025/11/06/webkit-windows-port-update...


The layout tests on Windows are failing at the moment due to a regression introduced a few days ago (likely https://commits.webkit.org/301043@main).

The easiest way to run WebKit on Windows is via Playwright.


A CMPXCHG16B instruction is going to be faster than a function call, and even if the function is inlined there's still a binary size cost.

The last processor without the CMPXCHG16B instruction was released in 2006, as far as I can tell. Windows 8.1 64-bit had a hard requirement on the CMPXCHG16B instruction, and that was released in 2013 (and is no longer supported as of 2023). At minimum Firefox should be building with -mcx16 for the Windows builds - it's a hard requirement for the underlying operating system anyway.
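To illustrate what -mcx16 buys (a minimal sketch; exact codegen varies by compiler version): with the flag, recent GCC/Clang can lower a 16-byte compare-and-swap to an inline lock cmpxchg16b, while without it the same code typically becomes a call into libatomic.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* A 16-byte value: e.g. pointer + tag for an ABA-safe CAS. */
    typedef struct { void *ptr; unsigned long tag; } pair_t;

    bool cas_pair(_Atomic pair_t *loc, pair_t expected, pair_t desired) {
        /* With -mcx16 this can compile to a single lock cmpxchg16b;
           without it, it's a library call into libatomic (behavior
           varies by compiler version). */
        return atomic_compare_exchange_strong(loc, &expected, desired);
    }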


Let me play devil's advocate: for some reason, functions such as strcpy in glibc have multiple runtime implementations, which are selected by the dynamic linker at load time.
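For context, the mechanism is GNU ifunc resolution (GCC/Clang on ELF targets). A minimal sketch with hypothetical my_strcpy_* names - the resolver runs once when the dynamic linker binds the symbol, not on every call:

    static char *my_strcpy_generic(char *dst, const char *src) {
        char *ret = dst;
        while ((*dst++ = *src++) != '\0') {}
        return ret;
    }

    static char *my_strcpy_avx2(char *dst, const char *src) {
        /* stand-in: a real build would use AVX2 loads/stores here */
        return my_strcpy_generic(dst, src);
    }

    /* Resolver: runs at load time, returns the implementation to bind. */
    static char *(*resolve_strcpy(void))(char *, const char *) {
        __builtin_cpu_init();
        return __builtin_cpu_supports("avx2") ? my_strcpy_avx2
                                              : my_strcpy_generic;
    }

    char *my_strcpy(char *dst, const char *src)
        __attribute__((ifunc("resolve_strcpy")));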


And there's a performance cost to that. If there was only one implementation of strcpy and it was the version that happens to be picked on my particular computer, and that implementation was in a header so that it could be inlined by my compiler, my programs would execute faster. The downside would be that my compiled program would only work on CPUs with the relevant instructions.

You could also have only one implementation of strcpy and use no exotic instructions. That would also be faster for small inputs, for the same reasons.

Having multiple implementations of strcpy selected at runtime optimizes for binary portability between different CPUs and for performance on long inputs, at the cost of performance on short inputs. Maybe this makes sense for strcpy, but it doesn't make sense for all functions.


> my programs would execute faster

You can't really state this with any degree of certainty when talking about whole-program optimization and function inlining. Even with LTO today you're talking 2-3% overall improvement in execution time, without getting into the tradeoffs.


Typically, making it possible for the compiler to decide whether or not to inline a function is going to make code faster compared to disallowing inlining. Especially for functions like strcpy which have a fairly small function body and therefore may be good inlining targets. You're right that there could be cases where the inliner gets it wrong. Or even cases where the inliner got it right but inlining ended up shifting around some other parts of the executable which happened to cause a slow-down. But inliners are good enough that, in aggregate, they will increase performance rather than hurt it.

> Even with LTO today you're talking 2-3% overall improvement in execution time

Is this comparing inlining vs no inlining or LTO vs no LTO?

In any case, I didn't mean to imply that the difference is large. We're literally talking about a couple clock cycles at most per call to strcpy.


What I was trying to point out is that you're essentially talking about LTO. Getting into the weeds, the compiler _can't_ optimize strcpy(*) in practice because it's not going to be defined in a header-only library; it's going to be in a different translation unit that gets either dynamically or statically linked. The only way to optimize the function call is with LTO - and in practice, LTO only accounts for a 2-3% performance improvement.

And at execution time, there is no meaningful difference between strcpy being linked dynamically or ahead of time. libc symbols get loaded first by the loader, and after relocation the instruction sequence is identical to that of a statically linked binary. There is a tiny difference in startup time, but it's negligible.

Essentially, the C compilation and linkage model makes it impossible for functions like strcpy to be optimized beyond the point of a function call. The compiler often has exceptions for hot stdlib functions (like memcpy, strcpy, and friends) where it will emit an optimized sequence for the target, but this is the exception that proves the rule. In practice, statically linking in dependencies (like you're talking about) does not have a meaningful performance benefit in my experience.

(*) strcpy is weird; like many libc functions it's accessible via __builtin_strcpy in gcc, which may (but probably won't) emit a different sequence of instructions than the call to libc. I say "probably" because there are semantics undefined by the C standard that the compiler cannot reason about but the linker must support, like preloads and injection. In these cases symbols cannot be inlined, because it would break the ability of someone to inject a replacement for the symbol at runtime.
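To make the builtin point concrete (illustrative only; exact codegen depends on compiler and flags): with optimization enabled, a strcpy from a string literal is usually expanded inline, while a copy from an unknown source still compiles to a real call.

    #include <string.h>

    void known(char buf[16]) {
        strcpy(buf, "hi");   /* typically a couple of inline stores */
    }

    void unknown(char buf[16], const char *s) {
        strcpy(buf, s);      /* typically an actual call to strcpy@plt */
    }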


> What I was trying to point out is that you're essentially talking about LTO. Getting into the weeds, the compiler _can't_ optimize strcpy(*) in practice because its not going to be defined in a header-only library, it's going to be in a different translation unit that gets either dynamically or statically linked.

Repeating the part of my post that you took issue with:

> If there was only one implementation of strcpy and it was the version that happens to be picked on my particular computer, and that implementation was in a header so that it could be inlined by my compiler, my programs would execute faster.

So no, I'm not talking about LTO. I'm talking about a hypothetical alternate reality where strcpy is in a glibc header so that the compiler can inline it.

There are reasons why strcpy can't be in a header, and the primary technical one is that glibc wants the linker to pick between many different implementations of strcpy based on processor capabilities. I'm discussing the loss of inlining as a cost of having many different implementations picked at dynamic link time.


AFAIK runtime linkers can't convert a function call into a single non-call instruction.


The Linux kernel has an interesting optimization using the ALTERNATIVE macro, where you can directly specify one of two instruction sequences and the kernel will patch in the right one at boot depending on CPU feature flags. No function calls needed (although you can have a function call as one of the instructions). It's a bit messier in userspace, where you have to respect platform page flags, etc., but it should be possible.
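A rough sketch of the pattern, modeled on the kernel's x86 headers (simplified; the real ALTERNATIVE macro in <asm/alternative.h> carries padding and section metadata):

    /* Kernel context only: needs <asm/alternative.h> and
       <asm/cpufeatures.h>. Both instruction sequences are emitted
       into the image; at boot, apply_alternatives() patches in the
       second one if the CPU sets the CLFLUSHOPT feature bit. */
    static inline void flush_line(void *p)
    {
        asm volatile(ALTERNATIVE("clflush %0",
                                 "clflushopt %0",
                                 X86_FEATURE_CLFLUSHOPT)
                     : "+m" (*(volatile char *)p));
    }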


They could always just make the updater/installer install a version optimized for the CPU it's going to be installed on.


It's not that uncommon to run one system on multiple CPUs. People swap out the CPU in their desktops, people move a drive from one laptop to another, people make bootable USB sticks, people set up a system in a chroot on a host machine and then flash a target machine with the resulting image.


Detect that on launch and use the updater to reinstall.


Congratulations, you now need to make sure your on-launch detector is compatible with the lowest common denominator, while at the same time being able to detect modern architectures. You also now carry 10 different instances of firefox.exe to support people eventually running on Itanium - people who will open support requests and expect you to fix their abandoned platform.

For what reason, exactly?

You want 32-bit x86 support: pay for it. You want <obscure architecture> support: pay for it. If you're OK with it being a fork, then maintain it.


The "Chrome Commit Tracker" linked is a pretty interesting set of visualizations that I hadn't come across before. Makes it a lot easier to get a feel for the sizes of the various teams, and how they change over time.

https://chrome-commit-tracker.arthursonzogni.com/organizatio...


That's crazy. I knew Google's would be a lot bigger, but DANG.


WebKit runs on Windows; the Windows port just needs work to bring it up to the level of the Linux port. I got every JIT tier enabled in JavaScriptCore [1] and enabled libpas (the memory allocator). The Windows port is moving to Skia in line with the Linux port.

Really just needs more people (and companies) pushing it forward. Hopefully Kagi will be contributing improvements to the Windows port upstream.

[1] https://iangrunert.com/2024/10/07/every-jit-tier-enabled-jsc...


I recently ported WebKit's libpas memory allocator[1] to Windows; it used pthreads on the Linux and Darwin ports. Depending on what pthreads features you're using, it's not that much code to shim to Windows APIs. It's around ~200 LOC[2] for WebKit's usage, which is a lot smaller than pthread-win32.

[1] https://github.com/WebKit/WebKit/pull/41945 [2] https://github.com/WebKit/WebKit/blob/main/Source/bmalloc/li...
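The gist of such a shim, as a hypothetical minimal sketch (not the actual libpas code): map pthread-style mutexes and condition variables onto the Vista+ SRW lock APIs.

    #include <windows.h>

    typedef SRWLOCK            shim_mutex_t;  /* init: SRWLOCK_INIT */
    typedef CONDITION_VARIABLE shim_cond_t;   /* init: CONDITION_VARIABLE_INIT */

    static void shim_mutex_lock(shim_mutex_t *m)   { AcquireSRWLockExclusive(m); }
    static void shim_mutex_unlock(shim_mutex_t *m) { ReleaseSRWLockExclusive(m); }

    static void shim_cond_wait(shim_cond_t *c, shim_mutex_t *m) {
        /* Equivalent of pthread_cond_wait: atomically releases the
           lock, sleeps, and re-acquires it before returning. */
        SleepConditionVariableSRW(c, m, INFINITE, 0);
    }
    static void shim_cond_signal(shim_cond_t *c)    { WakeConditionVariable(c); }
    static void shim_cond_broadcast(shim_cond_t *c) { WakeAllConditionVariable(c); }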


At the time (11 years ago) I wanted this to run on Windows XP.

The APIs you use there (e.g. SleepConditionVariableSRW()) were only added in Vista.

I assume a big chunk of pthread emulation code at that time was implementing things like that.


These VirtualAlloc calls may intermittently fail if the pagefile is growing...


Ah yeah, I see Firefox ran into that and added retries:

https://hacks.mozilla.org/2022/11/improving-firefox-stabilit...

Seems like a worthwhile change, though I'm not sure when I'll get around to it.
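A hedged sketch of that approach (retry counts and delays are made-up placeholders): retry a failed commit a few times with backoff before treating it as a real out-of-memory.

    #include <windows.h>

    static void *commit_with_retry(void *addr, size_t size) {
        for (int attempt = 0; attempt < 3; attempt++) {
            void *p = VirtualAlloc(addr, size, MEM_COMMIT, PAGE_READWRITE);
            if (p != NULL)
                return p;
            /* Transient failure, e.g. the pagefile is still growing:
               back off briefly and try again. */
            Sleep(10 << attempt);
        }
        return NULL;  /* treat as a genuine out-of-memory condition */
    }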


This is something you also need to do for other Win32 APIs - e.g. file write access may be temporarily blocked by anti-virus programs or whatever - and not handling that makes for unhappy users.


Never knew about the destructor feature for fiber-local allocations!
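For anyone else who hadn't seen it: FlsAlloc takes an optional callback that runs when a fiber is deleted or its thread exits, analogous to pthread_key_create's destructor. A minimal sketch:

    #include <windows.h>

    /* Runs on fiber deletion or thread exit, receiving the slot's
       stored value (only called if the value is non-NULL). */
    static void WINAPI fls_destructor(void *value) {
        HeapFree(GetProcessHeap(), 0, value);
    }

    void init_slot(void) {
        /* Returns FLS_OUT_OF_INDEXES on failure. */
        DWORD slot = FlsAlloc(fls_destructor);
        (void)slot;
    }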


I think the author was happy to be employed by a megacorp, along with a team to push jemalloc forward.

He and the other previous contributors are free to find new employers to continue such an arrangement, if any are willing to make that investment. Alternatively they could cobble together funding from a variety of smaller vendors. I think the author is happy to move on to other projects, after spending a long time in this problem space.

I don’t think that “don’t let one megacorp hire a team of contributors for your FOSS project” is the lesson here. I’d say it’s a lesson in working upstream - the contributions made during their Facebook / Meta investment are available for the community to build upon. They could’ve just as easily been made in a closed source fork inside Facebook, without violating the terms of the license.

Also, Mozilla was unable to switch from its fork to the upstream version, and didn't easily benefit from the Facebook / Meta investment as a result.


The gap between the Windows and GTK ports is shrinking. Every JIT tier has been enabled for JSC on Windows[1], and libpas (the custom memory allocator) should get enabled soon.

The Windows port is moving from Cairo to Skia soon as well, matching the GTK port (though I think the initial focus is the CPU renderer).

WebKit's CI (EWS) is running the layout tests on Windows, and running more tests on Windows is mostly a matter of funding the hardware.

There are a few things still disabled on the Windows port, some rough edges, and not a lot of production use (Bun and Playwright are the main users). It'd definitely be more work than Linux, but it's not as bad as you'd think.

[1] https://iangrunert.com/2024/10/07/every-jit-tier-enabled-jsc...


That’s great to hear. The more web engines are practical to use across all major platforms the better.


While the modern web is complicated, there are a few things working in Ladybird's favor.

Web Platform Tests (1) make it significantly easier to test your compliance with W3C standards. You don't have to reverse engineer what other engines are doing all the time.

The standards documents themselves have improved over time and are relatively comprehensive at this point. Again, you don't have to reverse engineer what other engines are doing; the behavior is in the spec.

Ladybird has chosen not to add a JIT compiler for JS and Wasm, reducing complexity in the JS engine. They've already reached (or exceeded) other JS engines on the ECMAScript test suite, Test262 (2).

There's a big differential between the level of investment in Chromium and the other engines - in part because Chrome / Chromium are often doing R&D to build out new specifications, which is more work than implementing a completed specification. There's also a large amount of work that goes into security for all three major engines - which (for now) is less of a concern for Ladybird.

I'm confident that the Ladybird team will hit their goal of Summer 2026 for a first Alpha version on Linux and macOS. They'll cut a release with whatever they have at that point - it's already able to render a large swathe of the modern web, and continues to improve month-on-month.

(1) https://web-platform-tests.org/ (2) https://test262.fyi/


The Chromium codebase also implements requirements that you may not need to take on for just a web browser, e.g. all of the infrastructure to make it ChromeOS, including being a Wayland compositor and a lot of other stuff. The projects are somewhat apples to oranges.


Ladybird does have another slight advantage in that it only has an interpreter for JS and wasm, instead of maintaining multiple tiers of JIT compilation for both. That choice materially reduces the surface area for exploits.

