
"Better than JSON" is a pretty bold claim, and even though the article makes some great cases, the author is making some trade-offs that I wouldn't make, based on my 20+ year career and experience. The author makes a statement at the beginning: "I find it surprising that JSON is so omnipresent when there are far more efficient alternatives."

We might disagree on what "efficient" means. OP is focusing on computer efficiency, whereas, as you'll see, I tend to optimize for human efficiency (and, let's be clear, JSON is efficient _enough_ for 99% of computer cases).

I think the "human readable" part is often an overlooked pro by hardcore protobuf fans. One of my fundamental philosophies of engineering historically has been "clarity over cleverness." Perhaps the corollary to this is "...and simplicity over complexity." And I think protobuf, generally speaking, falls in the cleverness part, and certainly into the complexity part (with regards to dependencies).

JSON, on the other hand, is ubiquitous, human readable (clear), and simple (little-to-no dependencies).

I've found in my career that there's tremendous value in not needing to execute code to see what a payload contains. I've seen a lot of engineers (including myself, once upon a time!) take shortcuts like using bitwise values and protobufs and things like that to make things faster or to be clever or whatever. And then I've seen those same engineers, or perhaps their successors, find great difficulty in navigating years-old protobufs, when a JSON payload is immediately clear and understandable to any human, technical or not, upon a glance.

I write MUDs for fun, and one of the things that older MUD codebases do is that they use bit flags to compress a lot of information into a tiny integer. To know what conditions a player has (hunger, thirst, cursed, etc), you do some bit manipulation and you wind up with something like 31 that represents the player being thirsty (1), hungry (2), cursed (4), with haste (8), and with shield (16). Which is great, if you're optimizing for integer compression, but it's really bad when you want a human to look at it. You have to do a bunch of math to sort of de-compress that integer into something meaningful for humans.
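For illustration, here's a minimal sketch (Python, with flag names invented to match the example above) of that bit-flag encoding and the "decompression" a human-readable format would spare you:

  # Hypothetical condition flags, mirroring the example above
  THIRSTY = 1 << 0  # 1
  HUNGRY  = 1 << 1  # 2
  CURSED  = 1 << 2  # 4
  HASTE   = 1 << 3  # 8
  SHIELD  = 1 << 4  # 16

  FLAGS = {"thirsty": THIRSTY, "hungry": HUNGRY, "cursed": CURSED,
           "haste": HASTE, "shield": SHIELD}

  def decode(conditions: int) -> list[str]:
      # Bitwise AND against each flag to recover the human-readable list
      return [name for name, bit in FLAGS.items() if conditions & bit]

  print(decode(31))  # ['thirsty', 'hungry', 'cursed', 'haste', 'shield']

The JSON equivalent ({"thirsty": true, "hungry": true, ...}) needs no decoder at all; anyone can read it at a glance.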

Similarly with protobuf, I find that it usually optimizes for the wrong thing. To be clear, one of my other fundamental philosophies about engineering is that performance is king and that you should try to make things fast, but there are certainly diminishing returns, especially in codebases where humans interact frequently with the data. Protobufs make things fast at a cost, and that cost is typically clarity and human readability. Versioning also creates more friction. I've seen teams spend an inordinate amount of effort trying to ensure that both the producer and consumer are using the same versions.

This is not to say that protobufs are useless. It's great for enforcing API contracts at the code level, and it provides those speed improvements OP mentions. There are certain high-throughput use-cases where this complexity and relative opaqueness is not only an acceptable trade off, but the right one to make. But I've found that it's not particularly common, and people reaching for protobufs are often optimizing for the wrong things. Again, clarity over cleverness and simplicity over complexity.

I know one of the arguments is "it's better for situations where you control both sides," but if you're in any kind of team with more than a couple of engineers, this stops being true. Even if your internal API is controlled by "us," that "us" can sometimes span 100+ engineers, and you might as well consider it a public API.

I'm not a protobuf hater, I just think that the vast majority of engineers would go through their careers without ever touching protobufs, never miss it, never need it, and never find themselves where eking out that extra performance is truly worth the hassle.


If you want human readable, there are text representations of protobuf for use at rest (checked in config files, etc.) while still being more efficient over the wire.

In terms of human effort, a strongly typed schema rather than one where you have to sanity check everything saves far more time in the long run.


Great writing, thanks. There are, of course, two sides, as always. I think that especially for larger teams and large projects, Protobuf in conjunction with gRPC can play well with its backwards-compatibility features, which make it very hard to break things.


Yes to all of this.

Also the “us” is ever-changing in a large enough system. There are always people joining and leaving the team. Always, many people are approximately new, and JSON lets them discover more easily.


IIRC, AvatarMUD (avatar.outland.org) has 20,000+ rooms in it. It's been a long time since I played, but it's absolutely massive!


Yes, I remembered that some MUDs touted many more rooms. I've found Aarchon (15K) [1] and SlothMUD (23K) [2]. And that's just the number of rooms; the other numbers (8K mobs, 7K items, 12K NPCs) are massive as well.

But big MUDs usually have "builder" teams, so the comparison is unfair. Even so, these numbers have hardly been matched by game studios - WoW or Runescape come to mind. And then there's Dwarf Fortress, which reaches effectively infinite numbers in some categories thanks to procedural generation.

[1] https://www.aarchonmud.com/arc/features

[2] https://www.mudportal.com/listings/by-genre/hack-slash/item/...


Yah, Aardwolf (the one I played in ancient times) apparently has 35,000 rooms.

Collaborative building over decades adds up!


I'm also deaf, and I took 14 years of speech therapy. I grew up in Alabama. The only way you would know I'm from the South is because of the pin-pen merger[1]. Otherwise, you'd think I grew up in the American Midwest, due to how my speech therapy went. Almost nobody picks up on it, unless they are linguists that already knew about the pin-pen merger.

[1] https://www.acelinguist.com/2020/01/the-pin-pen-merger.html


I’m aware of the merger, but I literally can’t hear a difference between the words. I certainly pronounce them the same way.

I also think merry-marry-Mary are all pronounced identically. The only way I can conceive of a difference between them is to think of an exaggerated Long Island accent, which, yeah, I guess is what makes it an accent.


That's exactly what the pin-pen merger is! As you know, it's not limited to pin/pen, and hearing ability (in my case, profound hearing loss) is not related to the ability to hear the difference. I don't understand the linguistics, but my very bad understanding is that there's actual brain chemistry here that means that you _can't_ hear the difference because you never learned it, never spoke it, and you pronounce them the same.

My partner is from the PNW and she pronounces "egg" as "ayg" (like "ayyyy-g"), but when I say "egg" she can't hear the difference between what I'm saying and what she says. And she has perfect hearing. But she CAN hear the difference between "pin" and "pen", and she gets upset when I say them the same way. lol

But yeah, that's one of the things that makes accents accents. It's not just the sounds that come out of our mouths but the way we hear things, too. Kinda crazy. :)


When I was listening to some of the samples on the page you linked (pronunciation of "when"), it really seemed to me like the difference they were highlighting was how much the "h" was pronounced. Even knowing what I was listening for, it was very much like my brain was just refusing to recognize the vowel-sound distinction. So I think you must be right about it being a matter of basic brain chemistry.

In the example of the reverse pen/pin merger (HMS Pinafore) on that page, I couldn’t hear “penafore” to save my life. Fascinating stuff.

I used to think of the movie “Fargo” and think “haha comical upper midwestern accents.” And then at some point I realized that the characters in “No Country for Old Men” probably must sound similarly ridiculous to anyone whose grandparents and great grandparents didn’t all speak with a deep, rural West Texas accent - which mine did, so watching the movie it just seemed completely natural for the place and time at a deeply subconscious level.


They are the same phoneme for me in US Eastern suburbia; the only difference is a subtle shift in the length that you drag it out. "merry" is faster than "marry", which is sometimes but not always faster than "Mary". Most UK accents seem to drag the proper name out an additional beat, and for some of them there's a slight pitch shift that sounds like "ma-ery", at its most extreme in Ireland (this is one early shibboleth by which I recognized Irish people before I really picked up on the other parts of the accent).


As someone with a German accent, to me the difference between merry and marry is the same as between German e (in this case ɛ in IPA) and ä (æ in IPA). Those two sounds are extremely close, but not quite the same. According to the Oxford dictionary, that is true in British English, while it shows the same pronunciation (ɛ) for both in American English.


This is WILD. I love it. Congrats on shipping!


Thank you! Shipping for the first time was definitely nerve-wracking. Really appreciate the positive feedback!


You should install it, because it's exactly what you just described.

Edit: From a UI perspective, it's exactly what you described. There's a dropdown where you select the LLM, and there's a ChatGPT-style chatbox. You just docker-up and go to town.

Maybe I don't understand the rest of the request, but I can't imagine software where a webpage just magically has LLMs available in the browser with no installation?


It doesn't seem exactly like what they are describing. The end-user interface is what they are describing but it sounds like they want the actual LLM to run in the browser (perhaps via webgpu compute shaders). Open WebUI seems to rely on some external executor like ollama/llama.cpp, which naturally can still be self-hosted but they are not executing INSIDE the browser.


Does that even exist? It's basically what they described but with some additional installation? Once you install it, you can select the LLM on disk and run it? That's what they asked for.

Maybe I'm misunderstanding something.


Apparently it does, though I'm learning about it for the first time in this thread also. Personally, I just run llama.cpp locally in docker-compose with anythingllm for the UI but I can see the appeal of having it all just run in the browser.

  https://github.com/mlc-ai/web-llm
  https://github.com/ngxson/wllama


Oh, interesting. Well, TIL.


> You should install it, because it's exactly what you just described.

Not OP, but it really isn't what they're looking for. Needing to install stuff vs. simply going to a web page are two very different things.


Probably because it's intentional. There are many theories why, but one might be that by saying "You're absolutely right," they are priming the LLM to agree with you and be more likely to continue with your solution than to try something else that might not be what you want.


I think you should ask yourself that last question again, and this time really think about it. Why /do/ all companies seem to have managers?

(Tone clarification: I'm not approaching this in a condescending manner, but more of a "let's talk through this problem out loud and see where it gets us." So please don't take this as condescension.)

One way to think about "obvious" solutions to problems, such as a "no manager" solution, is this: if it's so obvious, why is no one doing it? For example, I worked for a grocery delivery startup for a while. Every single new hire, without fail, would show up at the end of the first week and say "I have a great idea, why don't we let users shop by recipe?"

On its face, it sounds like a brilliant idea! One intuition-based shortcut to find the answer is: if that's such an obvious thing, why doesn't Amazon or Kroger or Safeway or HEB or any of the major grocery chains let you do that?

And of course, the answer is: that's not how users shop. If it worked, the big players would be doing it. They're not. Are you smarter than Amazon? Probably not. That's not to say a smaller group can't innovate past Amazon, but Amazon has some /really fucking smart people/ working for them, and the odds are fantastically small that you'll out-think them. (You can certainly out-_pivot_ them by doing something faster than they can, but if it turns out to be valuable, in the long run, they'll do it too.)

So when you approach a conversation like this and say, "Maybe just see what a group of seniorish people think?", one way to do a quick sanity check on it is: can you think of successful companies that are run that way?

You probably can't. I certainly can't.

There's a similar problem in the theatre world. It is universally understood that someone doing a 60 second monologue for an audition is _the worst way to evaluate theatrical performance_... except for everything else.

And similarly, it appears, based on scanning the successful companies, that having managers is possibly also the worst way to ensure performance... except for everything else.

So... managers it is. It's unlikely that there's a better way to do this at scale. Many people have tried. Management chains always win.


I agree with your broader consensus, but, for that example you give, HEB absolutely lets you shop by recipe on their website.


But the larger point is still true: For every "why don't they just do X? It's so obvious!" you can look around and note that [almost] /nobody/ is doing that, and that should be a pretty big signal about the idea.

My original post is about the intuition behind how to approach questions like that. Whenever anyone says "Why don't they just do $OBVIOUS_THING?" the answer is "because nobody is doing it."

Now, with respect to that particular feature, I can provide some personal experience to explain /why/ [almost] nobody has that feature. As a disclaimer, I don't know HEB's website. They were just someone we dealt with, and I don't live in their service area, so it's interesting that they have the feature.

What I can tell you from experience is that it would not be a significant driver of revenue, certainly not enough to be a majorly supported feature by a major company that has other value props out there. By far the biggest revenue driver for a grocery company is the staples that people buy every single week: the same milk, the same bread, the same cereal, the same ground beef, the same mac 'n cheese.

People, as a general population, are not adventurous at home. When you want something new and interesting, you go out to a restaurant. When you want something familiar and comfortable and, most importantly, easy, you make it at home. I would hazard that the average American family of four cooks a brand new recipe they've never had before less than a dozen times /per year/.

So if you're a company that gets >90% of its revenue from weekly recurring users and staples, and <1% of its revenue from recipe-driven results, and you have limited resources, which of those do you think you should focus on? Obviously, you focus on the former. A 10% increase in staples sales is worth millions and millions of dollars, whereas a 10% increase in recipe sales is worth, maybe, a few hundred thousand dollars. It's not nothing, but it's not really worth it from an ROI perspective. Maybe if it's a set-it-and-forget-it kind of feature, it might work?

But over the long run you'll have to come back and upgrade dependencies and migrate to the newest framework du jour, and blah blah blah, and the next thing you know you have a team of four full-time engineers working on a feature that brings in half their salary.

HEB doing it is, well, it's interesting. I am supremely confident it is not a significant revenue driver. It might be something that increases NPS scores or something to that effect, but it's not going to move the needle on revenue very much. So it's interesting that they have the feature.

So if you take all of that into account -- you'll just have to trust me that I know what I'm talking about, I'm sorry about that -- then you can see that someone saying "What if we just got senior people in a room to see what they think?" is a question that doesn't deserve much attention. Not because it's a dumb idea, but because it's actually an interesting idea that has no merits when you dig down and look at it.

If it worked, as the intuition goes, the industry at-large would be doing it. And the evidence bears that out.


Yes, I take your point.

But if I said "I don't understand the point of brakes. Why do all cars get made with brakes?", then as well as making your point ("look, do you really think you know better than Stellantis!?") there's also a straightforward answer which is "cars need brakes so they can stop instead of killing people."

What's the "straightforward answer" case for the existence of managers? Your answer just suggests that such an answer does exist, without revealing what it is.


Sort of in the same vein as "you don't need to understand gravity to recognize that it's important," you not understanding what the answer is doesn't mean there isn't an answer and that the answer isn't important.

I don't have "the answer" (I have _an_ answer, see below), and I also don't need to know "the answer" in order to understand that the managerial class doesn't exist for shits and giggles. There's value there.

If Bezos thought getting rid of managers at Amazon would make him another half a billion dollars, you bet your ass he'd do it.

My answer? It's exactly what that group of seniorish people would do: make decisions. But the seniorish people can't make decisions all day and ALSO do the things they're senior at. You may not like that answer -- and you don't have to! -- but "making decisions" is something that needs to get done at scale without sacrificing the actual productive work that ICs do.

But again, I think you're asking a great question, and I think there's room to say "Is the current paradigm the best paradigm?" and explore other alternatives.

But the very clear answer from all research in addition to basic intuition is "As far as we know, yes."

Why? Doesn't really matter. We just know that if we didn't have managers, the world as we know it wouldn't exist. (For better or for worse!)


I'll second this. It's fantastic.


> This isn't academic nit-picking. It's how medical research works when lives are on the line. Your startup's growth deserves the same rigor.

But does it, really? A lot of companies sell... well, let's say "not important" stuff. Most companies don't cost peoples' lives when you get it wrong. If you A/B test user signups for a startup that sells widgets, people aren't living or dying based on the results. The consequences of getting it wrong are... you sell fewer widgets?

While I understand the overall point of the post -- and agree with it! -- I do take issue with this particular point. A lot of companies are, arguably, _too rigorous_ when it comes to testing.

At my last company, we spent 6 weeks waiting for stat sig. But within 48 hours, we had a positive signal. Conversion was up! Not statistically significant, but trending in the direction we wanted. But to "maintain rigor," we waited the full 6 weeks before turning it on... and the final numbers were virtually the same as the 48-hour numbers.

Note: I'm not advocating stopping tests as soon as something shows trending in the right direction. The third scenario in the post points this out as a flaw! I do like their proposal for "peeking" and subsequent testing.

But, really, let's just be realistic about what level of "rigor" is required to make decisions. We aren't shooting rockets into space. We're shipping software. We can change things if we get them wrong. It's okay. The world won't end.

IMO, the right framing here is: your startup deserves to be as rigorous as is necessary to achieve its goals. If its goals are "stat sig on every test," then sure, treat it like someone might die if you're wrong. (I would argue that you have the wrong goals, in this case, but I digress...)

But if your goals are "do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive," then you kind of explicitly don't need to treat it with the same rigor as a medical test.


Completely agree. The sign up flow for your startup does not need the same rigor as medical research. You don’t need transportation engineering standards for your product packaging, either. They’re just totally different levels of risk.

I could write pages on this (I’ve certainly spoken for hours) but the adoption of a scientific research mindset is very limiting for A/B testing. You don’t need all the status quo bias of null hypothesis testing.

At the same time, it’s quite impressive how people are able to adapt. An organization experienced with A/B testing will start doing things like multivariate correction in their heads.

For anyone spinning this stuff up, go Bayesian from the start. You’ll end up there, whether you realize it or not. (People will look at p-values in consideration of prior evidence).

0.05 (or any Bayesian equivalent) is not a magic number. It’s really quite high for a default. Harder sciences (the ones not in replication crisis) use much stricter values by default.

Adjust the confidence required to the cost of the change and the risk of harm. If you’re at the point of testing, the cost of change may be zero (content). It may be really high, it may be net negative!

But in most cases, at a startup, you should be going after wins that are way more impactful and end up having p-values lower than 0.05, anyway. This is easy to say, but don’t waste your time coming up with methods to squeeze out more signal. Just (just lol) make better changes to your product so that the methods don’t matter. If p=0.00001, that’s going to be a better signal than p=0.05 with every correction in this article.

If you’re going to pick any fanciness from the start (besides Bayes), make it anytime-valid methods. You’re certainly already going to be peeking (as you should), so have your data reflect that.


> You don’t need all the status quo bias of null hypothesis testing.

You don't have to make the status quo be the null hypothesis. If you make a change, you probably already think that your change is better or at least neutral, so make that the null. If you get a strong signal that your change is actually worse, rejecting the null, revert the change.

Not "only keep changes that are clearly good" but "don't keep changes that are clearly bad."


This is a reasonable approach, particularly when you’re looking at moving towards a bigger redesign that might not pay off right away. I’ve seen it called “non-inferiority test,” if you’re curious.


Especially for startups with a small user base.

Not many users means that getting to stat sig will take longer (if at all).

Sometimes you just need to trust your design/product sense and assert that some change you’re making is better and push it without an experiment. Too often, people use experimentation for CYA reasons so they can never be blamed for making a misstep.


100% this. I’ve seen people get too excited to A/B test everything even when it’s not appropriate. For us, changing prices was a common A/B test when the relatively low number of conversions meant the tests took 3 months to run! I believe we’ve moved away from that, now.

The company has a large user base; it’s just that SaaS doesn’t have the same conversion numbers as, say, e-commerce.


The idea that you should be going after bigger wins than 0.05 misses the point. The p-value is a function of the effect size and the sample size. If you have a big effect, you’ll see it even with small data.

Completely agree on the Bayesian point though, and the importance of defining the loss function. Getting people used to talking about the strength of the evidence rather than statistical significance is a massive win most of the time.


> If you have a big effect you’ll see it even with small data.

That’s in line with what I was saying so I’m not sure where I missed the point.

The p-value is a function of effect size, variance, and sample size. Bigger wins would be those that have a larger and more consistent effect, scaled to the number of users (or just get more users).
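To make that relationship concrete, here’s a rough power-calculation sketch (statsmodels, arbitrary alpha/power, assumed effect sizes): the sample size you need per arm shrinks dramatically as the effect gets bigger.

  # Sketch: required sample size per arm for a two-sample t-test,
  # at alpha=0.05 and 80% power, for a few assumed effect sizes (Cohen's d).
  from statsmodels.stats.power import TTestIndPower

  analysis = TTestIndPower()
  for effect_size in (0.1, 0.2, 0.5):
      n = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.8)
      print(f"d={effect_size}: ~{n:.0f} users per arm")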


> But in most cases, at a startup, you should be going after wins that are way more impactful and end up having p-values lower than 0.05, anyway.

This was the part I was quibbling with. The size of the p-value is pretty much irrelevant unless you know how much data you are collecting. The p-values might always be around ~0.05 if you know the effects are likely large and you powered the study appropriately.


It does, if you assume you care about the validity of the results or about making changes that improve your outcomes.

The degree of care can be different in less critical contexts, but then you shouldn’t lie to yourself about how much you care.


But there’s an opportunity cost that needs to be factored in when waiting for a stronger signal.


One solution is to gradually move instances to your most likely solution.

But continue a percentage of A/B/n testing as well.

This allows for balancing speed vs. certainty.


Do you use any tool for this, or do you simply crank the dial up slightly each day?


There are multi-armed bandit algorithms for this. I don’t know the names of the public tools.

This is especially useful for something where the value of the choice is front loaded, like headlines.
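For the curious, here’s a minimal Thompson-sampling sketch (Beta-Bernoulli, hypothetical headline names and data): traffic drifts toward the better-performing option while still exploring the others.

  # Sketch: each arm keeps a Beta posterior over its conversion rate.
  import random

  arms = {"headline_a": {"wins": 1, "losses": 1},
          "headline_b": {"wins": 1, "losses": 1}}

  def choose_arm() -> str:
      # Sample a plausible conversion rate for each arm; serve the best draw
      draws = {name: random.betavariate(s["wins"], s["losses"])
               for name, s in arms.items()}
      return max(draws, key=draws.get)

  def record(arm: str, converted: bool) -> None:
      # Update that arm's posterior with the observed outcome
      arms[arm]["wins" if converted else "losses"] += 1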


We've used this Python package to do this: https://github.com/bayesianbandits/bayesianbandits


There is, but you can decide that up front. There are tools that will show you how long it’ll take to get statistical significance. You can then decide if you want to wait that long or have a softer p-value.


Even if you have to be honest with yourself about how much you care about being right, there’s still a place for balancing priorities. Two things can be true at once.

Sometimes someone just has to make imperfect decisions based on incomplete information, or make arbitrary judgment calls. And that’s totally fine… But it shouldn’t be confused with data-driven decisions.

The two kinds of decisions need to happen. They can both happen honestly.


I don't think I'm making the case that you shouldn't test things or care about the results, but rather a matter of degree of risk that should be acceptable. In medicine, if you get it wrong, people /die/. In software, if you get it wrong, /you sell fewer widgets/. That's a pretty major difference. You can't get it wrong in medicine, but you /can/ get it wrong in software without it being catastrophic failure.

I'm basically making the case that "Your startup deserves the same rigor [as medical testing]" is making a pretty bold assertion, and that the reality is that most of us can get away with much less rigor and still get ahead in terms of improving our outcomes.

In other words, it's still A/B testing if your p-value is 0.10 instead of 0.05. There's nothing magical about the 0.05 number. Most startups could probably get away with a 20% chance of being wrong on any particular test and still come out ahead. (Note: this assumes that the thing you're testing is good science -- one thing we aren't talking about is how many tests actually change many variables at once, and maybe that's not great!)


Can this be solved by setting p=0.50?

Make your expectations explicit instead of implicit. 0.05 is completely arbitrary. If you are comfortable with a 50/50 chance of being right, make your threshold less rigorous.


I think at that point you may as well skip the test and just make the change you clearly want to make!


Or collect some data and see if the net effect is positive. It’s possibly worth collecting some data though to rule out negative effects?


Absolutely, you can still analyse the outcomes and try to draw conclusions. This is true even for A/B testing.


I see where you are coming from, and overtesting is a thing, but I really believe that the baseline quality of all software out there is terrible. We are just so used to it that it's been normalized. But there is really no day that goes by during which I'm not annoyed by a bug that somebody with more attention to quality would not have let through.

It's not about space rocket type of rigor, but it's about a higher bar than the current state.

(Besides, Elon's rockets are failing left and right, in contrast to what NASA achieved in the 60s, so there are some lessons there too.)


I think there's a pretty big difference between QA (letting bugs go by) and A/B testing, and your post appears to me to be conflating the two. I would argue that you are better off spending your time QAing a feature that you have high confidence is positive ROI, than spending weeks waiting for an A/B test to reach stat sig.

I don't disagree with your statement, I just think you are addressing a different problem from A/B testing and statistical significance.


The thing is though, you're just as likely to be not improving things.

I think we can realize another reason to just ship it. Startups need to be always moving. You need to keep turning the wheel to help keep everyone busy and keep them from fretting about your slow growth or high churn metrics. Startups need lots of fighting spirit. So it's still probably better to ship it rather than admit defeat and suffer bad vibes.


Allow me to rephrase what I think you’re saying:

Startups need to ship because they need to have a habit of moving constantly to survive. Stasis is death for a startup.


> The consequences of getting it wrong are... you sell fewer widgets?

If that’s the difference between success and failure then that is pretty important to you as a business owner.

> do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive

That’s a reasonable, and in plenty of contexts the absolute best, approach to take. But don’t call it A/B testing, because it’s not.


Absolutely. If you're the business owner, selling fewer widgets is Very Bad!

But in my post, I specifically called out a line in OP's article that I disagreed with: (paraphrasing) "Your startup deserves the same rigor as medical testing."

To clarify -- and to support your point -- we're shipping software, not irreversible medical procedures. If you get it wrong, you sell fewer widgets /temporarily/ and you revert back to a known better solution. With medicine, there aren't necessarily take-backsies -- but there absolutely are in software. Reverting deploys is something all of us do quite regularly!

Is it A/B testing? Maybe, maybe not. I'm not a data scientist. But I think saying that your startup deserves the same rigor as a medical test is misleading at best and harmful at worst.

I just think companies should be more okay with educated risks, rather than waiting days, weeks, months for statistical significance on a feature that has little chance of actually having a negative impact. As you said elsewhere in the thread, for startups, stasis is death.

(BTW, I've read a lot of your other comments in the thread. I think we're pretty well aligned!)


Yes, 100% this. If you're comparing two layouts, there's no great reason to treat one as a 'treatment' and one as a 'control' as in medicine - the likelihood is they are both equally justified. If you run an experiment and get p=0.93 on a new treatment, are you really going to put money on that result being negative and not update the layout?

The reason we have this stuff in medicine is that it is genuinely important, and because a treatment often has bad side effects, it's worse to give someone a bad treatment than to give them nothing. That's the point of the Hippocratic oath. You don't need this for your dumb B2C app.


The other thing is that in those medical contexts, the choice is often between "use this specific treatment under consideration, or do nothing (i.e., use existing known treatments)". Is anyone planning to fold their startup if they can't get a statistically significant read on which website layout is best? Another way to phrase "do no harm" is to say that a null result just means "there is no reason to change what you're doing".


> We aren't shooting rockets into space.

Most of us don't, indeed. So, still aligned with your perspective, it's good to take into consideration what we are currently working on and what the possible implications will be. Sometimes the line is not so obvious, though. If we design a library or framework that isn't tied to some specific, inconsequential outcome, it's no longer obvious which policy makes the most sense.


It's not a matter of life and death, I agree - to some extent. Startups have very limited resources, and ignoring inconclusive results in the long term means you're spending these resources without achieving any bottom line results. If you do that too much/too long, you'll run out of funding and the startup will die.

The author didn't go into why companies do this (ignoring or misreading test results). Putting lack of understanding aside, my anecdotal experience from the time I worked as a data scientist boils down to a few major reasons:

- Wanting to be right. Being a founder requires high self-confidence, that feeling of "I know I'm right". But feeling right doesn't make one right, and there's plenty of evidence around that people will ignore evidence against their beliefs, even rationalize the denial (and yes, the irony of that statement is not lost on me).

- Pressure to show work: doing the umpteenth UI redesign is better than just saying "it's irrelevant" in your performance evaluation. If the result is inconclusive, the harm is smaller than not having anything to show - you are stalling the conclusion that your work is irrelevant by doing whatever. So you keep on pushing them and reframing the results into some BS interpretation just to get some more time.

Another thing that is not discussed enough is what all these inconclusive results would mean if properly interpreted. A long sequence of inconclusive UI redesign experiments should trigger a hypothesis like "does the UI matter"? But again, those are existentially threatening questions for the people in the best position to come up with them. If any company out there were serious about being data-driven and scientific, they'd require tests everywhere, have external controls on quality and rigour of those and use them to make strategic decisions on where they invest and divest. At the very least, take them as a serious part of their strategy input.

I'm not saying you can do everything based on tests, nor that you should - there are bets on the future, hypothesis making on new scenarios and things that are just too costly, ethically or physically impossible to test. But consistently testing and analysing test results could save a lot of work and money.


Excellent response. Thank you!


> Most companies don't cost peoples' lives when you get it wrong.

True, but it usually costs money to fix it. I think the themes of "this only matters if lives are on the line" or "it's too rigorous" are straw men.

We have limited resources -- time, money, people. We'd like to avoid deploying those resources badly. Statistical inference can be one way to give us more information so we avoid using our resources badly, but as you note, statistical inference also has costs: we have to spend resources to get the data we need to do the inference, plus other costs. We can estimate the costs of getting sufficient data using sample size estimation methods. For go/no-go decision-making, if the cost of getting the decision wrong isn't something like at least 10x the cost of doing the statistical inference, I don't think it's worth doing the inference. It may be worth doing the inference for _other_ reasons, but those reasons are out of scope.

As an example, a common use of statistical inference in medical research is to compare the efficacy of a treatment with a placebo. Some of the motivation is to decide whether to invest more resources in developing the treatment, not because people will die if they get a false positive stating that the treatment is effective when it isn't.

> A lot of companies are, arguably, _too rigorous_ when it comes to testing.

My experience in industry has been the opposite. Companies like the idea of data-driven decision-making, but then they discover pain points. They should have some idea of how much of a change they're looking to detect (i.e., an effect size). They should estimate how much data they're likely to need to run their tests (i.e., sample size estimation). They have to consider other issues like model misfit, calibration, multiple-testing corrections, and so on. Then they also have to rig up the infra to be able to _do_ the testing, collect the data, analyze the results, and communicate the results to their internal stakeholders. These pain points are why companies like Eppo and StatSig exist -- A/B testing ends up being more high-touch than developers expect.

Messing up any one of these issues can yield "flaky tests," which developers hate. Failing to gather a sufficiently large sample size for a given effect size is a pretty common failure mode.

> But to "maintain rigor," we waited 6 weeks before turning it... and the final numbers were virtually the same as the 48 hour numbers.

It's difficult to tell precisely what you mean by "maintain rigor" here. The only context I can gather is that whatever procedure you were using needed more data in order to satisfy the preconditions of the test needed for the nominal design criteria of the test -- usually, its nominal false positive rate. I don't think this is an issue of rigor -- it's an issue of statistical modeling and correctness.

Sometimes, it's possible to use different methods that may require less data at the cost of more (or different) modeling assumptions. Failing to satisfy the assumptions of a test can increase its false positive rate. Whether that matters is really up to you.

> I do like their proposal for "peeking" and subsequent testing.

What the post is suggesting is not a proposal, but a standard class of frequentist statistical inference methods called sequential testing. Daniël Lakens has a good online textbook (https://lakens.github.io/statistical_inferences/) that briefly discusses these methods in Chapter 10 and provides further references.
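A quick simulation (numpy/scipy, arbitrary parameters) shows why peeking calls for these methods rather than a plain fixed-sample test: with no real effect at all, checking the p-value after every batch and stopping at the first p < 0.05 rejects far more often than 5% of the time.

  # Sketch: under the null (both arms identical), repeated interim looks
  # inflate the false positive rate well above the nominal 5%.
  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(0)
  n_sims, n_peeks, batch = 2000, 10, 200
  false_positives = 0

  for _ in range(n_sims):
      a = rng.normal(size=n_peeks * batch)
      b = rng.normal(size=n_peeks * batch)  # same distribution: no true effect
      for k in range(1, n_peeks + 1):
          _, p = stats.ttest_ind(a[:k * batch], b[:k * batch])
          if p < 0.05:
              false_positives += 1
              break

  print(f"False positive rate with naive peeking: {false_positives / n_sims:.1%}")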

> We're shipping software. We can change things if we get them wrong.

That's usually true -- as long as you have the resources needed to make those changes, and are willing to spend them that way.

> IMO, the right framing here is: your startup deserves to be as rigorous as is necessary to achieve its goals.

While I don't disagree with the sentiment, I think you're conflating rigor with correctness here.

> If its goals are "stat sig on every test", then sure, treat it like someone might die if you're wrong.

I think that's a false equivalence. Even the American Statistical Association has issued a statement on p-values (see https://www.amstat.org/asa/files/pdfs/p-valuestatement.pdf) that includes "Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold."

> But if your goals are "do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive," then you kind of explicitly don't need to treat it with the same rigor as a medical test.

If those are your goals, just ship it; I don't think it makes sense to justify the effort to test in this situation, especially if, as you argue, it's financially feasible to roll back the change or pivot if it doesn't work.


I think you're being overly pedantic here. I'm not a data scientist, just an engineering manager who is frustrated with data scientists ;)

That said, I do appreciate your corrections, but I don't think anything you said fundamentally changes my philosophical approach to these problems.


For large applications in a service-oriented architecture, I leverage Kafka 100% of the time. With Confluent Cloud and Amazon MSK, infra is relatively trivial to maintain. There's really no reason to use anything else for this.

For smaller projects of "job queues," I tend to use Amazon SQS or RabbitMQ.

But just for clarity, Kafka is not really a message queue -- it's a persistent structured log that can be used as a message queue. More specifically, you can replay messages by resetting the offset. In a queue, the idea is once you pop an item off the queue, it's no longer in the queue and therefore is gone once it's consumed, but with Kafka, you're leaving the message where it is and moving an offset instead. This means, for example, that you can have many many clients read from the same topic without issue.

SQS and other MQs don't have that persistence -- once you consume the message and ack, the message disappears and you can't "replay it" via the queue system. You have to re-submit the message to process it. This means you can really only have one client per topic, because once the message is consumed, it's no longer available to anyone else.
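As an illustration of that difference, here's a rough sketch (kafka-python, hypothetical broker/topic names -- just one way to do it) of the replay property: a consumer can rewind its offset and reread the entire log.

  # Sketch: rewind a Kafka consumer to the start of a partition and replay it.
  from kafka import KafkaConsumer, TopicPartition

  consumer = KafkaConsumer(
      bootstrap_servers="localhost:9092",   # hypothetical broker
      group_id="orders-backfill",           # hypothetical consumer group
      enable_auto_commit=False,
  )
  tp = TopicPartition("orders", 0)          # hypothetical topic, partition 0
  consumer.assign([tp])

  consumer.seek_to_beginning(tp)            # the "flip back and reread" part
  for message in consumer:                  # blocks, reading from the log
      print(message.offset, message.value)

With SQS or RabbitMQ there's no equivalent seek: once the message is acked and deleted, the only way to see it again is to re-publish it.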

There are pros and cons to either mechanism, and there's significant overlap in the usage of the two systems, but they are designed to serve different purposes.

The analogy I tend to use is that Kafka is like reading a book. You read a page, you turn the page. But if you get confused, you can flip back and reread a previous page. An MQ like RabbitMQ or Sidekiq is more like the line at the grocery store: once the customer pays, they walk out and they're gone. You can't go back and re-process their cart.

Again, pros and cons to both approaches.

"What didn't work out?" -- I've learned in my career that, in general, I really like replayability, so Kafka is typically my first choice, unless I know that re-creating the messages are trivial, in which case I am more inclined to lean toward RabbitMQ or SQS. I've been bitten several times by MQs where I can't easily recreate the queue, and I lose critical messages.

"Where did you regret adding complexity?" -- Again, smaller systems that are just "job queues" (versus service-to-service async communication) don't need a whole lot of complexity. So I've learned that if it's a small system, go with an MQ first (any of them are fine), and go with Kafka only if you start scaling beyond a single simple system.

"And if you stuck with a DB-based queue -- did it scale?" -- I've done this in the past. It scales until it doesn't. Given my experience with MQs and Kafka, I feel it's a trivial amount of work to set up an MQ/Kafka, and I don't get anything extra by using a DB-based queue. I personally would avoid these, unless you have a compelling reason to use it (eg, your DB isn't huge, and you can save money).


> This means you can really only have one client per topic, because once the message is consumed, it's no longer available to anyone else.

It depends on your use case (or maybe what you mean by "client"). If I just have a bunch of messages that need to be processed by "some" client, then having the message disappear once a client has processed it is exactly what you want.


Absolutely, if you only ever have one client, SQS or a message queue is perfectly fine!


We build applications very differently. SQS queues with 1000s of clients have been a go to for me for over a decade. And the opposite as well — 1000s of queues (one per client device, they’re free). Zero maintenance, zero cost when unused. Absurd scalability.


Certainly. There are many paths to victory here.

One thing to consider is whether you _want_ your producers to be aware of the clients or not. If you use SQS, then your producer needs to be aware of where it's sending the message. In event-driven architecture, ideally producers don't care who's listening. They just broadcast a message: "Hey, this thing just happened." And anyone who wants to subscribe can subscribe. The analogy is a radio tower -- the radio broadcaster has no idea who's listening, but thousands and thousands of people can tune in and listen.

Contrast to making a phone call, where you have to know who it is that you're dialing and you can only talk to one person at a time.

There are pros and cons to both, but there's tremendous value in large applications for making the producer responsible for producing, but not having to worry about who is consuming. Particularly in organizations with large teams where coordinating that kind of thing can be a big pain.
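To make the radio-tower idea concrete, here's a small sketch (kafka-python again, hypothetical event and group names): the producer just broadcasts, and any number of consumer groups can tune in later without the producer ever changing.

  # Sketch: one broadcast, many independent listeners.
  import json
  from kafka import KafkaProducer, KafkaConsumer

  producer = KafkaProducer(
      bootstrap_servers="localhost:9092",
      value_serializer=lambda v: json.dumps(v).encode(),
  )
  producer.send("order-placed", {"order_id": 123, "total": 42.50})
  producer.flush()

  # Each consumer group receives its own copy of every event; the producer
  # neither knows nor cares that either of these exists.
  billing = KafkaConsumer("order-placed", group_id="billing",
                          bootstrap_servers="localhost:9092")
  analytics = KafkaConsumer("order-placed", group_id="analytics",
                            bootstrap_servers="localhost:9092")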

But you're absolutely right: queues/topics are basically free, and you can have as many as you want! I've certainly done it the SQS way that you describe many times!

As I mentioned, there are many paths to victory. Mine works really well for me, and it sounds like yours works really well for you. That's fantastic :)


Hey, I'm curious how the consumers of those queues typically consume their data. Is it some job that is polling, another piece of tech that helps scale up for bursts of queue traffic, etc.? We're using the Google equivalent, and I'm finding that there are a lot of compute resources being used on both the publisher and subscriber sides. The use cases I'm talking about here are mostly just systems trying to stay in sync with some data, where the source system is the source of record and consumers are using it for read-only purposes of some kind.


On the producer side I’d expect to see change data capture being directed to a queue fairly efficiently, but perhaps you have some intermediary that’s running between the system of record and the queue? The latter works, but yeah it eats compute.

On the consumer side, the duty cycle drives design. If it’s a steady flow, then a polling listener is easy to right-size. If the flow is episodic (long periods of idle with unpredictable spikes of high load), one option is to put an alarm on the queue that triggers when it goes from empty to non-empty, and handle that alarm by starting the processing machinery. That avoids the cost of constantly polling during dead time.
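For reference, a minimal long-polling consumer looks something like this (boto3, hypothetical queue URL); WaitTimeSeconds keeps it from hammering the API during short gaps, which is part of what makes the steady-flow case easy to right-size.

  # Sketch: steady-flow SQS consumer with long polling.
  import boto3

  sqs = boto3.client("sqs")
  queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/sync-events"  # hypothetical

  while True:
      resp = sqs.receive_message(
          QueueUrl=queue_url,
          MaxNumberOfMessages=10,
          WaitTimeSeconds=20,   # long poll: block up to 20s waiting for messages
      )
      for msg in resp.get("Messages", []):
          print(msg["Body"])    # stand-in for real processing
          sqs.delete_message(QueueUrl=queue_url,
                             ReceiptHandle=msg["ReceiptHandle"])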

