
It will not work either if you have developer mode enabled.

These things the HSBC app does are, I think, overreaching


My country launched an identification app (https://mygov.be/) that does the same thing. I have no idea what they're trying to achieve. Security through obscurity? Trying to piss off power users?

I'm a developer and use adb and some dev settings daily. Annoying af to have to disable developer mode constantly.


It's fundamentally client-side security: the phone tells the server "no, I haven't been rooted" and the server believes it.

Any security system that relies on any form of client-side security is going to have other problems as well, since its designers haven't grasped this basic principle.


That used to be a core principle, but it might not be guaranteed anymore. Depending on the implementation, it can be near impossible to bypass modern hardware-backed security. As it should be!

The policy issue at this point is that users effectively aren't in control of their devices anymore.


I had to turn on developer mode just to reduce blur in Android 16. It's incredible that that's locked behind a developer mode setting.

> It will not work either if you have developer mode enabled.

Many other banking apps in Singapore have this ridiculous restriction too, including Citibank.

The third-party "security framework" most of them use to pass audits is ridiculous.


I just set up Gmail backup the other day, using getmail + cron. The emails get stored as Maildir (1 mail = 1 file), which is incremental-backup friendly.
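A minimal sketch of that kind of setup (rc file name, paths, and schedule are illustrative; Gmail needs IMAP enabled and an app password, and the Maildir's cur/new/tmp dirs have to exist first):

    # ~/.config/getmail/getmailrc-gmail
    [retriever]
    type = SimpleIMAPSSLRetriever
    server = imap.gmail.com
    username = you@gmail.com
    password = your-app-password

    [destination]
    type = Maildir
    # trailing slash required; 1 mail = 1 file under new/
    path = ~/Mail/gmail/

    [options]
    # only fetch messages not seen before (incremental), leave everything on the server
    read_all = false
    delete = false

    # crontab entry to pull new mail every 30 minutes
    */30 * * * * getmail --getmaildir ~/.config/getmail --rcfile getmailrc-gmail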

The problem with this approach is that the model may forget to update the log... It usually happens once the context window is more than 50% filled.

I found this happens less often if the task is part of the plan. It typically gets into a habit of cycling between editing code and updating the doc.

thanks, I'll keep that in mind

.. and not only those, but the baseline as well, aka CLAUDE.md. I've told it the basics countless times, in the same session, without compacting, etc.

Not OP, but I do have a question. How did you do lead generation? What works?

Slavery as punishment is actually allowed by the constitution...

AMENDMENT XIII

Section 1. Neither slavery nor involuntary servitude, except as a punishment for crime whereof the party shall have been duly convicted, shall exist within the United States, or any place subject to their jurisdiction.

https://www.archives.gov/milestone-documents/13th-amendment


Just because it's legal doesn't mean it's ethical or moral; there are plenty of examples of things in that category.

That doesn't mean it's right, it means the constitution is wrong!

The constitution isn't a holy book, it's some opinions someone wrote down on paper. Some of them might be wrong.


The only way to get it to change might be to force it upon the middle and upper class.

Not just permitted, but actually widespread. If you're imprisoned in Texas, Georgia, Arkansas, Alabama, or Mississippi, you are going to be doing unpaid forced labor, which is slavery, and many of the prisons are privately owned.

Federal prisons pay roughly $0.12 to $0.40 per hour for regular jobs, which isn’t much better.

The hypocrisy of the US is breathtaking sometimes, and the current administration has the gall to criticise Europe.


I bought a second‑hand Mac Studio Ultra M1 with 128 GB of RAM, intending to run an LLM locally for coding. Unfortunately, it's just way too slow.

For instance, a 4-bit quantized model of GLM 4.6 runs very slowly on my Mac. It's not only about tokens-per-second speed but also input processing, tokenization, and prompt loading; it takes so much time that it's testing my patience. People often mention TPS numbers, but they neglect to mention the input loading times.


At 4 bits that model won't fit into 128 GB, so you're spilling over into swap, which kills performance. I've gotten great results out of GLM-4.5-Air, which is 4.5 distilled down to 110B params and fits nicely at 8 bits, or maybe 6 if you want a little more RAM left over.
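Back-of-the-envelope, weights only (real GGUF quants carry some overhead, and you still need room for KV cache and the OS), assuming a 110B-param model:

    # rough size estimate: params * bits-per-weight / 8
    params = 110e9
    for bits in (8, 6, 4):
        print(f"{bits}-bit: ~{params * bits / 8 / 1e9:.1f} GB")
    # -> 110.0 GB, 82.5 GB, 55.0 GB
    # GLM-4.6 is ~355B total params, so a true 4-bit quant is ~178 GB -- hence the swapping on 128 GB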

Correction: my GLM-4.6 models are not Q4; I can only run lower quants, e.g.:

- https://huggingface.co/unsloth/GLM-4.6-GGUF/blob/main/GLM-4.... - 84 GB, Q1

- https://huggingface.co/unsloth/GLM-4.6-REAP-268B-A32B-GGUF/t... - 92 GB, Q2

I make sure there's enough RAM left over (i.e. a limited context window setting), so there's no swapping.

As for GLM-4.5-Air, I run that daily, switching between noctrex/GLM-4.5-Air-REAP-82B-A12B-MXFP4_MOE-GGUF and kldzj/gpt-oss-120b-heretic


Are you getting any agentic use out of gpt-oss-120b?

I can't tell if it's some bug regarding message formats or if it's just genuinely giving up, but it failed to complete most tasks I gave it.


GPT-oss-120B was also completely failing for me, until someone on reddit pointed out that you need to pass back in the reasoning tokens when generating a response. One way to do this is described here:

https://openrouter.ai/docs/guides/best-practices/reasoning-t...

Once I did that it started functioning extremely well, and it's the main model I use for my homemade agents.

Many LLM libraries/services/frontends don't pass these reasoning tokens back to the model correctly, which is why people complain about this model so much. It also highlights the importance of rolling these things yourself and understanding what's going on under the hood, because there are so many broken implementations floating around.
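For anyone hitting the same wall, this is roughly what the fix looks like when rolling it yourself. Field names vary by server/gateway (e.g. "reasoning" vs "reasoning_content"), and the endpoint/model name here are placeholders, so treat this as a sketch rather than a spec:

    import requests

    API = "http://localhost:8080/v1/chat/completions"   # assumed OpenAI-compatible endpoint
    messages = [{"role": "user", "content": "Summarize the failing test in one sentence."}]

    resp = requests.post(API, json={"model": "gpt-oss-120b", "messages": messages})
    msg = resp.json()["choices"][0]["message"]

    # The crucial part: keep the reasoning on the assistant turn you send back,
    # instead of appending only msg["content"].
    assistant_turn = {"role": "assistant", "content": msg.get("content") or ""}
    for key in ("reasoning", "reasoning_content"):        # whichever your server returns
        if msg.get(key):
            assistant_turn[key] = msg[key]
    messages.append(assistant_turn)

    # ...append the next user/tool message here, then POST the full history again.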


IIRC I did, and it failed, but I didn't investigate further.

I've been running the 'frontier' open-weight LLMs (mainly deepseek r1/v3) at home, and I find that they're best for asynchronous interactions. Give it a prompt and come back in 30-45 minutes to read the response. I've been running on a dual-socket 36-core Xeon with 768GB of RAM and it typically gets 1-2 tokens/sec. Great for research questions or coding prompts, not great for text auto-complete while programming.

Let's say 1.5 tok/sec, and that your rig pulls 500 W. That's 10.8 tok/Wh, and assuming you pay, say, 15c/kWh, you're paying in the vicinity of $13.8/Mtok of output. Looking at R1 output costs on OpenRouter, that's about 5-7x as much as what you can pay for third-party inference (which also produces tokens ~30x faster).
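The arithmetic, for anyone who wants to plug in their own numbers (rate, power draw, and tariff are the assumptions above):

    tok_per_s, watts, usd_per_kwh = 1.5, 500, 0.15

    tok_per_wh = tok_per_s * 3600 / watts            # 10.8 tokens per watt-hour
    kwh_per_mtok = 1e6 / tok_per_wh / 1000           # ~92.6 kWh per million output tokens
    print(f"${kwh_per_mtok * usd_per_kwh:.1f}/Mtok") # ~$13.9/Mtok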

Given the cost of the system, how long would it take to be less expensive than, for example, a $200/mo Claude Max subscription with Opus running?

It's not really an apples-to-apples comparison - I enjoy playing around with LLMs, running different models, etc, and I place a relatively high premium on privacy. The computer itself was $2k about two years ago (and my employer reimbursed me for it), and 99% of my usage is for research questions which have relatively high output per input token. Using one for a coding assistant seems like it can run through a very high number of tokens with relatively few of them actually being used for anything. If I wanted a real-time coding assistant, I would probably be using something that fit in the 24GB of VRAM and would have very different cost/performance tradeoffs.

For what it is worth, I do the same thing you do with local models: I have a few scripts that build prompts from my directions and the contents of one or more local source files. I start a local run and get some exercise, then return later for the results.

I own my computer, it is energy efficient Apple Silicon, and it is fun and feels good to do practical work in a local environment and be able to switch to commercial APIs for more capable models and much faster inference when I am in a hurry or need better models.

Off topic, but: I cringe when I see social media posts of people running many simultaneous agentic coding systems and spending a fortune in money and environmental energy costs. Maybe I just have ancient memories from using assembler language 50 years ago to get maximum value from hardware but I still believe in getting maximum utilization from hardware and wanting to be at least the ‘majority partner’ in AI agentic enhanced coding sessions: save tokens by thinking more on my own and being more precise in what I ask for.


Never; local models are for hobby use and (extreme) privacy concerns.

A less paranoid and much more economically efficient approach would be to just lease a server and run the models on that.


This.

I spent quite some time on r/LocalLLaMA and have yet to see a convincing "success story" of productively using local models to replace GPT/Claude etc.


I have several little success stories of my own:

- For polishing Whisper speech-to-text output, so I can dictate things to my computer and get coherent sentences, or for shaping the dictation into a specific format, e.g. "generate ffmpeg to convert mp4 video to flac with fade in and out, input file is myvideo.mp4 output is myaudio flac with pascal case" -> Whisper -> "generate ff mpeg to convert mp4 video to flak with fade in and out input file is my video mp4 output is my audio flak with pascal case" -> Local LLM -> "ffmpeg ..." (a rough sketch of this step is below)

- Doing classification/selection type of work, e.g. classifying business leads based on their profiles

Basically the win for a local LLM is that the running cost (in my case, a second-hand M1 Ultra) is so low that I can run a large quantity of calls that don't need frontier models.
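The Whisper cleanup step from the first bullet is basically one chat call. A rough sketch, assuming a local OpenAI-compatible server; the endpoint, port, and model name are placeholders for whatever you run:

    import requests

    def polish(transcript: str) -> str:
        """Turn raw Whisper output into the command/sentence the speaker meant."""
        resp = requests.post(
            "http://localhost:1234/v1/chat/completions",
            json={
                "model": "glm-4.5-air",
                "messages": [
                    {"role": "system",
                     "content": "Rewrite this raw speech-to-text transcript into what the "
                                "speaker most likely meant. If it is a request for a shell "
                                "command, output only the command."},
                    {"role": "user", "content": transcript},
                ],
                "temperature": 0.2,
            },
        )
        return resp.json()["choices"][0]["message"]["content"]

    print(polish("generate ff mpeg to convert mp4 video to flak with fade in and out ..."))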


My comment was not very clear. I specifically meant Claude Code/Codex-like workflows where the agent generates/runs code interactively with user feedback. My impression is that consumer-grade hardware is still too slow for these things to work.

You are right, consumer-grade hardware is mostly too slow... although it's a relative thing, right? For instance, you can get a Mac Studio Mx Ultra with 512 GB RAM, run GLM-4.5-Air, and have a bit of patience. It could work.

I was able to run a batch job that amounted to ~2 weeks of inference time on my M4 Max by running it overnight against a large dataset I wanted to mine. It cost me pennies in electricity plus a simple Python script as a scheduler.
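The "simple scheduler" really can be this simple. A sketch, assuming a JSONL dataset and a local OpenAI-compatible server (file names, endpoint, and model name are made up); the resume-on-restart bit means an interrupted night isn't lost:

    import json, requests

    done = set()
    try:
        with open("results.jsonl") as f:            # resume: skip items already answered
            done = {json.loads(line)["id"] for line in f}
    except FileNotFoundError:
        pass

    with open("dataset.jsonl") as src, open("results.jsonl", "a") as out:
        for line in src:
            item = json.loads(line)
            if item["id"] in done:
                continue
            r = requests.post(
                "http://localhost:1234/v1/chat/completions",
                json={"model": "local-model",
                      "messages": [{"role": "user", "content": item["prompt"]}]},
            )
            answer = r.json()["choices"][0]["message"]["content"]
            out.write(json.dumps({"id": item["id"], "answer": answer}) + "\n")
            out.flush()                              # survive crashes mid-run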

Tokens will cost the same on a Mac as on an API, because electricity is not free.

And you can only generate like $20 worth of tokens a month.

Cloud tokens made on TPUs will always be cheaper and waaay faster than anything you can make at home.


This generally isn't true. Cloud vendors have to make back the cost of electricity and the cost of the GPUs. If you already bought the Mac for other purposes, also using it for LLM generation means your marginal cost is just the electricity.

Also, vendors need to make a profit! So tack a little extra on as well.

However, you're right that it will be much slower. Even just an 8xH100 can do 100+ tps for GLM-4.7 at FP8; no Mac can get anywhere close to that decode speed. And for long prompts (which are compute constrained) the difference will be even more stark.


A question on the 100+ tps: is this for short prompts? For large contexts that generate a chunk of tokens at context sizes of 120k+, I was seeing 30-50, and that's with a 95% KV cache hit rate. I'm wondering if I'm simply doing something wrong here...

Depends on how well the speculator predicts your prompts, assuming you're using speculative decoding — weird prompts are slower, but e.g. TypeScript code diffs should be very fast. For SGLang, you also want to use a larger chunked prefill size and larger max batch sizes for CUDA graphs than the defaults IME.

It doesn't matter if you spend $200, $20,000, or $200,000 a month on an Anthropic subscription.

None of them will keep your data truly private and offline.


Yes, they conveniently forget to disclose prompt-processing time. There is an affordable answer to this; I'll be open-sourcing the design and software soon.

Have you tried Qwen3 Next 80B? It may run a lot faster, though I don't know how well it does coding tasks.

I did, it works well... although it is not good enough for agentic coding

Need the M5 (Max/Ultra next year) with its MATMUL instructions, which massively speed up prompt processing.

Anything except a 3-bit quant of GLM 4.6 will exceed those 128 GB of RAM you mentioned, so of course it's slow for you. If you want good speeds, you at least need the entire thing to fit in memory.

Even LinkedIn is now down. Opening linkedin.com gives me a 500 server error with Cloudflare at the bottom. Quite embarrassing.


At least they were available when Front Door was down!


Sure, Bun has its benefits, but I don't see the strategic reason why Anthropic is doing this.


Apparently Claude Code being built on Bun was considered a good enough reason? But it looks more strategic for Bun since they’re VC-backed and get a good exit:

> Claude Code ships as a Bun executable to millions of users. If Bun breaks, Claude Code breaks. Anthropic has direct incentive to keep Bun excellent.

> Bun's single-file executables turned out to be perfect for distributing CLI tools. You can compile any JavaScript project into a self-contained binary—runs anywhere, even if the user doesn't have Bun or Node installed. Works with native addons. Fast startup. Easy to distribute.

> Claude Code, FactoryAI, OpenCode, and others are all built with Bun.

> Over the last several months, the GitHub username with the most merged PRs in Bun's repo is now a Claude Code bot. We have it set up in our internal Discord and we mostly use it to help fix bugs. It opens PRs with tests that fail in the earlier system-installed version of Bun before the fix and pass in the fixed debug build of Bun. It responds to review comments. It does the whole thing.

> This feels approximately a few months ahead of where things are going. Certainly not years.

> We've been prioritizing issues from the Claude Code team for several months now. I have so many ideas all the time and it's really fun. Many of these ideas also help other AI coding products.

> Instead of putting our users & community through "Bun, the VC-backed startups tries to figure out monetization" – thanks to Anthropic, we can skip that chapter entirely and focus on building the best JavaScript tooling.

https://bun.com/blog/bun-joins-anthropic


Turn every potentially useful development tool into some LLM hype bullshit to grow the bubble.


Same thought and I can't wait for a video with a very confused Theo...

I mean, it's likely very important for them to have a fast and sandboxed code executor available. But it's not like Bun would fight against improvements there or refuse paid work on specific areas, right?


and can be faster if you can get an MOE model of that


"Mixture-of-experts", AKA "running several small models and activating only a few at a time". Thanks for introducing me to that concept. Fascinating.

(commentary: things are really moving too fast for the layperson to keep up)


As pointed out by a sibling comment, MoE consists of a router and a number of experts (e.g. 8). These experts can be imagined as parts of the brain with specialization, although in reality they probably don't work exactly like that. They aren't separate models; they are components of a single large model.

Typically, input gets routed to a small number of experts, e.g. the top 2, leaving the others inactive. This reduces the amount of activation/processing required.

Mixtral (from Mistral) is an example of a model designed like this. Clever people have created converters to transform dense models into MoE models. These days many popular models are also available in MoE configurations.
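A toy illustration of the routing part (numpy, nothing like a real implementation; it just shows "pick the top-k experts, skip the rest"):

    import numpy as np

    d, n_experts, top_k = 16, 8, 2
    rng = np.random.default_rng(0)
    W_gate = rng.normal(size=(d, n_experts))                       # router
    experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # stand-ins for expert FFNs

    def moe_layer(x):
        scores = x @ W_gate                     # one logit per expert
        top = np.argsort(scores)[-top_k:]       # route to the top-k experts
        w = np.exp(scores[top]); w /= w.sum()   # softmax over the chosen ones
        # only these top_k experts execute; the other n_experts - top_k are skipped
        # entirely, which is where the compute savings come from
        return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

    print(moe_layer(rng.normal(size=d)).shape)  # (16,)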


That's not really a good summary of what MoEs are. You can think of it more as sublayers that get routed through (like how the brain only lights up certain pathways) rather than actual separate models.


The gain from MoE is that you can have a large model that's efficient; it lets you decouple #params and computation cost. I don't see how anthropomorphizing MoE <-> brain affords insight deeper than 'less activity means less energy used'. These are totally different systems; IMO this shallow comparison muddies the water and does a disservice to each field of study. There's been loads of research showing there's redundancy in MoE models, e.g. Cerebras has a paper[1] where they selectively prune half the experts with minimal loss across domains -- I'm not sure you could disable half the brain without a stupefying difference.

[1] https://www.cerebras.ai/blog/reap


> I don't see how anthropomorphizing MoE <-> brain affords insight deeper than 'less activity means less energy used'.

I'm not saying it is a perfect analogy, but it is by far the most familiar one for people to describe what sparse activation means. I'm no big fan of over-reliance on biological metaphor in this field, but I think this is skewing a bit on the pedantic side.

re: your second comment about pruning, not to get in the weeds but I think there have been a few unique cases where people did lose some of their brain and the brain essentially routed around it.


All modern models are MoE already, no?


That's not the case. Some are dense and some are hybrid.

MoE is not the holy grail, as there are drawbacks, e.g. less consistency and expert under-/over-use.


>90% of inference hardware is faster if you run an MOE model.


DeepSeek is already an MoE.


With quantization, converting it to an MOE model... it can be a fast walk

