Hacker News | covi's comments

Take a look at SkyPilot. Good for running these batch workloads. You can use spot instances to save costs.
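A minimal sketch of what that might look like on the command line (the flags below, `--use-spot` and `--gpus`, are from the SkyPilot CLI; exact syntax and defaults may differ by version, so check the SkyPilot docs):

```shell
# Run a batch task on spot/preemptible instances to cut costs.
# --use-spot requests spot VMs; --gpus picks the accelerator type/count.
sky launch --use-spot --gpus A100:1 task.yaml
```

For long-running batch jobs, SkyPilot's managed jobs (`sky jobs launch`) can also recover automatically from spot preemptions.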


To massively increase the reliability of getting GPUs, you can use something like SkyPilot (https://github.com/skypilot-org/skypilot) to fall back across regions, clouds, or GPU choices. E.g.,

$ sky launch --gpus H100

will fall back across GCP regions, AWS, your own clusters, etc. There are also options to specify alternatives, e.g., try either H100, H200, A100, or <insert>.

Essentially the way you deal with it is to increase the infra search space.
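As a rough sketch of widening the search space in a SkyPilot task YAML (the set-of-accelerators form below is how SkyPilot expresses "any of these"; treat field names as illustrative and confirm against the SkyPilot task YAML docs):

```yaml
# task.yaml -- fallback across GPU types; SkyPilot provisions
# whichever accelerator it can get first, across clouds/regions.
resources:
  accelerators: {H100:1, H200:1, A100:1}

run: |
  python train.py
```

Launched with `sky launch task.yaml`, this lets the provisioner try every (accelerator, region, cloud) combination instead of failing on one exhausted pool.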


Related: https://skyplane.org/en/latest/ (mentioned in OP)

From what I know, this idea underpins the data transfer systems of a few FAANG-level companies. OP's value is a simple, open-source implementation of the idea, applied to AI.


Congrats on the API launch (from SkyPilot)!


Thanks! We used SkyPilot (an open source cloud GPU worker management tool) to help out with both our small (single node) and large (many node) training runs.


If you want to use your own GPUs or cloud accounts but with a great dev experience, see SkyPilot.


Now just need a Waymo invite code :)



https://www.forbes.com/sites/alexkonrad/2023/07/13/ai-startu...

> Its revenue run rate has spiked this year and now sits at around $30 million to $50 million, three sources said — with one noting that it had more than tripled compared to the start of the year.


Part of the Vicuna team wrote a guide on finetuning Llama2: https://blog.skypilot.co/finetuning-llama2-operational-guide...


