Hacker News | covi's comments

Take a look at SkyPilot. Good for running these batch workloads. You can use spot instances to save costs.
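A minimal sketch of what that might look like on the command line (the flags below, `--use-spot` and `--gpus`, are from the SkyPilot CLI; exact syntax and defaults may differ by version, so check the SkyPilot docs):

```shell
# Run a batch task on spot/preemptible instances to cut costs.
# --use-spot requests spot VMs; --gpus picks the accelerator type/count.
sky launch --use-spot --gpus A100:1 task.yaml
```

For long-running batch jobs, SkyPilot's managed jobs (`sky jobs launch`) can also recover automatically from spot preemptions.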


To massively increase the reliability of getting GPUs, you can use something like SkyPilot (https://github.com/skypilot-org/skypilot) to fall back across regions, clouds, or GPU choices. E.g.,

$ sky launch --gpus H100

will fall back across GCP regions, AWS, your own clusters, etc. There are also options to specify alternatives, e.g., try either H100, H200, A100, or <insert>.

Essentially the way you deal with it is to increase the infra search space.
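As a rough sketch of widening the search space in a SkyPilot task YAML (the set-of-accelerators form below is how SkyPilot expresses "any of these"; treat field names as illustrative and confirm against the SkyPilot task YAML docs):

```yaml
# task.yaml -- fallback across GPU types; SkyPilot provisions
# whichever accelerator it can get first, across clouds/regions.
resources:
  accelerators: {H100:1, H200:1, A100:1}

run: |
  python train.py
```

Launched with `sky launch task.yaml`, this lets the provisioner try every (accelerator, region, cloud) combination instead of failing on one exhausted pool.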


Related: https://skyplane.org/en/latest/ (mentioned in OP)

From what I know, this idea underpins the data transfer systems of a few FAANG-level companies. OP's value is a simple, open-source implementation of the idea, applied to AI.


Congrats on the API launch (from SkyPilot)!


Thanks! We used SkyPilot (an open source cloud GPU worker management tool) to help out with both our small (single node) and large (many node) training runs.


If you want to use your own GPUs or cloud accounts but with a great dev experience, see SkyPilot.


Now just need a Waymo invite code :)



https://www.forbes.com/sites/alexkonrad/2023/07/13/ai-startu...

> Its revenue run rate has spiked this year and now sits at around $30 million to $50 million, three sources said — with one noting that it had more than tripled compared to the start of the year.


Part of the Vicuna team wrote a guide on finetuning Llama2: https://blog.skypilot.co/finetuning-llama2-operational-guide...


