mallochio's comments | Hacker News

+1 for mentioning this. Took it out for a test run. Really neat stuff.


Could you also give us some details about the software you specifically use in the pipeline, other than Kubernetes/Docker? Do you use any available (possibly open source) tools for versioning and monitoring, or are you building this all up from scratch for your needs?


We use Jenkins for our builds (and soon simple test cases), which then feed into a system that's built from scratch. However, we are looking to productize the system, as we've had some discussions with other corporate partners who are interested in it as well.

The system we've built in-house is fairly simple, to keep development fast. Versioning is manual (https://packaging.python.org/guides/hosting-your-own-index/), but there's no reason you couldn't use a repository manager.
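
The linked guide boils down to serving a PEP 503-style directory of built distributions over plain HTTP. A minimal sketch, assuming a hypothetical ./packages directory, port, and package name:

    # Serve ./packages (e.g. packages/mymodel/mymodel-0.1.0-py3-none-any.whl)
    # as a bare-bones package index; any static file server will do.
    import functools
    from http.server import HTTPServer, SimpleHTTPRequestHandler

    handler = functools.partial(SimpleHTTPRequestHandler, directory="packages")
    HTTPServer(("0.0.0.0", 8080), handler).serve_forever()

    # Clients then install with:
    #   pip install mymodel --extra-index-url http://pypi.internal:8080/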

For monitoring, we essentially track any activity related to the model (inputs, outputs, timestamps, duration, etc.) in a database and render it with JavaScript charts. We might put this into Kafka, but that seems like overkill at the moment and would likely force us to hire an actual support team.
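
As a rough sketch of what that tracking looks like in practice (db.log_prediction is a hypothetical helper, not a real library call):

    import functools
    import time

    def tracked(model_name, db):
        """Wrap a predict function and log every call to a database."""
        def decorator(predict_fn):
            @functools.wraps(predict_fn)
            def wrapper(inputs):
                start = time.time()
                outputs = predict_fn(inputs)
                # Hypothetical helper: writes one row per prediction.
                db.log_prediction(
                    model=model_name,
                    inputs=inputs,
                    outputs=outputs,
                    timestamp=start,
                    duration=time.time() - start,
                )
                return outputs
            return wrapper
        return decorator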


Interesting. Could you maybe expand on the tools that you utilize inside the pipeline for ETL, model creation, annotation and testing?


I'm running the front end and back end of the consumer site on Heroku. The meat of the pipeline is hosted on a DigitalOcean High CPU Droplet. I use ffmpeg to extract images from the provided videos. I store everything in Google Cloud Storage and create references to each photo in Firestore. I use Firebase to power the image verification/labeling app I built. It's a simple app that presents the viewer with the image and the label it was given; if it's not correct, they enter the correct label. I use a Cloud Function to move the images into an exportable format for AutoML once a new-image threshold has been hit. Testing is me using it and seeing if it correctly identifies the objects.
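
The extraction/upload step looks roughly like this (a sketch assuming the google-cloud-storage client library; the bucket name and one-frame-per-second rate are placeholders):

    import subprocess
    from pathlib import Path
    from google.cloud import storage

    def extract_and_upload(video_path, bucket_name="labeling-frames"):
        out_dir = Path("frames")
        out_dir.mkdir(exist_ok=True)
        # Sample one frame per second of video with ffmpeg.
        subprocess.run(
            ["ffmpeg", "-i", video_path, "-r", "1",
             str(out_dir / "frame_%04d.jpg")],
            check=True,
        )
        # Upload each frame; creating the Firestore reference per frame
        # would happen alongside this (omitted here).
        bucket = storage.Client().bucket(bucket_name)
        for frame in sorted(out_dir.glob("*.jpg")):
            bucket.blob(f"frames/{frame.name}").upload_from_filename(str(frame))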


Thanks for the reply. Could you give some more insight into how and which tools you choose for the different sorts of tasks (say NLP vs. CV vs. RL)? Also, how and why are different tools/pipelines better for production and product building?


How you parse and manage the inputs is significantly different between those types.

With NLP as one example, you need to determine when you are going to do tokenization, i.e., break the inputs up into "tokens." Do you do this at ingest, in transit, or at rest?
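
For instance, "tokenize at ingest" can be as simple as normalizing text once, as records enter the pipeline, so every downstream stage sees the same tokens. A toy sketch:

    import re

    def tokenize(text):
        # Naive word tokenizer; a real pipeline would likely use
        # something like spaCy or a subword tokenizer here.
        return re.findall(r"\w+", text.lower())

    def ingest(record):
        record["tokens"] = tokenize(record["text"])
        return record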

With CV you don't need to do tokenization at all (probably).

So the tools really come out of the use case and how/when you put them into the production chain.


This is great, thanks for the link. Could you expand on how this workflow would be different from/better than sticking to something like TFX and TensorFlow Serving? Is it easier to use or more scalable?


It is pretty much the same as TFX, but with Spark for both data prep and distributed hyperparameter optimization/training, and a Feature Store. Model serving is slightly more sophisticated than just TensorFlow Serving on Kubernetes: we support serving requests through the Hopsworks REST API to TF Serving on Kubernetes. This gives us both access control (clients have a TLS cert to authenticate and authorize themselves) and prediction logging, as we log all predictions to a Kafka topic. We are adding support for enriching feature vectors using the Feature Store in the serving API; we're not quite there yet.
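
From the client side, that looks roughly like the following (the host, path, and model name are made up; the request body follows the TensorFlow Serving REST convention):

    import requests

    resp = requests.post(
        "https://hopsworks.example.com/api/serving/mymodel:predict",
        json={"instances": [[1.0, 2.0, 3.0]]},
        # The client TLS cert both authenticates and authorizes the caller.
        cert=("client.pem", "client_key.pem"),
        verify="ca.pem",
    )
    predictions = resp.json()["predictions"]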

We intend to support TFX, as we already support Flink. Flink/Beam for Python 3 is needed for TFX, and it's not quite there yet, but almost.

It will be interesting to see which one of Spark or Beam will become the horizontally scalable platform of choice for TensorFlow. (PyTorch people don't seem as interested, as they mostly come from a background of not wanting complexity).


Thanks a bunch for taking the effort to create this, and for making it open source! The project looks amazing.

Could you maybe also explain what the target audience is? Are there any benefits for using Polyaxon in (solo) research projects on a cluster, or is it tailored towards production-ready environments at corporations?


I think the target audience is individuals or small teams who want an organized workflow and immutable, reproducible experiments, with an organized and easy way to access logs and outputs.

The platform also provides a lot of automation to schedule concurrent experiments.

There are a couple of things that still need to be polished: notes on experiments and notifications for finished experiments, which matter especially if you are running hundreds of experiments.

Depending on how organized you are, you will often end up with experiments that you no longer know how you started; having a platform that takes care of that can be beneficial.

If you already have a cluster for running your experiments, you will most probably end up SSHing into the machines (probably in a screen session) to check which experiments have finished and to look at their results and logs. Polyaxon simplifies that part as well.


These are all the reasons why I want to use your tool.

Thanks for creating it and I hope one day soon to contribute to your project.


Can someone tell me how this is different from/improves over Pachyderm?


Pachyderm is a system for the nouns; this is a system for the verbs.

Polyaxon makes it easy to schedule training on a Kubernetes cluster. The problem this solves is that machine learning engineers generally spend too long running their jobs in series, rather than parallel. Instead of running one thing and waiting for it to finish, it's both more efficient and better methodology to plan out the experiments and then run them all at once.
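
This isn't Polyaxon's actual API, just the underlying idea in plain Python: plan the grid up front, then launch the runs concurrently instead of one after another:

    from concurrent.futures import ProcessPoolExecutor
    from itertools import product

    def train(params):
        lr, batch_size = params
        # Stand-in for a real training run; returns a validation score.
        return {"lr": lr, "batch_size": batch_size, "val_acc": 0.0}

    if __name__ == "__main__":
        grid = list(product([1e-2, 1e-3, 1e-4], [32, 64, 128]))
        with ProcessPoolExecutor() as pool:
            results = list(pool.map(train, grid))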

Pachyderm is more concerned with versioning and asset management. It's more like Git+Airflow.

Let's say your experiment depends on training word vectors from Common Crawl dumps. You need to download the dump, extract the text you want, and train your word vector models. Pachyderm is all about the problem of caching the intermediate results of that ETL pipeline and making sure that you don't lose track of, say, which month of data was used to compute these vectors. Polyaxon is all about a different problem: there are so many ways to train the word vectors and use them downstream, and you want to explore that space systematically, by scheduling and automatically evaluating a lot of the work in parallel.
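
A toy version of the caching idea Pachyderm handles for you: key each step's output by a hash of its inputs, so re-running on the same month of data is a cache hit (the cache path and step names are hypothetical):

    import hashlib
    import pickle
    from pathlib import Path

    CACHE = Path("etl_cache")
    CACHE.mkdir(exist_ok=True)

    def cached_step(name, inputs, fn):
        key = hashlib.sha256(f"{name}:{inputs!r}".encode()).hexdigest()
        out = CACHE / key
        if out.exists():
            return pickle.loads(out.read_bytes())
        result = fn(inputs)
        out.write_bytes(pickle.dumps(result))
        return result

    # e.g. text = cached_step("extract", "crawl-2019-04", extract_text)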


I just want to add to the other comments that Polyaxon focuses on different aspects of machine learning reproducibility than Pachyderm. Polyaxon will be providing a very simple pipelining abstraction to start experiments based on previous jobs, triggers, or schedules, and the possibility to run post-experiment jobs, but it will not focus on data provenance the same way Pachyderm does. In fact, Polyaxon and Pachyderm could be used together.


I don't know Pachyderm, but it seems quite similar to Storm for creating data pipelines. Polyaxon is useful for training deep learning models on a cluster; I couldn't find any example of how to do that in Pachyderm (there are examples with only one node).


I use Mailspring (https://github.com/Foundry376/Mailspring). Neat features, sleek UI and also pretty fast.

