GPTinker is an experimental AI junior developer, a weekend project of mine (code available at [1]) designed to test the limits of end-to-end coding with GPT-4.
TL;DR: Here's a 7-minute video showing the app in action: https://youtu.be/XgMKCeiUDQc
Although it has already been demonstrated that GPT models can write new code and even modify existing code based on a prompt, I believe the greatest potential lies in interacting with existing codebases, an area that has not yet been widely explored. The aim of this project is to see how far this idea can be taken and to gauge its usefulness.
At present, GPTinker is a proof of concept that I put together in just a few days. While it is currently constrained to TypeScript repositories (though that would be easy to change), it can already do some pretty neat things:
* Modify existing components and write new ones based on current ones, while maintaining the app's style
* Refactor components by extracting parts of their code into new components, then integrating those where the old code used to be
* Make project-wide changes, such as performing a search and modifying all component invocations to support new props or changing their class names
* Write unit tests for components and business logic, run them, and fix any issues
* Debug and fix configuration issues, install missing dependencies, and prompt you for how to proceed when running a command fails
* Perform other tasks that I haven't yet thought of testing
GPTinker achieves all this by utilizing ideas from Toolformer and LangChain. The model is asked to split the given task into a series of discrete actions and then execute them one by one using the built-in commands (tools): ListFiles, ReadFile, WriteFile, RunCommand, and so on. It writes out JSON that gets parsed by the backend, which executes the command with the given parameters and feeds the output back to the model. This continues in a loop until the model decides the task is complete and stops emitting commands.
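Concretely, the loop looks something like the following. This is a minimal TypeScript sketch rather than the actual implementation: the JSON shape, the helper signatures, and the error handling are illustrative assumptions, and the real tool set is richer.

```typescript
import { promises as fs } from "node:fs";
import { execSync } from "node:child_process";

type Command = { command: string; args: Record<string, string> };
type Message = { role: "system" | "user" | "assistant"; content: string };

// Tool implementations, keyed by the command names the model is allowed to emit.
const tools: Record<string, (args: Record<string, string>) => Promise<string>> = {
  ListFiles: async ({ dir }) => (await fs.readdir(dir)).join("\n"),
  ReadFile: async ({ path }) => fs.readFile(path, "utf8"),
  WriteFile: async ({ path, content }) => {
    await fs.writeFile(path, content);
    return `Wrote ${path}`;
  },
  RunCommand: async ({ cmd }) => {
    try {
      return execSync(cmd, { encoding: "utf8" });
    } catch (e: any) {
      return `Command failed:\n${e.stderr ?? e.message}`; // let the model decide how to proceed
    }
  },
};

// Pull the first JSON object out of the model's reply; null means "no command".
function tryParseCommand(reply: string): Command | null {
  const match = reply.match(/\{[\s\S]*\}/);
  if (!match) return null;
  try {
    return JSON.parse(match[0]) as Command;
  } catch {
    return null;
  }
}

// The core loop: ask the model, run its command, feed the output back, repeat.
async function runTask(task: string, chat: (m: Message[]) => Promise<string>) {
  const messages: Message[] = [{ role: "user", content: task }];
  for (;;) {
    const reply = await chat(messages); // GPT-4 call, e.g. via the OpenAI API
    messages.push({ role: "assistant", content: reply });
    const cmd = tryParseCommand(reply);
    if (!cmd || !(cmd.command in tools)) break; // no command => the task is done
    const output = await tools[cmd.command](cmd.args);
    messages.push({ role: "user", content: output });
  }
}
```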
I've already noticed some shortcomings and limitations of this method and GPT-4 in general:
* The 8k token limit prevents editing larger files and having longer conversations. However, with the 32k model on the horizon and by implementing some optimizations, this limitation can likely be addressed
* Writing out larger files can be slow. This will undoubtedly improve over time (hoping for a "turbo" model soon), and if I manage to implement a PatchFile command so that the model emits only diffs, it will speed things up and cut down on token usage (see the first sketch after this list)
* Reasoning about your code: while GPT-4 is generally good at understanding what your code does and how to apply changes to it, its performance can be hit or miss depending on the task you throw at it. Think of it as a junior-level developer with lots of patience
* Being either too strict or too lax with given instructions. Often, it will do only the one thing you asked for, disregarding any potential consequences of doing so. Other times, it might not listen to parts of the instructions (e.g., while investigating the PatchFile command, I couldn't get it to reliably output changes in a diff format)
* Lack of knowledge beyond 2021: GPTinker doesn't know how to use libraries that have changed significantly since then, or how to use new framework features. This could potentially be addressed by increasing the token limit and adding a command that lets the AI fetch documentation from the internet (see the second sketch below).
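To make the PatchFile idea above concrete, here is a hypothetical sketch of the backend half, assuming the model could be coaxed into emitting unified diffs (which, as noted, it currently can't do reliably). The `diff` (jsdiff) package can apply such diffs; nothing like this exists in GPTinker yet.

```typescript
import { promises as fs } from "node:fs";
import { applyPatch } from "diff"; // jsdiff, `npm install diff`

// Hypothetical PatchFile tool: apply a unified diff emitted by the model
// instead of having it rewrite the whole file.
async function patchFile(path: string, unifiedDiff: string): Promise<string> {
  const original = await fs.readFile(path, "utf8");
  const patched = applyPatch(original, unifiedDiff);
  if (patched === false) {
    // Report the failure back to the model so it can retry the diff
    // or fall back to a full WriteFile.
    return `PatchFile failed: the diff did not apply cleanly to ${path}`;
  }
  await fs.writeFile(path, patched);
  return `Patched ${path}`;
}
```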
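And a similarly hypothetical sketch for the documentation idea: a FetchDocs command that pulls a page and strips the markup before handing it back to the model. The hard-coded truncation is a stand-in for smarter summarization or chunking, since the token budget is the real constraint here.

```typescript
// Hypothetical FetchDocs tool: grab a documentation page and return plain
// text the model can read.
async function fetchDocs(url: string, maxChars = 8000): Promise<string> {
  const res = await fetch(url); // global fetch, Node 18+
  const html = await res.text();
  const text = html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, "") // drop scripts/styles
    .replace(/<[^>]+>/g, " ") // crude tag stripping
    .replace(/\s+/g, " ")
    .trim();
  return text.slice(0, maxChars); // keep it within the token budget
}
```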
I'm curious to hear what you guys think about this approach and how it could be improved. PRs are welcome! It might be a dead end, but hey, you won't know unless you try, right?
[1] https://github.com/maciej-trebacz/gptinker