Replace a fine-tuned text-davinci-003 with Mixtral 8x7B

Leonardo is overpowered by a strong Mistral wind

In a few short hours, OpenAI will shut down access to a few old large language models, the text-davinci series. Anyone whose project depends on a fine-tuned version of these models, and who hasn’t found a replacement yet, might soon feel some pain. I was in that cohort until I found out how straightforward it is to fine-tune a new generation of open-weight LLMs and how good the quality of inference is.

text-davinci-003 and Mixtral-8x7B are vastly different. The former had 175 billion parameters, and not much is known about its training method apart from the fact that it’s a transformer model. Mixtral-8x7B is also a transformer model, but it uses a sparse Mixture of Experts architecture, and its weights are published and licensed under Apache 2.0. It has 46.7B total parameters but uses only 12.9B parameters per token. This is all to say that you shouldn’t expect a “plug-and-play” sort of solution in terms of output, but I am confident that with some prompt-fu anyone should be able to match the quality of results they obtained with the old model, or beat it.

If you decide to try this out, the only things you’ll need are 1) the training set you originally used with OpenAI; and 2) access to one or more GPUs with at least 45GB of memory. I use Paperspace (recently acquired by DigitalOcean) and am very happy with their service and pricing. You only need a beefy GPU during the training phase, because Mixtral is fast enough to produce between 2 and 3 tokens per second while running on a modern CPU, which could be good enough for a lot of projects that do not rely on real-time interactions. If that’s your case, you should expect to pay just a few dollars for training your model.
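To get a rough feel for the memory involved, here is a back-of-the-envelope sketch of how much space the weights alone take at different precisions. The 46.7B figure comes from the paragraph above; the precisions listed are my own assumptions for illustration, not a statement of what any particular notebook uses. Training needs additional memory for activations, optimizer state and any adapter weights, so treat these numbers as a lower bound.

```python
# Weights-only memory estimate for Mixtral-8x7B at a few common precisions.
# TOTAL_PARAMS is the total parameter count quoted in the post; the precision
# table is an illustrative assumption, not a description of a specific setup.
TOTAL_PARAMS = 46.7e9

BYTES_PER_PARAM = {
    "float16": 2.0,  # 16 bits per weight
    "int8": 1.0,     # 8 bits per weight
    "4-bit": 0.5,    # 4 bits per weight
}


def weight_footprint_gb(n_params: float, bytes_per_param: float) -> float:
    """Return the size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for name, size in BYTES_PER_PARAM.items():
        print(f"{name:>8}: {weight_footprint_gb(TOTAL_PARAMS, size):.1f} GB")
```

Note that all 46.7B parameters have to be resident in memory even though only 12.9B are used per token, because the router can pick different experts for each token.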

The training set

OpenAI required a jsonl file that looked like this:

{"prompt":"This is a prompt.\n","completion":" And this is a completion\n"}
{"prompt":"This is another prompt.\n","completion":" And this is another completion\n"}

For Mixtral, you’ll need to format your data this way:

{"version": "0.1.0",
 "data": [{"text": "<s>[INST] This is a prompt. [/INST] And this is a completion </s>"},
          {"text": "<s>[INST] This is another prompt. [/INST] And this is another completion. </s>"}]}
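Converting from one format to the other can be scripted. The following is a minimal sketch, not the code from the notebook: the function names and file paths are hypothetical, and it assumes your original file only used whitespace (leading space, trailing newline) around prompts and completions. If you added custom separators or stop sequences for OpenAI, you would need to strip those as well.

```python
import json


def convert_record(record: dict) -> dict:
    """Turn one OpenAI-style prompt/completion pair into a single instruction string."""
    # Strip the whitespace OpenAI's format expected around prompts and completions.
    prompt = record["prompt"].strip()
    completion = record["completion"].strip()
    return {"text": f"<s>[INST] {prompt} [/INST] {completion} </s>"}


def convert_jsonl(lines) -> dict:
    """Convert an iterable of JSONL lines into the wrapped structure shown above."""
    data = [convert_record(json.loads(line)) for line in lines if line.strip()]
    return {"version": "0.1.0", "data": data}


def convert_file(src: str, dst: str) -> None:
    """Read an OpenAI JSONL file at src and write the converted JSON to dst."""
    with open(src, encoding="utf-8") as f:
        converted = convert_jsonl(f)
    with open(dst, "w", encoding="utf-8") as f:
        json.dump(converted, f, ensure_ascii=False, indent=1)


if __name__ == "__main__":
    sample = [
        '{"prompt":"This is a prompt.\\n","completion":" And this is a completion\\n"}',
        '{"prompt":"This is another prompt.\\n","completion":" And this is another completion\\n"}',
    ]
    print(json.dumps(convert_jsonl(sample), indent=1))
```

For a real dataset you would call something like convert_file("openai_train.jsonl", "mixtral_train.json"), where both file names are placeholders.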

The code

Here’s the notebook I’ve worked on. You’ll find several similar examples for fine-tuning your model with Mixtral. Most of the code I used is a subset of the excellent notebook by brevdev. You might need to make some changes, for instance if you do not have both a training and an evaluation set.

Using your fine-tuned model

As mentioned earlier, it’s unlikely that your model will work out of the box exactly as it did with text-davinci-003. In my next post, I will discuss a couple of strategies I use to enhance generation quality to a place far beyond what davinci-003 could output using both good-ol’ NLP and new RAG strategies. It’s important to understand that the task you are trying to accomplish will largely determine the methods and results. I will use as a test case the Infinite Conversation, which should make for some fun examples.
