MergeKit: Advancing AI Model Training for Enhanced Language Models
Why train when you can merge?
Wouldn’t it be amazing if you could acquire new knowledge just by uploading it into your brain?
With MergeKit, that might be possible, though not for you but for your copilot. For direct integration with your own brain, we may have to wait until Neuralink ships its product.
Overview
In my previous post, I observed that Mixture of Experts models can outperform "single brain" models. Pre-trained large language models are powerful general-purpose tools that can be fine-tuned further, yet a collection of highly specialized models combined in a Mixture of Experts setting often performs better.
A popular trend in LLM design is to use ever more parameters; GPT-4 is reported to have more than a trillion. I am convinced, however, that this sheer size and complexity will be an insurmountable challenge not only for hackers but also for researchers who cannot find affordable hardware to work with such models.
This is where MergeKit comes in: a toolkit designed to simplify and streamline the creation of Mixture of Experts models by merging existing LLMs.
MergeKit is a free GitHub project for creating merges of pre-trained models that "can be run entirely on CPU or accelerated with as little as 8 GB of VRAM. Many merging algorithms are supported."
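To make this concrete, here is a sketch of what a merge looks like in practice, based on MergeKit's documented YAML configuration format. The model names and weights below are placeholders, and the exact schema may vary between MergeKit versions:

```python
# Sketch of a MergeKit merge configuration (placeholder model names and weights).
# The config is written to disk and handed to MergeKit's CLI, for example:
#   mergekit-yaml merge_config.yml ./merged-model
config = """\
models:
  - model: mistralai/Mistral-7B-v0.1      # placeholder: first source model
    parameters:
      weight: 0.5
  - model: HuggingFaceH4/zephyr-7b-beta   # placeholder: second source model
    parameters:
      weight: 0.5
merge_method: linear   # simple weighted average of the parameters
dtype: float16
"""

with open("merge_config.yml", "w") as f:
    f.write(config)
print("wrote merge_config.yml")
```

The `linear` method averages matching weight tensors, which is also why mismatched tensor shapes between models cause merges to fail.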
Problems
Here is what I have noticed so far when working with MergeKit:
LLM Heterogeneity: LLMs from different families often have different architectures and parameterizations (for example, different vocabulary sizes), which makes it difficult to merge them seamlessly. I am currently debugging this error message:
ValueError: Trying to set a tensor of shape torch.Size([32003, 4096]) in "weight" (which has shape torch.Size([32001, 4096])), this look incorrect.
Scalability: While the merging process itself is quite straightforward, running inference on the merged models can be computationally expensive and memory-intensive. For some reason, the merged model only uses one GPU and offloads a lot of work to the CPU.
Performance Trade-offs: Different merging techniques can lead to very different performance outcomes, making it challenging to select the optimal approach for a given application. One of the prompts I used in my previous article took about six hours to complete; obviously, that does not work in production.
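The shape mismatch quoted above (32003 vs. 32001 rows) is a vocabulary-size mismatch: one model's tokenizer has two extra special tokens, so its embedding matrix has two extra rows. With Hugging Face transformers, this is typically fixed by calling `model.resize_token_embeddings(len(tokenizer))` on each model before merging. The underlying idea can be sketched with plain lists; `pad_embeddings` is a hypothetical helper for illustration, not part of MergeKit:

```python
def pad_embeddings(embeddings, target_rows):
    """Pad an embedding matrix (a list of row vectors) with zero rows
    so two models' vocabularies line up before their weights are merged.
    Hypothetical helper for illustration only."""
    if len(embeddings) >= target_rows:
        return embeddings
    dim = len(embeddings[0])
    padding = [[0.0] * dim for _ in range(target_rows - len(embeddings))]
    return embeddings + padding

# Toy example: a 3-row embedding matrix padded to match a 5-token vocabulary.
small = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
padded = pad_embeddings(small, 5)
print(len(padded))  # 5
```

Padding with zeros keeps the original token embeddings untouched; the newly added rows correspond to tokens the model has never seen and would normally be trained or copied from the other model afterwards.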
Solution
What I liked about Mergekit