Introducing MPT-7B, the latest entry in our MosaicML Foundation Series. MPT-7B is a transformer trained from scratch on 1T tokens of text and code. It is open source, available for commercial use, and matches the quality of LLaMA-7B. MPT-7B was trained on the MosaicML platform in 9.5 days with zero human intervention at a cost of ~$200k. Starting today, you can train, finetune, and deploy your own private MPT models, either starting from one of our checkpoints or training from scratch. For inspiration, we are also releasing three finetuned models in addition to the base MPT-7B: MPT-7B-Instruct, MPT-7B-Chat, and MPT-7B-StoryWriter-65k+, the last of which uses a context length of 65k tokens!
While our models still hold a slight edge in terms of quality, the gap is closing astonishingly quickly. Open-source models are faster, more customizable, more private, and pound-for-pound more capable. They are doing things with $100 and 13B params that we struggle with at $10M and 540B. And they are doing so in weeks, not months. This has profound implications for us:
We have no secret sauce. Our best hope is to learn from and collaborate with what others are doing outside Google. We should prioritize enabling 3P integrations.
People will not pay for a restricted model when free, unrestricted alternatives are comparable in quality. We should consider where our value add really is.
Giant models are slowing us down. In the long run, the best models are the ones which can be iterated upon quickly. We should make small variants more than an afterthought, now that we know what is possible in the <20B parameter regime.
At the beginning of March the open source community got their hands on their first really capable foundation model, as Meta’s LLaMA was leaked to the public. It had no instruction or conversation tuning, and no RLHF. Nonetheless, the community immediately understood the significance of what they had been given.
Most importantly, they have solved the scaling problem to the extent that anyone can tinker. Many of the new ideas are from ordinary people. The barrier to entry for training and experimentation has dropped from the total output of a major research organization to one person, an evening, and a beefy laptop.
Why We Could Have Seen It Coming
In many ways, this shouldn’t be a surprise to anyone. The current renaissance in open source LLMs comes hot on the heels of a renaissance in image generation. The similarities are not lost on the community, with many calling this the “Stable Diffusion moment” for LLMs.
In both cases, low-cost public involvement was enabled by a vastly cheaper mechanism for fine tuning called low rank adaptation, or LoRA, combined with a significant breakthrough in scale (latent diffusion for image synthesis, Chinchilla for LLMs). In both cases, access to a sufficiently high-quality model kicked off a flurry of ideas and iteration from individuals and institutions around the world. In both cases, this quickly outpaced the large players.
The effect was palpable: the open model rapidly came to dominate in cultural impact, while OpenAI's comparable offering became increasingly irrelevant. Whether the same thing will happen for LLMs remains to be seen, but the broad structural elements are the same.
What We Missed
The innovations that powered open source’s recent successes directly solve problems we’re still struggling with. Paying more attention to their work could help us to avoid reinventing the wheel.
LoRA is an incredibly powerful technique we should probably be paying more attention to
LoRA works by representing model updates as low-rank factorizations, which reduces the size of the update matrices by a factor of up to several thousand. This allows model fine-tuning at a fraction of the cost and time. Being able to personalize a language model in a few hours on consumer hardware is a big deal, particularly for aspirations that involve incorporating new and diverse knowledge in near real-time. Yet this technology is underexploited inside Google, even though it directly impacts some of our most ambitious projects.
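To make the mechanism concrete, here is a minimal sketch of the idea in PyTorch. The class name, rank, and scaling factor are illustrative assumptions, not a description of any production implementation:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Illustrative LoRA wrapper: the pretrained weight stays frozen, and the update
    # is parameterized as the product of two small matrices, B (out x r) and A (r x in).
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights are not trained
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen base output plus the low-rank correction x A^T B^T.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65536 trainable parameters vs ~16.8M in the frozen weight (~256x smaller)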
Retraining models from scratch is the hard path
Part of what makes LoRA so effective is that – like other forms of fine-tuning – it’s stackable. Improvements like instruction tuning can be applied and then leveraged as other contributors add on dialogue, or reasoning, or tool use. While the individual fine tunings are low rank, their sum need not be, allowing full-rank updates to the model to accumulate over time.
This means that as new and better datasets and tasks become available, the model can be cheaply kept up to date, without ever having to pay the cost of a full run.
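A toy numerical check of that claim (the dimensions, ranks, and number of fine-tunings below are arbitrary, chosen only to illustrate the point):

import numpy as np

rng = np.random.default_rng(0)
d, r, n_finetunes = 64, 4, 20

# Each fine-tuning contributes its own rank-r update B @ A to the same weight matrix.
updates = [rng.standard_normal((d, r)) @ rng.standard_normal((r, d)) for _ in range(n_finetunes)]

print(np.linalg.matrix_rank(updates[0]))    # 4  -- any single update is low rank
print(np.linalg.matrix_rank(sum(updates)))  # 64 -- but their accumulated sum is full rank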
By contrast, training giant models from scratch not only throws away the pretraining, but also any iterative improvements that have been made on top. In the open source world, it doesn’t take long before these improvements dominate, making a full retrain extremely costly.
We should be thoughtful about whether each new application or idea really needs a whole new model. If we really do have major architectural improvements that preclude directly reusing model weights, then we should invest in more aggressive forms of distillation that allow us to retain as much of the previous generation’s capabilities as possible.
Large models aren’t more capable in the long run if we can iterate faster on small models
LoRA updates are very cheap to produce (~$100) for the most popular model sizes. This means that almost anyone with an idea can generate one and distribute it. Training times under a day are the norm. At that pace, it doesn’t take long before the cumulative effect of all of these fine-tunings overcomes starting off at a size disadvantage. Indeed, in terms of engineer-hours, the pace of improvement from these models vastly outstrips what we can do with our largest variants, and the best are already largely indistinguishable from ChatGPT. Focusing on maintaining some of the largest models on the planet actually puts us at a disadvantage.
Data quality scales better than data size
Many of these projects are saving time by training on small, highly curated datasets. This suggests there is some flexibility in data scaling laws. The existence of such datasets follows from the line of thinking in Data Doesn’t Do What You Think, and they are rapidly becoming the standard way to do training outside Google. These datasets are built using synthetic methods (e.g. filtering the best responses from an existing model) and scavenging from other projects, neither of which is dominant at Google. Fortunately, these high quality datasets are open source, so they are free to use.
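As a sketch of what that synthetic filtering can look like in practice (the scoring function, threshold, and field names here are hypothetical, not a reference to any specific project's pipeline):

def curate(candidates, score_fn, threshold=0.8, max_examples=50_000):
    # Keep only the highest-scoring (prompt, response) pairs for fine-tuning.
    kept = []
    for example in candidates:
        score = score_fn(example["prompt"], example["response"])  # e.g. a reward model or simple heuristic
        if score >= threshold:
            kept.append({**example, "score": score})
    kept.sort(key=lambda e: e["score"], reverse=True)
    return kept[:max_examples]

# A small, aggressively filtered set like this often stands in for a much larger raw corpus.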
Directly Competing With Open Source Is a Losing Proposition
This recent progress has direct, immediate implications for our business strategy. Who would pay for a Google product with usage restrictions if there is a free, high quality alternative without them?
Keeping our technology secret was always a tenuous proposition. Google researchers are leaving for other companies on a regular cadence, so we can assume they know everything we know, and will continue to for as long as that pipeline is open.
But holding on to a competitive advantage in technology becomes even harder now that cutting edge research in LLMs is affordable. Research institutions all over the world are building on each other’s work, exploring the solution space in a breadth-first way that far outstrips our own capacity. We can try to hold tightly to our secrets while outside innovation dilutes their value, or we can try to learn from each other.
Individuals are not constrained by licenses to the same degree as corporations
Much of this innovation is happening on top of the leaked model weights from Meta. While this will inevitably change as truly open models get better, the point is that they don’t have to wait. The legal cover afforded by “personal use” and the impracticality of prosecuting individuals means that individuals are getting access to these technologies while they are hot.
Being your own customer means you understand the use case
Browsing through the models that people are creating in the image generation space, one finds a vast outpouring of creativity, from anime generators to HDR landscapes. These models are used and created by people who are deeply immersed in their particular subgenre, lending a depth of knowledge and empathy we cannot hope to match.
Owning the Ecosystem: Letting Open Source Work for Us
Paradoxically, the one clear winner in all of this is Meta. Because the leaked model was theirs, they have effectively garnered an entire planet’s worth of free labor. Since most open source innovation is happening on top of their architecture, there is nothing stopping them from directly incorporating it into their products.
The value of owning the ecosystem cannot be overstated. Google itself has successfully used this paradigm in its open source offerings, like Chrome and Android. By owning the platform where innovation happens, Google cements itself as a thought leader and direction-setter, earning the ability to shape the narrative on ideas that are larger than itself.
The more tightly we control our models, the more attractive we make open alternatives. Google and OpenAI have both gravitated defensively toward release patterns that allow them to retain tight control over how their models are used. But this control is a fiction. Anyone seeking to use LLMs for unsanctioned purposes can simply take their pick of the freely available models.
Google should establish itself as a leader in the open source community, taking the lead by cooperating with, rather than ignoring, the broader conversation. This probably means taking some uncomfortable steps, like publishing the model weights for small ULM variants. This necessarily means relinquishing some control over our models. But this compromise is inevitable. We cannot hope to both drive innovation and control it.
Epilogue: What about OpenAI?
All this talk of open source can feel unfair given OpenAI’s current closed policy. Why do we have to share, if they won’t? But the fact of the matter is, we are already sharing everything with them in the form of the steady flow of poached senior researchers. Until we stem that tide, secrecy is a moot point.
And in the end, OpenAI doesn’t matter. They are making the same mistakes we are in their posture relative to open source, and their ability to maintain an edge is necessarily in question. Open source alternatives can and will eventually eclipse them unless they change their stance. In this respect, at least, we can make the first move.
According to the list of global systemically important banks (Wikipedia, Financial Stability Board (FSB), FSB PDF), JP Morgan Chase is top dog as the only Tier 4 bank. (The higher the tier, the more systemic risk the bank poses to the financial system, so the required capital buffer is higher at each tier.)
A systemically important financial institution (SIFI) is a bank, insurance company, or other financial institution whose failure might trigger a financial crisis. They are colloquially referred to as “too big to fail“. [Wikipedia]
According to the Bank for International Settlements (BIS), which has a dashboard showing scores and components for Global Systemically Important Banks (GSIBs), JP Morgan Chase is by itself in Tier 4 with the highest overall risk rating as the most interconnected bank with the most complex banking relationships.
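For context, the FSB bucket structure maps each tier to an additional capital buffer roughly as follows. The figures are my recollection of the published FSB list, not taken from this post; the linked FSB PDF is the authoritative source:

# Approximate FSB G-SIB buckets and the additional common equity buffer each requires
# (recollection only; verify against the FSB PDF linked above).
gsib_buffers = {
    1: 0.010,  # +1.0% additional CET1
    2: 0.015,  # +1.5%
    3: 0.020,  # +2.0%
    4: 0.025,  # +2.5%  <- JP Morgan Chase, the only bank in this bucket
    5: 0.035,  # +3.5%  (empty bucket)
}
print(f"Tier 4 surcharge: {gsib_buffers[4]:.1%}")  # 2.5%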
If JP Morgan Chase were to fail, the financial system would be at high risk of a financial crisis. Which puts JPM Chase in an interesting position: the global financial system is strongly incentivized to keep JPM from failing, and, if an institution is going to fail, putting the most complex and interconnected financial institution at risk maximizes the likelihood of another bailout.
Some of you may remember from 2 years ago (April 15, 2021) that JP Morgan sold $13B in bonds in the largest bank deal ever at the time (SuperStonk, Bloomberg) to raise money. The next day, Bank of America took the lead by selling $15B worth of bonds (SuperStonk DD, Bloomberg, April 16, 2021).
So if JP Morgan needed to raise some serious money without getting a bailout, buying another bank in a sweetheart deal seems like another way to juice up JP Morgan’s balance sheet with some good PR. According to CNN Business,
First Republic … had assets of $229.1 billion as of April 13. As of the end of last year, it was the nation’s 14th-largest bank, according to a ranking by the Federal Reserve. JPMorgan Chase is the largest bank in the United States with total global assets of nearly $4 trillion as of March 31.
Now that JP Morgan has picked up First Republic, its total assets increase by about $229B (about 5.7%). And, according to Reuters [Factbox], JP Morgan just got a pretty sweet deal with First Republic Bank:
JPMorgan Chase will pay $10.6 billion to the Federal Deposit Insurance Corp (FDIC)
Will not assume First Republic’s corporate debt or preferred stock
The FDIC absorbs a portion of the loss on assets sold when resolving a failed bank, "sharing the loss with the purchaser of the failed bank". Sounds like the FDIC just took one for the team.
According to the FDIC, loss sharing is basically an 80/20 split (the 95/5 split used after the 2008 Great Financial Crisis has since ended).
According to the FDIC, resolving a failed bank with loss sharing is supposed to be based on the least costly option (to the Deposit Insurance Fund). (We've seen this least costly option come up before in resolving bank failures, with the FDIC and Federal Reserve contemplating requiring Too Big To Fail banks to sell destined-to-fail bonds to absorb losses and reduce payouts by the FDIC Deposit Insurance Fund [SuperStonk].)
The FDIC will take a $13 billion hit to its fund and provide $50 billion in financing.
Wait, the FDIC is providing $50B to finance JPM buying FRC?!
The FDIC loaned JP Morgan $50B to buy the failed First Republic bank for $10.6B. $30B of that was used to repay a rescue deal from March (last month) backed by JP Morgan, Citigroup, Bank of America, and Wells Fargo. Which means JP Morgan gets their money back from the previous rescue plus an extra $9.4B out of this loan deal to buy $229B worth of assets from First Republic.
On top of that, JP Morgan splits losses with the FDIC 80/20, with the FDIC covering 80% of loan losses for the next 5-7 years (5 years for commercial loans and 7 years for residential mortgages).
Imagine if a bank loans you $9,400 to buy a $229,000 house. No down payment. Just "here's $9,400 and the keys to that $229,000 house". Oh, and the bank will cover 80% of the cost for anything that breaks in the house for the next 5-7 years. This is an insane deal.
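Putting the figures cited above side by side (all numbers are the ones quoted from Reuters and CNN Business; the percentages are simple arithmetic on them):

# Back-of-the-envelope arithmetic on the figures cited above, in billions of dollars.
fdic_financing = 50.0   # financing provided by the FDIC
march_rescue   = 30.0   # repays the March rescue backed by JPM, Citi, BofA, and Wells Fargo
purchase_price = 10.6   # what JPMorgan pays the FDIC for First Republic
frc_assets     = 229.1  # First Republic assets as of April 13
jpm_assets     = 4000.0 # JPMorgan total global assets, roughly, as of March 31

net_new_cash = fdic_financing - march_rescue - purchase_price
print(round(net_new_cash, 1))                   # 9.4  -> the "extra $9.4B"
print(round(frc_assets / jpm_assets * 100, 1))  # 5.7  -> percent growth in JPM's balance sheet
# Loss share: for every $1B of covered loan losses over the next 5-7 years,
# the FDIC absorbs ~$0.8B and JPMorgan bears ~$0.2B.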
Which truly makes one wonder if this is a “not-a-bailout” bailout for JP Morgan, the only Tier 4 GSIB as the most interconnected bank with the most complex banking relationships and the highest overall systemic risk rating.
Are we going to see:
Fat bonuses at JP Morgan?
Followed by news about JP Morgan posing a systemic risk?
Followed by calls to bail out JP Morgan to save pensions?