Open Source Race: The "Linux Moment" for Large AI Models Has Arrived
Written by: Song Jiaji, Sun Shuang
Shortly after the release of ChatGPT, Meta open sourced the GPT-like large language model LLaMA. Since then, a number of large models such as Alpaca, Vicuna, and Koala have appeared, achieving impressive performance at a model scale and cost far lower than ChatGPT. This has led industry insiders to worry that "neither Google nor OpenAI has a moat; the threshold for large models is being broken by open source, and they will be replaced if they do not cooperate." The capital market is also watching the future competitive landscape of large models: if models are small, is massive computing power no longer needed? What role does data play? This report analyzes the common traits of this wave of open source large language models, reviews the development history of the open source benchmark Linux, and attempts to answer these questions.
**Common point 1: starting from open source.** Open source ≠ free. Open source business models include at least: 1. monetization through services: one example is Red Hat, the Linux enterprise-services company that was once listed and later acquired by IBM; enterprises are willing to pay for more stable and timely technical support; 2. monetization through licensing fees: Android is open source, but Google charges EU manufacturers license fees for using the Android Google suite. In addition, the development of licenses, standards, and capability evaluation systems will catalyze the deeper commercialization of open source large models. The licenses adopted by this wave of open source large models are mainly Apache 2.0 and MIT; they do not prohibit commercial use, nor do they prohibit users from modifying the model and then closing the source, which helps companies apply such models.
**Common point 2: fewer parameters, smaller models.** Compared with GPT-3 and later super-large models with more than 100 billion parameters, the parameters of this wave of open source large models are generally in the range of billions to tens of billions. There is currently no systematic performance evaluation system for large models, and only some tasks have highly credible scoring standards. Among open source large models, Vicuna is among the more capable, reaching 92% of GPT-4's performance on some tasks. Generally speaking, OpenAI's GPT series is still the strongest, but its training cost is high and hard to reproduce. Open source large models achieve low training cost and high performance with the help of larger token training datasets, DeepSpeed, RLHF, and similar techniques, and the barriers to building large models below the super-large scale are disappearing.
**Common point 3: the datasets emphasize human instructions and are moving toward commercial availability.** An important factor in ChatGPT's substantial improvement over GPT-3 is the use of RLHF (reinforcement learning from human feedback): during training, human-written answers and human rankings of AI-generated content are used to "align" the AI with human preferences. LLaMA itself does not use instruction fine-tuning, but many large models released after LLaMA use and open source instruction datasets, and are gradually building their own instruction datasets instead of relying on OpenAI's commercially restricted ones, which further lowers the threshold for reproducing GPT and expands commercial availability.
**How to view open source large models going forward?** Standing in this wave of open source large models, we notice two trends: 1) integration with multimodality: Tsinghua University's VisualGLM-6B is a multimodal upgrade of the well-known open source language model ChatGLM, and we believe its ability to be deployed locally on consumer-grade graphics cards is the general trend; 2) open source models plus edge computing accelerating the commercialization of AI: Harbin Institute of Technology's Chinese medical-consultation model "Huatuo" and its use in cross-border e-commerce are examples.
Investment suggestion: we believe views on large models should be layered over time. 1. In the short term, OpenAI's GPT-based super-large models still surpass other open source large models, so attention should focus on Microsoft, which cooperates deeply with OpenAI in both equity and products; Apple, which can take a share of ChatGPT iOS app revenue; and computing power providers for super-large models such as Nvidia. 2. In the medium and long term, if the capabilities of some open source large models are further verified, applications will roll out quickly, and large models will form a positive cycle with computing power. 3. Others: edge computing power, big data companies, and the open source large model service business model also deserve attention. Suggested names: 1) optical module providers: Zhongji InnoLight, Xinyisheng, Tianfu Communication, Yuanjie Technology; 2) smart module providers: MeiG Smart, Fibocom; 3) edge IDC providers: Longyu shares, Wangsu Technology; 4) AIoT communication chip and equipment makers: ZTE, Tsinghua Unigroup, Ruijie Networks, Feiling Kesi, Fii, Aojie Technology, Chuling Information; 5) application terminals: Yingying Network, Shenzhou Taiyue, Jiaxun Feihong, Zhongke Jincai, etc.
**Risk reminder: ethical risk, market competition risk, policy and legal regulation risk.**
1. Introduction
A report sparked intense public interest in open source large language models.
1.1 "Neither Google nor OpenAI has a moat, and the threshold of large models is being broken by open source"
** "Unless Google and OpenAI change their attitudes and choose to cooperate with the open source community, they will be replaced by the latter", **According to Bloomberg and SemiAnalysis reports, in early April, Google engineer Luke Sernau stated that in the artificial intelligence big language model ( Large Language Models, LLM, hereinafter referred to as "large model") track, Google and OpenAI, the launcher of ChatGPT, have no moat, and the open source community is winning the race.
This argument pushed public attention to the phenomenon of "a large number of large models appearing after Meta open sourced LLaMA at the beginning of the year" to a climax. Among the three key elements of model, computing power, and data, what is the future competitive landscape of large models? If models are small, will massive computing power no longer be needed? What role does data play? This report analyzes the common traits of this wave of open source large models, reviews the development history of the open source benchmark Linux, answers the above questions, and looks ahead to the future of large models.
1.2 Open source large models are appearing in rapid succession, forming a clear trend
On February 24, Meta released the open source large model LLaMA. Since then, a number of large models have emerged in the market, which can be roughly divided into three categories.
1.2.1 "LLaMA series": good performance, but low degree of commercialization
**LLaMA comes in four parameter versions (7 billion/13 billion/33 billion/65 billion), is not commercially available, its derivatives' instruction datasets are based on OpenAI output, and its performance can match or exceed GPT-3.** The 7-billion and 13-billion parameter versions are pre-trained on a dataset of 1 trillion tokens; the 33-billion and 65-billion parameter versions are pre-trained on 1.4 trillion tokens. Compared with GPT-3, the 7-billion parameter version of LLaMA performs at the same level on common-sense reasoning, zero-shot tasks, natural questions, and reading comprehension, while the versions with 13 billion parameters and above outperform GPT-3 in these areas.
LLaMA itself does not use an instruction dataset, but considering that ChatGPT, which outperforms GPT-3, uses a human instruction dataset, a batch of open source large models built on LLaMA use OpenAI's instruction data to optimize model performance, including Alpaca, GPT4All, Vicuna, Koala, Open Assistant, and HuggingChat. Because the OpenAI instruction data is not commercially available, these LLaMA-based open source large models are not commercially available either.
1.2.2 Dolly 2.0, RedPajama, StableLM, etc.: high degree of commercial availability
These large models do not use the OpenAI instruction dataset, so they are commercially available, but most are still under active development.
1.2.3 The Chinese "twin stars": ChatGLM-6B and MOSS
ChatGLM-6B and MOSS were launched by research groups at Tsinghua University and Fudan University respectively, and are well known in the Chinese community.
The models also share some commonalities, which the report details below.
2. Common Point 1: Starting from Open Source
**In this wave, whether it is the model itself or the datasets it uses, the first common trait is "open source".**
**2.1 Why Open Source?**
The market's key question about open source large models is why they should be open sourced at all, and whether this damages the business model of the large model industry. We reviewed several models' own statements of their reasons for open sourcing, summarized as follows.
2.1.1 Model perspective: prevent monopoly by large companies and remove restrictions on commercial use
The vigorous development of open source large models is expected to democratize artificial intelligence research, bridge the quality gap between open and closed models, and remove restrictions on commercial use.
2.1.2 Data Perspective: Protect corporate secrets and make customized data training possible
**Protect data privacy and allow enterprises to customize development.** For many industries, data is the lifeblood of the enterprise. Open sourcing large models lets enterprises train the models on their own datasets while keeping control of the data and protecting data privacy. At the same time, open source large models allow enterprise developers to customize on top of the model, target the training data, and filter certain topics, reducing model size and the cost of training on the data.
2.1.3 Computing power perspective: reduce computing power costs and make large models "inclusive"
**Open source large models save computing power in the training phase, reduce computing power costs for enterprises, and promote the "inclusive" use of large models.** Total computing power demand = number of scenarios × computing power demand per scenario. In training and using large models, computing power consumption falls into two scenarios: training cost and inference cost. A toy calculation below illustrates the relationship.
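The sketch below only spells out that relationship with made-up placeholder numbers; the scenario count and per-scenario figures are illustrative assumptions, not estimates from this report.

```python
# Toy illustration: total computing power demand = number of scenarios x demand per scenario,
# with each scenario split into a one-off training part and a recurring inference part.
# All numbers are made-up placeholders.
scenarios = 5                       # hypothetical number of deployment scenarios
train_pf_days_per_scenario = 100    # one-off training/fine-tuning compute per scenario
infer_pf_days_per_scenario = 30     # monthly inference compute per scenario

total_training = scenarios * train_pf_days_per_scenario
total_inference_per_month = scenarios * infer_pf_days_per_scenario
print(f"total training compute: {total_training} PF-days")
print(f"total inference compute: {total_inference_per_month} PF-days per month")
```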
**2.2 What Soil Does Open Source Need?**
**The flourishing of open source large models is not without precedent; Linux, the world's largest open source software project, has a similar story.** Studying Linux's development history is a useful reference for looking ahead to the future of open source large models.
2.2.1 Let’s start from the open source benchmark Linux
**Linux is a free and open source operating system released under the GNU General Public License (GPL).** Anyone can run, study, share, and modify the software. Modified code can also be redistributed and even sold, but only under the same license. Traditional operating systems such as Unix and Windows are proprietary, vendor-locked systems delivered as-is that cannot be modified.
Many of the world's largest industries and businesses rely on Linux. Linux is everywhere today, from knowledge-sharing sites like Wikipedia, to the New York Stock Exchange, to mobile devices running Android, a dedicated distribution of the Linux kernel that includes free software. Linux is not only the most commonly used operating system on public Internet servers, but also the only operating system used on the top 500 fastest supercomputers.
**In the server market, Linux's share has far surpassed that of Unix, the "grandfather" of operating systems; the "Linux moment" has happened.** Taking the Chinese market as an example, according to CCID Consulting data on installed base, Linux is the mainstream server operating system with an absolute lead at 79.1% market share, while Windows has fallen to 20.1% and Unix holds only 0.8%.
2.2.2 Linux did not succeed alone: the open source history behind the community
Unix was once open source, providing the spark for Linux
**Unix, the originator of the modern operating system.** An operating system is the software that directly manages system hardware and resources (such as CPU, memory, and storage). It sits between applications and hardware and is responsible for connecting all software with the relevant physical resources. Unix is considered by many to be the ancestor of modern operating systems.
**Unix was once open source.** The world's first general-purpose computer was born in 1946, and Unix was developed in 1969. For about a decade, AT&T, the owner of Unix, licensed the Unix source code to academic institutions for research or teaching at low or no cost. Many institutions expanded and improved this source code, forming the so-called "Unix variants". Later, AT&T realized the commercial value of Unix, stopped licensing the source code to academic institutions, and asserted copyright over the earlier Unix and its variants.
After returning to closed source, Unix became too expensive, which drove the development of Linux
Linux was designed and released by Linus Torvalds in 1991. He was still in college at the time and felt that Unix, the popular commercial operating system, was too expensive, so he developed Linux on the model of the Unix-like operating system Minix and opened it to people who, like himself, could not afford the commercial system.
Minix, built for teaching only, inspired the development of Linux
After AT&T privatized the source code, Tanenbaum, a professor at Vrije University Amsterdam in the Netherlands, decided to develop a UNIX-compatible homework without using any AT&T source code in order to teach students the practical details of operating system operations in class. system to avoid copyright disputes. He called it MINIX with the meaning of mini-UNIX (mini-UNIX). The first version of MINIX was released in 1987, and you only need to buy its disk to use it. Before the Linux system did not have its own native file system, the file system of Minix was used.
Support from the open source community, licenses, and standards
**Open source from the start.** In August 1991, Linux founder Linus Torvalds posted Linux to the Minix Usenet newsgroup, and then released it on an FTP site because he wanted more people to develop the kernel together.
**The license helped the ecosystem flourish.** Linux is licensed under the GNU GPL (GNU General Public License). The GPL grants users the four freedoms of "free software", also known as "copyleft": the freedom to run, study, share, and modify the software.
The GPL requires that derivative works of a GPL program also follow the GPL. In contrast, BSD-style licenses do not prohibit derivative works from becoming proprietary software. The GPL is the most popular license for free and open source software. Compliance with the GPL has allowed the Linux ecosystem to keep thriving rather than drift into a "dead end" where it can no longer develop.
**Standards keep the ecosystem "scattered in form but unified in spirit" and allow it to embrace the "giant whales".**
**2.3 How Does Open Source Make Money?**
The market's core question about "open source" is the business model. "Open source" itself is free, but open source is the soil: the open source community has bred a variety of business models, and the Linux ecosystem offers lessons here.
2.3.1 Red Hat: Service First
Red Hat is a leader in the Linux ecosystem, trusted by more than 90% of Fortune 500 companies, and has enormous commercial value as a company. Red Hat was founded in 1993 and listed on Nasdaq in 1999. According to its prospectus, citing IDC data, 56% of all licensed new installations of the Linux operating system in 1998 came from Red Hat. In 2012, Red Hat became the first open source technology company with more than $1 billion in revenue; in 2019, IBM acquired Red Hat for approximately $34 billion.
As for the business model of Linux and Red Hat, Qdaily (Curiosity Daily) once offered an analogy: the open source Linux kernel is like a free, open recipe, while Red Hat is like a restaurant; people are still willing to go to the restaurant to taste the prepared dishes and enjoy attentive service. Red Hat provides enterprises with Linux operating systems and subscription services, mainly including: 1. 24x7 technical support; 2. cooperation with upstream communities and hardware manufacturers to support a wide range of hardware architectures such as x86, ARM, and IBM Power; 3. continuous vulnerability alerts, targeted guidance, and automated repair services; 4. deployment across multiple clouds; 5. security features such as live kernel patching and security standard certification; 6. detection of performance anomalies, building a comprehensive view of system performance, and applying preset tuning profiles.
2.3.2 Android: backed by Google, monetized through advertising
According to Statcounter, as of April 2023 Android is the world's leading mobile operating system with a 69% market share, far ahead of second-place iOS at 31%. Android is developed on the Linux kernel and was acquired by Google in 2005. Google subsequently released the Android source code under the permissive Apache open source license, enabling manufacturers to quickly launch Android smartphones, which accelerated Android's adoption.
As for the business model, many services pre-installed on Android phones are Google's proprietary products, such as Maps, the Google Play app store, Search, and Gmail. So although Android is free and open source, Google can still use it to capture the mobile market and monetize user traffic.
Google also charges licensing fees directly to Android phone manufacturers. Since October 29, 2018, EU manufacturers of Android-based phones and tablets must pay Google a licensing fee for each device, which can be as high as $40 per device.
2.4 The mainstream licenses of open source large models allow commercial use
The open source community already has well-known licenses such as GPL, BSD, and Apache. For large models, we note that LLaMA, released in February 2023 and the trigger of this wave of open source large models, is prohibited from commercial use and can only be used for research; Meta AI grants access to the model case by case to government bodies, civil society organizations, academics, and industry research laboratories. LLaMA's inference code is released under the GPL 3.0 license, which means: 1) others who modify the inference code cannot close the source; 2) new code must also adopt the GPL. However, we note that some developers have built LLaMA variants under other licenses; for example, Lit-LLaMA, an implementation of LLaMA based on nanoGPT, adds some model weights, and that part of the model is licensed under Apache 2.0.
**The licenses adopted by open source large models are mainly Apache 2.0 and MIT.** Alpaca, Vicuna, Dolly, OpenAssistant, and MOSS use the Apache 2.0 license, while Koala and GPT4All use MIT; both licenses allow commercial use. Unfortunately, Alpaca, Vicuna, Koala, and GPT4All remain non-commercial because of OpenAI or LLaMA restrictions. It is also worth noting that both Apache 2.0 and MIT allow modified source code to be closed, so a company can develop its own model on top of an open source model, which makes these licenses even more attractive to companies.
3. Common Point 2: Open Source Large Models Have Fewer Parameters and Are Smaller
"The size of model parameters" is positively related to "the model's demand for computing power".
**3.1 How Big Are Super-Large Models and Large Models?**
**Pre-training gives the model its basic capabilities.** In natural language processing (NLP), pre-training means training a language model on a large text corpus before fine-tuning it on a specific task, giving the model basic language understanding. During pre-training, the model is trained to predict the next word in a sentence from the preceding context. This can be done by masking some words in the input and asking the model to predict them, or by autoregressive methods (such as GPT) in which the next word is predicted from the previous words in the sentence.
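As a concrete illustration of the autoregressive next-token objective, the minimal sketch below computes the causal language-modeling loss with the Hugging Face transformers library; GPT-2 is used only as a convenient stand-in and is not the training setup of any model discussed in this report.

```python
# Minimal sketch of the autoregressive pre-training objective (predict the next token
# given the previous context). GPT-2 is an illustrative stand-in model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Open source large language models are developing rapidly."
inputs = tokenizer(text, return_tensors="pt")

# For causal language modeling, the labels are the input ids themselves;
# the library shifts them internally so each position predicts the next token.
outputs = model(**inputs, labels=inputs["input_ids"])
print(f"next-token prediction loss: {outputs.loss.item():.3f}")
```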
A pre-trained model usually involves a large number of parameters and a correspondingly large pre-training dataset (usually measured by the number of tokens). In 2017, the Transformer model from the Google Brain team completely changed the face of NLP, enabling models to better understand and process language and improving the accuracy of NLP tasks.
**How big are super-large models and large models?** The size of a language model is measured by its parameter count, which mainly describes the adjustable values of the connection strengths between neurons. At present, the parameters of large language models generally range from billions to tens of billions; those with more than 100 billion parameters are called "super-large models", such as GPT-3 (175 billion parameters).
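A quick way to see what "model size" means in practice is to count trainable parameters. The toy network below is an arbitrary stand-in, many orders of magnitude smaller than the models discussed here.

```python
# Minimal illustration of "model size = number of trainable parameters",
# using a small stand-in network; real large models simply have far more such weights.
import torch.nn as nn

toy_model = nn.Sequential(
    nn.Embedding(50_000, 512),   # token embedding table
    nn.Linear(512, 512),         # one hidden projection
    nn.Linear(512, 50_000),      # output head over the vocabulary
)
n_params = sum(p.numel() for p in toy_model.parameters())
print(f"toy model parameters: {n_params:,}")   # tens of millions, vs. 175 billion for GPT-3
```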
3.2 The GPT super-large models are the most capable but hard to reproduce
**There is no unified standard for evaluating large model performance.** One important reason is that large models generate content for many kinds of tasks, and different application scenarios and tasks may need different metrics and methods to evaluate a model. Some tasks have highly credible scoring standards, such as BLEU in machine translation, but most tasks lack anything similar.
**The fuzzy consensus is that very large models perform well.** The current trend is for large language models to get bigger and bigger, because larger models are more general and more stable after pre-training. For example, Google's super-large model PaLM (540 billion parameters) performs well in both zero-shot and few-shot tests, and its performance keeps improving as the number of training tokens increases. This is not hard to understand: put simply, the more a model has seen, the more it knows.
** "Peer Review", the GPT-based large model "Peerless Beauty". **Currently, the super-large model of the OpenAI GPT system has powerful capabilities and a wide range of applications. It has high accuracy and strong expressiveness when dealing with natural language tasks. It is used in many fields such as text generation, question answering systems, and machine translation. They have all achieved excellent results and have become one of the current benchmarks in the field of natural language processing, and are used as comparison benchmarks by various large models. The threshold for reproducing ChatGPT has not been lowered. Most of the large open source models only perform better in some aspects, and the overall quality is still incomparable with ChatGPT. It remains to be seen.
Recently we have also noted the following evaluation systems. The evaluation methods mainly include automatic machine evaluation (for example, using GPT-4) and blind human evaluation. We highlight a few of them and their results below; under every evaluation system, the GPT-series large models come out on top.
3.2.1 Vicuna: Evaluation with GPT-4
**At present, the performance of most open source large models has not been systematically evaluated; more of them are in the early experimental stage.** Among the open source large models that have evaluated performance, the GPT-4-based evaluation in Vicuna's report is relatively systematic and its results are the most striking.
3.2.2 Zeno Build Evaluation: Newer and more comprehensive
Zeno Build evaluated seven models: GPT-2, LLaMA, Alpaca, Vicuna, MPT-Chat, Cohere Command, and ChatGPT (gpt-3.5-turbo). The results were similar to those of the GPT-4-based evaluation: ChatGPT has a clear advantage, and Vicuna performs best among open source models.
3.2.3 C-Eval: a comprehensive Chinese evaluation suite for foundation models
The C-Eval results show that GPT-4 is the best even in Chinese-language ability, but it only reaches 67% accuracy; large models' Chinese processing ability still has plenty of room for improvement.
3.2.4 GPT super-large models are expensive to train and hard to reproduce in the short term
**ChatGPT requires considerable computing power and training costs.** Setting aside the inference computing power that scales with daily active usage and considering only training: according to the paper "Language Models are Few-Shot Learners", ChatGPT's predecessor GPT-3 (175-billion-parameter version) requires about 3,640 PF-days of compute (that is, performing one quadrillion floating-point operations per second for 3,640 days). Given that a single Nvidia A100 card delivers roughly 0.6 PFLOPS, training GPT-3 (175B) once requires about 6,000 A100 cards; accounting for interconnect losses, roughly tens of thousands of A100s are needed. At about 100,000 yuan per A100, large-scale training requires an investment of around 1 billion yuan. OpenAI spent more than $4 million training GPT-3 (175B), and keeping ChatGPT and GPT-4 (parameter count undisclosed, expected to be higher) running costs, in theory, even more each month.
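The arithmetic behind these figures can be reproduced in a few lines; all inputs below are the report's assumptions rather than measured values.

```python
# Back-of-the-envelope check of the training-cost arithmetic above.
total_compute_pf_days = 3640          # GPT-3 (175B) training compute, in PF-days
a100_pflops = 0.6                     # assumed effective throughput of one Nvidia A100, PFLOPS
a100_price_yuan = 100_000             # assumed price of one A100

# 3,640 PF-days at 0.6 PFLOPS per card ~= 6,067 A100-days,
# i.e. roughly 6,000 cards running for one day (ignoring interconnect losses).
a100_days = total_compute_pf_days / a100_pflops
print(f"A100-days required: {a100_days:,.0f}")

# With interconnect and utilization losses the report assumes on the order of 10,000 cards.
cards_assumed = 10_000
outlay_yuan = cards_assumed * a100_price_yuan
print(f"Hardware outlay at {cards_assumed:,} cards: {outlay_yuan / 1e9:.1f} billion yuan")
```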
3.3 Open source large models are cost-effective, and the barriers to models below the super-large scale are disappearing
**The trend toward smaller open source large models is clear, with parameters around the billions to tens of billions, and cost reduction is part of the point.** Open source large models usually have fewer parameters and require relatively low resources and costs for design, training, and deployment. The parameters of this wave of open source large models are generally small, in the range of billions to tens of billions.
"The boat is small and easy to turn around", fine-tuning based on the existing open source pre-training model is also one of the advantages of the open source large model. Fine-tuning and optimizing on the basis of the pre-trained model to adapt to different tasks and application scenarios, this method can not only greatly reduce the training time and cost of the model, but also improve the performance and efficiency of the model.
**With more token training data and new technologies, the barriers to large models below the super-large scale are disappearing.** LLaMA's open sourcing gave everyone a usable large model, and with the development of technologies such as DeepSpeed and RLHF, models with tens of billions of parameters can be deployed and tuned on consumer-grade GPUs; a minimal fine-tuning sketch follows below.
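To make the consumer-GPU point concrete, here is a hedged sketch of instruction fine-tuning with a parameter-efficient method (LoRA via the peft library). LoRA, the Pythia base model, and the Alpaca data slice are illustrative assumptions rather than choices made in this report; the point is only that a small open model can be adapted on a single consumer-grade card.

```python
# Hedged sketch: parameter-efficient fine-tuning (LoRA) of a small open model on an
# instruction dataset. Model and dataset names are placeholders, not assets from this report.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "EleutherAI/pythia-1b"                      # placeholder model small enough for a consumer GPU
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Only the low-rank adapter weights are trained; the base model stays frozen,
# which keeps memory and compute within a consumer-GPU budget.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

data = load_dataset("tatsu-lab/alpaca", split="train[:1000]")   # placeholder instruction data
def tokenize(example):
    text = f"Instruction: {example['instruction']}\nResponse: {example['output']}"
    return tokenizer(text, truncation=True, max_length=512)
data = data.map(tokenize, remove_columns=data.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=2,
                           num_train_epochs=1, logging_steps=50),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

Full-parameter fine-tuning or DeepSpeed-based training follows the same pattern, just with heavier memory management.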
4. Common Point 3: Open Source Large Model Datasets Emphasize Human Instructions and Are Increasingly Self-Built
"The size of the data set" is also positively related to the "computing power required by the model".
4.1 Learning ChatGPT's methodology: introducing human instruction datasets
**Fine-tuning is a shortcut to improving performance on specific tasks.** Fine-tuning means further small-scale training of a pre-trained model on a task-specific dataset with labeled data. It adapts the model to task-specific data and scenarios at a small computing cost, improving the model's performance and accuracy. At present, most fine-tuning is instruction fine-tuning, and instruction datasets have gradually become standard equipment for open source large models.
RLHF (Reinforcement Learning from Human Feedback) is an emerging fine-tuning method that uses reinforcement learning techniques to train language models, adjusting the model's output based on human feedback. RLHF is something that GPT-3, ChatGPT's predecessor, did not have; it allows InstructGPT, with only 1.3 billion parameters, to be rated by annotators as more truthful, more harmless, and better at following human instructions than the 175-billion-parameter GPT-3, without compromising GPT-3's scores on academic evaluation dimensions.
RLHF is divided into three steps: 1) supervised fine-tuning (SFT): annotators write answers to human questions, and this labeled data is used to train GPT; 2) reward model (RM) training: annotators rank the machine's answers; compared with the generative labeling in step one, where annotators write the answers themselves, ranking is a cheaper discriminative label, and it is used to train a model that imitates human ranking; 3) fine-tuning with the proximal policy optimization (PPO) algorithm, with no human annotation required.
The datasets for these three steps contain roughly 13,000, 33,000, and 31,000 samples, respectively; a schematic sketch of the three training signals follows below.
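For readers who prefer code, the sketch below writes out the three training signals (SFT cross-entropy, pairwise reward-model ranking, clipped PPO objective) on toy tensors. It is a schematic of the losses involved, not OpenAI's implementation, and every tensor here is a random stand-in.

```python
# Schematic of the three RLHF steps described above, on toy tensors rather than a real model.
import torch
import torch.nn.functional as F

# Step 1 - SFT: standard next-token cross-entropy on annotator-written answers.
vocab, seq = 50, 8
logits = torch.randn(seq, vocab, requires_grad=True)       # stand-in for model outputs
target = torch.randint(0, vocab, (seq,))                   # annotator-written answer tokens
sft_loss = F.cross_entropy(logits, target)

# Step 2 - Reward model: pairwise ranking loss. The annotator only says which of two
# answers is better, which is cheaper than writing an answer from scratch.
reward_chosen = torch.randn(4, requires_grad=True)         # RM scores for preferred answers
reward_rejected = torch.randn(4, requires_grad=True)       # RM scores for dispreferred answers
rm_loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Step 3 - PPO: clipped policy-gradient objective driven by the reward model's score,
# with no further human labels.
ratio = torch.exp(torch.randn(4, requires_grad=True))      # exp(new_logprob - old_logprob)
advantage = torch.randn(4)                                  # RM reward minus a baseline
ppo_loss = -torch.min(ratio * advantage,
                      torch.clamp(ratio, 0.8, 1.2) * advantage).mean()

print(f"SFT loss {sft_loss.item():.3f} | RM ranking loss {rm_loss.item():.3f} | "
      f"PPO loss {ppo_loss.item():.3f}")
```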
For companies with large amounts of data and some computing power, fine-tuning with their own data can give a model specialized capabilities, achieving an effect close to that of a large model with far less compute. For example, Vicuna, a language model jointly developed by several universities, fine-tuned Meta's 13-billion-parameter LLaMA model on 70,000 user-shared ChatGPT conversations and achieved 92% of GPT-4's performance on some tasks. It cannot exceed super-large models in generality and stability, but fine-tuning can strengthen specific capabilities at better cost performance, making it more suitable for small and medium-sized companies.
4.2 Datasets moving toward commercial use
Datasets are an important foundation and support for the development of language models, and are usually collected, organized, or purchased outright by companies or organizations. By contrast, open source datasets are mostly maintained jointly by the community or academia; their volume and variety are richer, but they may have data quality problems and differences in applicability.
4.2.1 A small number of pre-training datasets are commercially available
**Open sourcing pre-training datasets is very important for the commercial use of models.** In the post-LLaMA era, open source large models sprang up like mushrooms after rain, but it soon became clear that, because of LLaMA's and OpenAI's restrictions, the models built on them were not commercially available (Alpaca, Koala, GPT4All, Vicuna). To break this deadlock, Dolly 2.0 took the lead: "To solve this problem, we began looking for ways to create a new, uncontaminated dataset for commercial use." It was followed by RedPajama and MOSS.
4.2.2 Some instruction datasets are commercially available
**Building an open source ecosystem where each takes what it needs.** In early open source projects, the instruction data mostly came from ChatGPT generations or conversations, so it was not commercially available because of OpenAI's restrictions. Beyond fine-tuning for research purposes, more and more models now choose to build their own instruction datasets to get around this limitation.
**Instruction datasets are diversifying, and the instruction datasets of some models are commercially available.** Based on the classification of this batch of large models above, except for LLaMA, the models derived from LLaMA, and StableLM, which use OpenAI instruction datasets, the instruction datasets of the other large models are not based on OpenAI. The commercial availability of their instruction datasets will therefore accelerate the evolution of such large models under the RLHF (reinforcement learning from human feedback) training paradigm.
5. Outlook
We have noticed that open source large models are converging on similar directions.
5.1 Multimodality: Boosting the Development of General Artificial Intelligence (AGI)
**Multimodal open source large models have begun to appear, pushing large models to a new climax and helping humanity move toward artificial general intelligence. Multimodality means integrating modalities such as images, sound, and text. Multimodal models are built on machine learning techniques that can process and analyze multiple input types, making large models more versatile. Building unified, cross-scenario, multi-task models on top of multi-domain knowledge pushes humanity toward the era of artificial general intelligence (AGI).**
5.1.1 ImageBind debuts, linking six modalities through images
**The open source ImageBind model can go beyond a single sensory experience, giving machines the ability to "associate".** On May 9, Meta announced the open source multimodal large model ImageBind. The model takes images as its core and links six modalities: image (picture/video), temperature (infrared image), text, audio, depth information (3D), and motion capture sensors (IMU). The source code is hosted on GitHub. The team says that modalities such as touch, smell, and brain MRI signals will be added in the future.
Technically, ImageBind leverages web data (e.g., images and text) and combines it with naturally occurring paired data (e.g., audio, depth information) to learn a single joint embedding space, so that text embeddings are implicitly aligned with the other modalities, enabling zero-shot recognition on these modalities without explicit semantic or textual pairing.
Typical ImageBind use cases today include: feed the model the sound of a dog barking and it outputs a picture of a dog, and vice versa; feed it a picture of a bird plus the sound of ocean waves and it outputs a picture of a bird on a beach, and vice versa.
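A hedged sketch of the mechanism behind such use cases: once every modality is encoded into one shared embedding space, cross-modal matching reduces to nearest-neighbour search. The encoders below are random stand-ins, not ImageBind's actual networks.

```python
# Illustration of retrieval in a shared embedding space, the mechanism ImageBind relies on.
# The "encoders" here are random stand-ins; the point is that once every modality maps
# into one space, cosine-similarity search gives zero-shot cross-modal matching.
import torch
import torch.nn.functional as F

dim = 512
torch.manual_seed(0)

def fake_encoder(n_items: int) -> torch.Tensor:
    """Stand-in for a modality encoder (audio / image / text); returns unit-norm embeddings."""
    return F.normalize(torch.randn(n_items, dim), dim=-1)

audio_embeddings = fake_encoder(5)    # e.g. five audio clips (dog bark, waves, ...)
image_embeddings = fake_encoder(100)  # e.g. a gallery of candidate images

# Cross-modal retrieval: for each audio clip, pick the image whose embedding is closest
# by cosine similarity (dot product of unit vectors).
similarity = audio_embeddings @ image_embeddings.T          # shape (5, 100)
best_match = similarity.argmax(dim=-1)
print("best-matching image index per audio clip:", best_match.tolist())
```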
5.1.2 Multimodal exploration of open source large models focuses on images, but progress is rapid
At present, multimodal exploration by open source large models is still in its infancy. Apart from ImageBind, which links six modalities, most are still exploring the fusion of text and images, though they are moving quickly. We list some of them below.
VisualGLM-6B: Deployable locally on consumer graphics cards
UniDiffuser: a probabilistic modeling framework designed for multimodality
LLaVA: The performance of some instructions is comparable to GPT-4
MiniGPT-4: A multi-modal open source large model born out of LLaMA, the GPT-4 "replacement" for individual users
mPLUG-Owl: Modular Multimodal Large Model
5.2 Specialization: downstream ecosystems fine-tune models for specific tasks
The open sourcing of large models gives the downstream ecosystem an excellent opportunity to grow. As sub-industries develop, large models are being further developed for specific tasks and are changing daily life. Since the release of the open source large model LLaMA, specialized downstream models fine-tuned from the LLaMA pre-trained model have begun to emerge, such as Huatuo in the field of medical consultation.
Huatuo may be a paradigm for future task-specific models downstream of open source large models: take a small, low-parameter open source model as the base, and train it on data from a specific professional field to obtain a better-performing domain model.
6. Investment Advice
The development of open source large models has far-reaching impacts. This report selects some directions that may benefit and brings them to the market's attention.
6.1 Microsoft: In-depth cooperation with OpenAI
We believe that in the short term the ChatGPT system is still the most capable large model, and Microsoft will benefit from its deep cooperation with OpenAI in both equity and products.
6.2 Nvidia: Open-source large models drive the popularity of applications, and the demand for computing power is soaring
Computing power services are a direction that benefits strongly and with high certainty from the wave of open source large models. Nvidia has a clear lead in hardware-software integration and is the current leader in AI computing power.
6.2.1 The demand for computing power of super-large models will maintain high growth
Super-large models have outstanding quality advantages, the market will keep pursuing them, and their demand for computing power will keep growing. With strong expressiveness and high accuracy, super-large models hold a quality advantage; as their scale, datasets, and daily active usage keep expanding, the computing power they need will keep increasing.
6.2.2 The rapid catch-up of open source large models will also benefit computing power
In the short term, the market will take a wait-and-see attitude toward open source large models. They are currently less versatile and cannot yet compete with super-large models, and it is difficult to evaluate their performance systematically, so the market is waiting for them to prove their performance and advantages.
**In the medium and long term, open source large models are expected to further improve their performance and take a larger share of the market.** Compared with super-large models, open source large models need less computing power and are easier to deploy, and they can be optimized for specific professional fields through quick fine-tuning, making them attractive and practical. In the medium to long term, if an open source large model can approach or surpass ChatGPT in quality, market demand for such models may rise rapidly, and the corresponding demand for computing power will grow quickly along with it.
6.2.3 Catalyst: Development of Open Source Large Model Licenses, Standards and Capability Evaluation System
6.3 Meta: Open source "vanguard", benefiting from the open source ecology
Looking back at Android's history, we are optimistic about the Google-like role in the "Google-Android" system: Google, as the developer of the open source Android operating system, uses open source as a tool to stimulate the upstream and downstream of the ecosystem and to increase the exposure of its proprietary services to end customers.
Mapping this onto large models, we believe Meta, which open sourced LLaMA, may deepen cooperation with downstream model developers through LLaMA and sell the proprietary products in its own ecosystem to those customers.
6.4 Other
6.4.1 Edge Computing Power + Open Source Model: Landing Accelerator for AI Applications
Edge computing power places inference on users' own devices, which not only improves the speed and efficiency of data processing and thereby reduces inference costs, but also protects users' privacy and security.
6.4.2 Big data companies: Optimistic about the combination of "open source large model + self-owned massive data"
For enterprises that "have a lot of data but insufficient computing power", using their own data to fully pre-train and fine-tune open-source commercial models is more cost-effective. This can improve the accuracy and applicability of the model, and can also greatly reduce the model training time and cost. In addition, the fine-tuned model can better meet the specific needs and business scenarios of the enterprise, thereby enhancing the competitiveness and innovation capabilities of the enterprise. With the continuous development and popularization of technology, independent fine-tuning models have become an important means for enterprises to use their own data to quickly realize intelligent applications.
6.4.3 Open Source Large Model Service Provider: Service First
Looking back at Red Hat's history, we believe that even in the open source era of large models, 24x7 customer-facing services remain essential, especially for enterprises. We are optimistic about open source large model service providers.
6.4.4 Apple: Get ChatGPT App Revenue Share
ChatGPT is listed on the App Store, and according to App Store practice, Apple will take a share of its revenue.