The Open-Source Race: The "Linux Moment" for Large AI Models Has Arrived

Looking at the future of large AI models through the development history of Linux, the open-source benchmark.

Written by: Song Jiaji, Sun Shuang

Shortly after the release of ChatGPT, Meta open-sourced the GPT-like large language model LLaMA. Since then, large models such as Alpaca, Vicuna, and Koala have been born, achieving impressive performance at model sizes and costs far below ChatGPT's. This has led some industry insiders to worry that "neither Google nor OpenAI has a moat; the threshold for large models is being broken by open source, and they will be replaced if they do not cooperate." The capital market is also watching the future competitive landscape of large models. If models are small, is massive computing power no longer needed? What role does data play? This report analyzes the common traits of this wave of open-source large language models, reviews the development history of the open-source benchmark Linux, and answers these questions.

**Common point 1: They start from open source.** Open source ≠ free. Open-source business models include at least: 1. monetization through services — one example is Red Hat, the Linux enterprise-services company that was once listed and later acquired by IBM, because enterprises are willing to pay for more stable and timely technical support; 2. monetization through licensing fees — Android is open source, yet Google charges EU manufacturers licensing fees for using the Google suite on Android. In addition, the development of licenses, standards, and capability-evaluation systems is a catalyst for deepening the commercialization of open-source large models. The licenses adopted by this wave of open-source large models are mainly Apache 2.0 and MIT; they do not prohibit commercial use, nor do they prohibit users from modifying the model and then closing the source, which helps companies apply such large models.

**Common point 2: Fewer parameters, smaller models.** Compared with the GPT-3+ super-large models with more than 100 billion parameters, the parameters of this wave of open-source large models are generally on the order of one billion to tens of billions. There is as yet no systematic evaluation framework for large-model performance, and only some tasks have highly credible scoring standards. Among the open-source large models, Vicuna is comparatively capable, reaching 92% of GPT-4's quality on some tasks. Broadly speaking, OpenAI's GPT family is still the best, but its training cost is high and hard to reproduce. Open-source large models achieve low training cost and strong performance with the help of larger token training datasets, DeepSpeed, RLHF, and the like, and the barriers to large models below the super-large tier are disappearing.

**Common point 3: Datasets emphasize human instructions and are moving toward commercial usability.** An important factor behind ChatGPT's substantial improvement over GPT-3 is RLHF (reinforcement learning from human feedback): during training, human-written answers and human rankings of AI-generated content are used to "align" the AI with human preferences. LLaMA does not use instruction fine-tuning, but many post-LLaMA models use and open-source instruction datasets, and are gradually building their own instruction datasets instead of relying on OpenAI's commercially restricted data, which further lowers the threshold for reproducing GPT and broadens commercial usability.

**How to view open-source large models going forward?** Standing in this wave of open-source large models, we note two trends: 1) integration with multimodality — Tsinghua University's VisualGLM-6B is a multimodal upgrade of the well-known open-source language model ChatGLM, and we believe its ability to be deployed locally on consumer-grade graphics cards points to the general direction; 2) open-source models plus edge computing power will accelerate the commercialization of AI — Harbin Institute of Technology's Chinese medical-consultation model "HuaTuo" and applications in cross-border e-commerce are examples.

Investment suggestion: We believe views on large models should be layered by time horizon. 1. In the short term, OpenAI's GPT-family super-large models still outperform other large and open-source models, so attention should focus on Microsoft, which cooperates deeply with OpenAI in both equity and products; Apple, which takes a share of ChatGPT iOS app revenue; and the computing-power providers behind super-large models, such as Nvidia. 2. In the medium to long term, if the capabilities of some open-source large models are further validated, applications will roll out quickly and large models will form a positive cycle with computing power. 3. Others: edge computing power, big-data companies, and open-source large-model service business models also deserve attention. Suggested attention: 1) optical-module providers: Zhongji InnoLight, Xinyisheng, Tianfu Communication, Yuanjie Technology; 2) smart-module providers: MeiG Smart, Fibocom; 3) edge-IDC providers: Longyu shares, Wangsu Technology; 4) AIoT communication chip and equipment makers: ZTE, Tsinghua Unigroup, Ruijie Networks, Feiling Kesi, Fii, Aojie Technology, Chuling Information; 5) application-side names: Yingying Network, Shenzhou Taiyue, Jiaxun Feihong, Zhongke Jincai, etc.

**Risk reminder: ethical risk, market competition risk, policy and legal supervision risk.**

1. Introduction

A report sparked intense public interest in open source large language models.

1.1 "Neither Google nor OpenAI has a moat, and the threshold of large models is being broken by open source"

** "Unless Google and OpenAI change their attitudes and choose to cooperate with the open source community, they will be replaced by the latter", **According to Bloomberg and SemiAnalysis reports, in early April, Google engineer Luke Sernau stated that in the artificial intelligence big language model ( Large Language Models, LLM, hereinafter referred to as "large model") track, Google and OpenAI, the launcher of ChatGPT, have no moat, and the open source community is winning the race.

This argument pushed public attention to the phenomenon of "a flood of large models appearing after Meta open-sourced LLaMA at the beginning of the year" to a climax. Among the three key elements — model, computing power, and data — what will the future competitive landscape of large models look like? If models are small, is massive computing power no longer needed? What role does data play? This report analyzes the common traits of this wave of open-source large models, reviews the development history of the open-source benchmark Linux, answers the questions above, and looks ahead to the future of large models.

1.2 Open-source large models are appearing in quick succession — enough to call it a trend

On February 24, Meta released the LLaMA open source large model. Since then, a number of large models have emerged in the market, which can be roughly divided into three categories.

1.2.1 "LLaMA series": good performance, but low degree of commercialization

**LLaMA comes in four parameter sizes (7 billion/13 billion/33 billion/65 billion), is not licensed for commercial use, its derivatives' instruction datasets are based on OpenAI, and its performance can match or exceed GPT-3.** The 7-billion and 13-billion-parameter versions are pre-trained on 1 trillion tokens; the 33-billion and 65-billion-parameter versions are pre-trained on 1.4 trillion tokens. Compared with GPT-3, the 7-billion-parameter LLaMA performs at the same level on common-sense reasoning, zero-shot tasks, natural questions, and reading comprehension, while the 13-billion-parameter and larger versions outperform GPT-3 in these areas.

The LLaMA model itself does not use an instruction dataset, but given that ChatGPT, which outperforms GPT-3, uses a human-instruction dataset, a batch of open-source large models built on LLaMA use OpenAI's instruction data to improve performance, including Alpaca, GPT4All, Vicuna, Koala, Open Assistant, and HuggingChat. Because the OpenAI instruction data is not licensed for commercial use, these LLaMA-based open-source models are not commercially usable either.

1.2.2 Dolly 2.0, RedPajama, StableLM, etc.: high degree of commercial usability

These large models do not use OpenAI's instruction dataset, so they are commercially usable, but most are still under active development.

1.2.3 The Chinese "twin stars": ChatGLM-6B and MOSS

ChatGLM-6B and MOSS were launched by research groups at Tsinghua University and Fudan University respectively, and are well known in the Chinese community.

The models also share some commonalities, which the report details below.

2. Common point 1: Starting from open source

**In this wave, whether it is the models themselves or the datasets they use, the first thing they have in common is that they are open source.**

**2.1 Why open source?**

The market's key question about open-source large models is why they are open-sourced at all, and whether this damages the business model of the large-model industry. We reviewed the reasons several large models give for open-sourcing and summarize them below.

2.1.1 Model perspective: prevent monopoly by large companies and break restrictions on commercial use

The stated goals are to democratize artificial-intelligence research, bridge the quality gap between open and closed models, and remove restrictions on commercial use; the vigorous development of open-source large models is expected to advance all three.

2.1.2 Data Perspective: Protect corporate secrets and make customized data training possible

**Protect data privacy and allow enterprises to customize development.** For many industries, data is the lifeblood of the enterprise. Open-sourcing large models lets enterprises train their own datasets on a large model while keeping control of the data and protecting data privacy. At the same time, open-source large models let enterprise developers customize on top of the model, target the training data, and filter certain topics, reducing model size and data-training cost.

2.1.3 Computing-power perspective: reduce computing-power cost and make large models "inclusive"

**Open-source large models save computing power in the training phase, reduce computing-power costs for enterprises, and promote "inclusive" use of large models.** Total computing-power demand = number of scenarios × computing-power demand per scenario. In training and using large models, computing-power consumption falls into two scenarios: training cost and inference cost.

  • Training cost: training a large model is expensive and beyond the computing resources of ordinary enterprises; open-source large models mainly save enterprises the computing power of the pre-training stage. However, as training scenarios across verticals become richer, overall training demand is still growing.
  • Inference cost: when parameters are huge, inference is also expensive, and ordinary companies struggle to sustain the daily expense. Shrinking model parameters therefore further reduces enterprises' inference cost when using the model.

**2.2 What kind of soil does open source need?**

**The flourishing of open-source large models is not without precedent: Linux, the world's largest open-source software project, has a similar story.** Studying the development history of Linux is a useful reference for anticipating the future of open-source large models.

2.2.1 Let’s start from the open source benchmark Linux

**Linux is a free, open-source operating system released under the GNU General Public License (GPL).** Anyone can run, study, share, and modify the software. Modified code can also be redistributed and even sold, but only under the same license. Traditional operating systems such as Unix and Windows are proprietary, vendor-locked, delivered as-is, and cannot be modified.

Many of the world's largest industries and businesses rely on Linux. It is now everywhere — from knowledge-sharing sites like Wikipedia, to the New York Stock Exchange, to mobile devices running Android, a dedicated distribution of the Linux kernel that includes free software. Today Linux is not only the most common operating system on public Internet servers, but also the only operating system used on the world's 500 fastest supercomputers.

**In the server market, Linux's share has long since overtaken that of Unix, the "grandfather" of operating systems — the "Linux moment" happened.** Taking the Chinese market as an example, according to CCID Consulting data on installed base, Linux is the mainstream server operating system and holds an absolutely leading 79.1% share; Windows has fallen to 20.1%, and Unix holds only 0.8%.

2.2.2 Linux was not built single-handedly: it leveraged the open-source history behind the community

Unix was once open source, providing the spark for Linux

**Unix: the originator of the modern operating system.** An operating system is the software that directly manages system hardware and resources (such as CPU, memory, and storage). It sits between applications and hardware and is responsible for connecting all software to the relevant physical resources. Unix is widely considered the ancestor of modern operating systems.

**Unix was once open source.** The world's first general-purpose computer was born in 1946, and Unix was developed in 1969. For a decade, AT&T, the owner of Unix, licensed the Unix source code to academic institutions for research or teaching at low or even no cost, and many institutions extended and improved this source code, forming the so-called "Unix variants". Later, AT&T realized Unix's commercial value, stopped licensing the source code to academia, and asserted copyright over the earlier Unix and its variants.

Unix became too expensive after going closed-source again, which spurred the development of Linux

Linux was designed and released by Linus Torvalds in 1991. Still in college at the time, he felt that Unix, the popular commercial operating system, was too expensive, so he developed Linux on the model of the Unix-like teaching system Minix and opened it to people who, like him, could not afford Unix.

Minix, built for teaching only, inspired the development of Linux

After AT&T privatized the source code, Andrew Tanenbaum, a professor at Vrije Universiteit Amsterdam, decided to develop a Unix-compatible operating system without using any AT&T source code, so that he could teach students the practical details of how operating systems work while avoiding copyright disputes. He called it MINIX, meaning mini-Unix. The first version of MINIX was released in 1987, and anyone could use it simply by purchasing its disks. Before Linux had a native file system of its own, it used the Minix file system.

Open source community, license and standard support

**Open source from the start.** In August 1991, Linux founder Linus Torvalds posted Linux to the Minix Usenet newsgroup, and then released it on an FTP site because he wanted more people to develop the kernel together.

**The license helped the ecosystem flourish.** Linux is licensed under the GNU GPL (GNU General Public License, where GNU stands for "GNU's Not Unix"). The GPL grants users the four freedoms of "free software", also known as "copyleft":

  • Freedom 0: the freedom to "use" the software for any purpose.
  • Freedom 1: the freedom to "study how the software works" and to "modify" it to meet the user's own needs; access to the source code is a precondition for this freedom.
  • Freedom 2: the freedom to "distribute copies of the software", so everyone can help their neighbors by sharing free software.
  • Freedom 3: the freedom to "distribute modified versions" so that the whole community benefits; access to the source code is a precondition for this freedom.

The GPL requires that derivative works of a GPL program also be licensed under the GPL. In contrast, BSD-style licenses do not prohibit derivative works from being turned into proprietary software. The GPL is the most popular license for free and open-source software, and compliance with it has let the Linux ecosystem keep thriving rather than run into a dead end where it can no longer develop.

**Standards keep the ecosystem "scattered in form but united in spirit" internally, and embrace the "giant whale" externally.**

  • **Unified standards internally.** Linux formulated the LSB (Linux Standard Base) to standardize development and keep the results of different teams from diverging too much. As a result, the various Linux derivatives differ only in things such as package-management tools and modes. We believe this keeps the development of the Linux open-source community "scattered in form but united in spirit", so that the ecosystem does not fall apart.
  • **Compatibility with Unix externally.** To make Linux compatible with Unix software, Linus Torvalds modified Linux with reference to the POSIX (Portable Operating System Interface) standard, which greatly increased Linux adoption. The standard was developed by the IEEE (Institute of Electrical and Electronics Engineers) in the 1990s, exactly Linux's early days, and the portability it brought provided a favorable environment for Linux's spread.

**2.3 Open source: how does it make money?**

The market's core question about "open source" is the business model. "Open source" itself is free, but open source is the soil: the open-source community has bred a variety of business models, and the Linux ecosystem offers lessons.

2.3.1 Red Hat: Service First

Red Hat is a leader in the Linux ecosystem: more than 90% of Fortune 500 companies trust Red Hat, and it has enormous commercial value as a company. Red Hat was founded in 1993 and listed on Nasdaq in 1999. According to its prospectus, citing IDC data, 56% of all licensed new installations of the Linux operating system in 1998 came from Red Hat. In 2012, Red Hat became the first open-source technology company to exceed $1 billion in revenue; in 2019, IBM acquired Red Hat for approximately $34 billion.

As for the business model of Linux and Red Hat, the analogy from Curiosity Daily is apt: the open-source Linux kernel is like a free, open recipe, and Red Hat is like a restaurant — people are still willing to go to the restaurant to eat professionally prepared dishes and enjoy attentive service. Red Hat provides enterprises with Linux operating systems and subscription services, mainly including: 1. 24x7 technical support; 2. cooperation with upstream communities and hardware vendors to support a wide range of hardware architectures such as x86, ARM, and IBM Power; 3. continuous vulnerability alerts, targeted guidance, and automated remediation; 4. deployment across multiple clouds; 5. security capabilities such as live kernel patching and security-standard certification; 6. detecting performance anomalies, building a comprehensive view of system performance, and applying preset tuning profiles.

2.3.2 Android: backed by Google, monetized through advertising

According to Statcounter, as of April 2023 Android is the world's leading mobile operating system with a 69% share, far ahead of second-place iOS at 31%. Android is built on the Linux kernel and was acquired by Google in 2005. Google subsequently released Android's source code under the permissive Apache license, enabling manufacturers to launch Android smartphones quickly, which accelerated Android's rise.

As for the business model, many services pre-installed on Android phones are Google's proprietary products, such as Maps, the Google Play app store, Search, and Gmail. So although Android is free and open source, Google can still use it to capture the mobile market and monetize user traffic.

Google also charges licensing fees directly to Android handset makers. Since October 29, 2018, EU manufacturers of Android-based phones and tablets must pay Google a licensing fee for each device — up to $40 per device.

2.4 The mainstream license of the open source large model supports commercial use

The open-source community already has well-known licenses such as GPL, BSD, and Apache. On the large-model side, we note that LLaMA — released in February 2023 and the trigger of this wave of open-source large models — prohibits commercial use and is for research only; Meta AI grants model access case by case to government officials, members of civil-society organizations, academics, and industry research laboratories. LLaMA's inference code is under the GPL 3.0 license, which means: 1) others who modify the inference code cannot close the source; 2) new code must also adopt the GPL. However, we note that some developers have built LLaMA variants under other license types; for example, Lit-LLaMA, an implementation based on nanoGPT, adds some model weights of its own, and that part of the model is licensed under Apache 2.0.

**The licenses adopted by open-source large models are mainly Apache 2.0 and MIT.** Alpaca, Vicuna, Dolly, OpenAssistant, and MOSS use Apache 2.0; Koala and GPT4All use MIT. Both licenses allow commercial use. Unfortunately, Alpaca, Vicuna, Koala, and GPT4All remain commercially unusable because of OpenAI or LLaMA restrictions. It is also worth noting that both Apache 2.0 and MIT allow modified source code to be closed, so a company can develop its own model on top of an open-source model — which makes these licenses all the more attractive to companies.

3. Common point 2: Open-source large models have fewer parameters and are miniaturized

"The size of model parameters" is positively related to "the model's demand for computing power".

**3.1 How big are super-large models and large models?**

**Pre-training endows the model with basic capabilities.** In natural language processing (NLP), pre-training means training a language model on a large text corpus before fine-tuning it for specific tasks, giving it basic language-understanding ability. During pre-training, the model is trained to predict the next word in a sentence from the preceding context. This can be done by masking some of the input words and asking the model to predict them, or by autoregressive methods (such as GPT) in which the next word is predicted from the words before it.
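
To make the pre-training objective concrete, the minimal sketch below shows next-token prediction with a causal language model. It assumes the Hugging Face transformers library and the public GPT-2 checkpoint purely as stand-ins (neither is named in this report's sources); real pre-training runs the same loss over a vast corpus for many steps.

```python
# Minimal next-token-prediction sketch: the loss computed here is the
# objective that autoregressive pre-training minimizes.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Open source large language models are"
inputs = tokenizer(text, return_tensors="pt")

# Passing labels=input_ids asks the model for the next-token cross-entropy loss.
outputs = model(**inputs, labels=inputs["input_ids"])
print(f"next-token loss: {outputs.loss.item():.3f}")

# The same objective drives generation: repeatedly predict the next token.
generated = model.generate(inputs["input_ids"], max_new_tokens=10, do_sample=False)
print(tokenizer.decode(generated[0]))
```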

A pre-trained model is usually characterized by its parameter count and the size of its pre-training data (usually measured in tokens). In 2017, the Transformer model from the Google Brain team completely changed the face of NLP, enabling models to understand and process language better and improving the effectiveness and accuracy of NLP tasks.

**How big are super-large models and large models?** The size of a language model is measured by its parameter count, which mainly describes the adjustable values of the connection strengths between neurons. At present, large language models generally have billions to tens of billions of parameters; those with more than 100 billion parameters are called "super-large models", such as GPT-3 (175 billion parameters).
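
As a concrete illustration of what "parameter count" means, the short PyTorch sketch below counts the trainable parameters of a toy network; the same one-liner applied to GPT-3 would return roughly 175 billion. The toy network is only an assumption for illustration.

```python
# "Model size" is simply the number of trainable parameters.
import torch.nn as nn

# A toy two-layer network stands in for a language model here.
toy_model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

n_params = sum(p.numel() for p in toy_model.parameters())
print(f"parameters: {n_params / 1e6:.1f}M")  # ~8.4M for this toy network
```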

3.2 The GPT super-large models are the most capable, but hard to reproduce

**There is no unified standard for evaluating large-model performance.** An important reason is that large models generate many types of content, and different application scenarios and tasks may require different metrics and methods. Some tasks have highly credible scoring standards — such as BLEU for machine translation — but most tasks lack anything comparable.

**The fuzzy consensus is that very large models perform well.** The current trend is for large language models to keep getting bigger, because after pre-training, larger models are more versatile and stable. For example, Google's super-large model PaLM (540 billion parameters) performs well in both zero-shot and few-shot tests, and its performance keeps improving as the number of training tokens grows. This is not hard to understand: simply put, the more a model has seen, the more it knows.

** "Peer Review", the GPT-based large model "Peerless Beauty". **Currently, the super-large model of the OpenAI GPT system has powerful capabilities and a wide range of applications. It has high accuracy and strong expressiveness when dealing with natural language tasks. It is used in many fields such as text generation, question answering systems, and machine translation. They have all achieved excellent results and have become one of the current benchmarks in the field of natural language processing, and are used as comparison benchmarks by various large models. The threshold for reproducing ChatGPT has not been lowered. Most of the large open source models only perform better in some aspects, and the overall quality is still incomparable with ChatGPT. It remains to be seen.

Recently we have also noted the following evaluation systems. Evaluation methods mainly include automatic machine evaluation (for example, using GPT-4) and blind human evaluation. We highlight some of them and their results below; under any of these systems, the GPT-family models come out on top.

  • Overseas:
    • UC Berkeley's Chatbot Arena borrows the ranked-match mechanism from games and lets humans blindly compare models in pairs;
    • the open-source toolkit Zeno Build evaluates multiple large models via Hugging Face or online APIs, using Critique.
  • Domestic (China):
    • SuperCLUE is a comprehensive evaluation benchmark for general-purpose Chinese large models that attempts to evaluate them automatically;
    • C-Eval uses 14,000 multiple-choice questions covering 52 subjects to evaluate a model's Chinese ability. Such standards still need time and market testing.

3.2.1 Vicuna: Evaluation with GPT-4

**At present, the performance of most open-source large models has not been systematically evaluated; most are at an early, experimental stage.** Among the open-source models that have been evaluated, the GPT-4-based evaluation in Vicuna's report is relatively systematic and its results the most striking.
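
The sketch below illustrates the general "LLM as judge" approach that such GPT-4 evaluations take: the judge model scores two assistants' answers to the same question. It is not Vicuna's actual evaluation script; the prompt wording and score format are illustrative assumptions, and it uses the openai Python SDK's v0.x ChatCompletion interface with OPENAI_API_KEY set in the environment.

```python
# Hedged "GPT-4 as judge" pairwise-evaluation sketch.
import openai

def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask GPT-4 to score two assistants' answers to the same question."""
    prompt = (
        f"Question: {question}\n\n"
        f"Assistant A: {answer_a}\n\n"
        f"Assistant B: {answer_b}\n\n"
        "Rate each assistant's helpfulness and accuracy on a 1-10 scale, "
        "then explain briefly. Reply as: 'A: <score>, B: <score> - <reason>'."
    )
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judging
    )
    return resp["choices"][0]["message"]["content"]

# Example usage (the answers would come from the two models being compared):
# print(judge("Explain GPL vs. MIT licensing.", vicuna_answer, chatgpt_answer))
```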

3.2.2 Zeno Build Evaluation: Newer and more comprehensive

Zeno Build evaluated seven models — GPT-2, LLaMA, Alpaca, Vicuna, MPT-Chat, Cohere Command, and ChatGPT (gpt-3.5-turbo) — and the results resembled the GPT-4 evaluation: ChatGPT has a clear advantage, and Vicuna performs best among the open-source models.

3.2.3 C-Eval: a comprehensive Chinese foundation-model evaluation suite

The C-Eval results show that even in terms of Chinese ability, GPT-4 remains the best, yet GPT-4 only reaches 67% accuracy; the Chinese-processing ability of today's large models still has considerable room for improvement.

3.2.4 GPT super-large models are costly to train and hard to reproduce in the short term

**ChatGPT requires considerable computing power and training cost.** Setting aside the inference compute, which scales with daily active usage, and considering training alone: according to the paper "Language Models are Few-Shot Learners", GPT-3 (the 175-billion-parameter version, ChatGPT's predecessor) requires about 3,640 PF-days of compute — that is, computing at one quadrillion floating-point operations per second for 3,640 days. Given that a single Nvidia A100 delivers roughly 0.6 PFLOPS, one training run of GPT-3 (175 billion parameters) needs about 6,000 A100s, and roughly tens of thousands once interconnect losses are considered. At about RMB 100,000 per A100, large-scale training requires an investment on the order of RMB 1 billion. OpenAI spent more than $4 million training GPT-3 (175 billion parameters), and keeping ChatGPT and GPT-4 (parameter count undisclosed, expected to be higher) running should, in theory, cost even more each month.
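
The short Python sketch below simply replays this back-of-the-envelope arithmetic with the report's round figures (3,640 PF-days, an assumed 0.6 PFLOPS per A100, an assumed RMB 100,000 card price); it is an order-of-magnitude illustration, not a costing model.

```python
# Back-of-the-envelope GPT-3 training-compute arithmetic (report's figures).
TRAIN_COMPUTE_PF_DAYS = 3640   # compute for one GPT-3 (175B) training run
A100_PFLOPS = 0.6              # assumed effective throughput of one A100
A100_PRICE_CNY = 100_000       # assumed price per A100, in RMB

def a100s_needed(wall_clock_days: float) -> float:
    """How many A100s would finish one training run in the given number of days."""
    return TRAIN_COMPUTE_PF_DAYS / (A100_PFLOPS * wall_clock_days)

print(f"~{a100s_needed(1):,.0f} A100s to train in about one day")     # ~6,067
print(f"~{a100s_needed(30):,.0f} A100s to train in about one month")  # ~202

# With interconnect and utilization losses the report assumes ~10,000 cards,
# i.e. hardware on the order of 10,000 * 100,000 RMB ≈ 1 billion RMB.
print(f"hardware at 10,000 cards: {10_000 * A100_PRICE_CNY / 1e9:.1f} billion RMB")
```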

3.3 Open source large models are cost-effective, and the barriers to large models below super-large models are disappearing

**The miniaturization trend of open-source large models is clear, with parameter counts around the billions to tens of billions; cutting cost is the whole point.** Open-source large models usually have fewer parameters, so design, training, and deployment require relatively little in resources and cost. The parameters of this wave of open-source large models are generally small — on the order of one billion to tens of billions.

"The boat is small and easy to turn around", fine-tuning based on the existing open source pre-training model is also one of the advantages of the open source large model. Fine-tuning and optimizing on the basis of the pre-trained model to adapt to different tasks and application scenarios, this method can not only greatly reduce the training time and cost of the model, but also improve the performance and efficiency of the model.

**With larger token training datasets and new techniques, the barriers to large models below the super-large tier are tending to disappear.** LLaMA's open-sourcing gave everyone a usable large model, and with technologies such as DeepSpeed and RLHF, models with tens of billions of parameters can now be deployed on consumer-grade GPUs.

  • More training tokens may matter more than more parameters. DeepMind's March 29, 2022 study "Training Compute-Optimal Large Language Models" reveals the relationship between model size and training-data size (see the sketch after this list):
    • Large models are often under-trained, wasting a great deal of computing power.
    • Training a smaller model more fully can beat a larger model. For example, DeepMind's Chinchilla has only 70 billion parameters, yet after training on a 1.4-trillion-token dataset it outperforms DeepMind's Gopher (280 billion parameters, 300 billion training tokens) and OpenAI's GPT-3 (175 billion parameters, 300 billion training tokens).
    • For the best model performance, every doubling of the parameter count should be matched by a doubling of the training-token count.
    • Smaller models also mean lower downstream fine-tuning and inference costs.
  • DeepSpeed: can significantly reduce the time and cost of training large models.
  • RLHF (reinforcement learning from human feedback): can improve a model's performance and accuracy with a relatively small amount of training.
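
A minimal sketch of this "compute-optimal" rule of thumb follows. The 20-tokens-per-parameter ratio is a commonly cited simplification of the Chinchilla paper's fits (Chinchilla itself used about 1.4T tokens for 70B parameters), not an exact prescription.

```python
# Chinchilla-style rule of thumb: training tokens scale with parameters.
TOKENS_PER_PARAM = 20  # simplification of the compute-optimal fit

def compute_optimal_tokens(n_params: float) -> float:
    return TOKENS_PER_PARAM * n_params

for n_params in (7e9, 13e9, 70e9, 175e9):
    tokens = compute_optimal_tokens(n_params)
    print(f"{n_params / 1e9:>5.0f}B params -> ~{tokens / 1e12:.2f}T training tokens")

# Doubling the parameter count doubles the recommended token budget, which is
# why a well-trained 70B model (Chinchilla, ~1.4T tokens) can beat an
# under-trained 175B model (GPT-3, ~0.3T tokens).
```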

4. Common point 3: Open-source large models' datasets emphasize human instructions and increasingly stand on their own

"The size of the data set" is also positively related to the "computing power required by the model".

4.1 Learning from ChatGPT's methodology: introducing human-instruction datasets

**Fine-tuning is a shortcut to improving specific capabilities.** Fine-tuning means further, smaller-scale training of a pre-trained model on a labeled, task-specific dataset. It adapts the model to specific data and scenarios at a small computing cost, improving performance and accuracy. At present, fine-tuning is mostly instruction fine-tuning, and instruction datasets have gradually become standard equipment for open-source large models.

RLHF (Reinforcement Learning from Human Feedback) is an emerging fine-tuning method that uses reinforcement-learning techniques to train language models, adjusting the model's output based on human feedback. RLHF is something GPT-3, ChatGPT's predecessor, did not have; it is what lets InstructGPT, with only 1.3 billion parameters, be rated by annotators as more truthful, more harmless, and better at following human instructions than the 175-billion-parameter GPT-3, without degrading GPT-3's scores on academic evaluation dimensions.

RLHF proceeds in three steps: 1) supervised fine-tuning (SFT): annotators write answers to human prompts, and this labeled data is used to train GPT; 2) reward-model (RM) training: annotators rank the machine's answers — compared with the generative labeling of step one, where annotators write the answers themselves, ranking is a discriminative label and cheaper to produce — and these rankings train a model to imitate human preference ordering; 3) fine-tuning with the proximal policy optimization (PPO) algorithm, which requires no human annotation.

The dataset sizes for these three steps are roughly 13,000, 33,000, and 31,000 examples, respectively (a small sketch of the step-two ranking objective follows).
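
The PyTorch sketch below illustrates the step-two objective: the annotator only ranks a preferred ("chosen") answer above a rejected one, and the reward model is trained so that the chosen answer scores higher. It mirrors the pairwise ranking loss used in InstructGPT-style RM training, but the tiny bag-of-embeddings model and random token data are placeholders, not anyone's actual implementation.

```python
# Reward-model training on ranked pairs: -log sigmoid(r_chosen - r_rejected).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    def __init__(self, vocab_size: int = 1000, dim: int = 64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # stand-in for a transformer
        self.score = nn.Linear(dim, 1)                 # scalar reward per answer

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.score(self.embed(token_ids)).squeeze(-1)

rm = TinyRewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)

# Fake tokenized (prompt + answer) pairs; in practice these come from the
# ~33,000 human-ranked comparisons mentioned above.
chosen = torch.randint(0, 1000, (8, 32))    # preferred answers
rejected = torch.randint(0, 1000, (8, 32))  # dispreferred answers

loss = -F.logsigmoid(rm(chosen) - rm(rejected)).mean()
loss.backward()
opt.step()
print(f"reward-model ranking loss: {loss.item():.3f}")
```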

For companies with abundant data and a certain amount of computing power, fine-tuning with their own data brings out a model's specialized capabilities, achieving an effect close to a much larger model with far less compute. For example, Vicuna, a language model developed jointly by several universities, fine-tuned Meta's 13-billion-parameter LLaMA on some 70,000 user-shared ChatGPT conversations and reached 92% of GPT-4's quality on some tasks. It cannot exceed the super-large models in versatility and stability, but fine-tuning can strengthen specific capabilities at a better cost-performance ratio, making it better suited to small and medium-sized companies.

4.2 Datasets towards commercial use

Datasets are an important foundation and support for developing language models, and are usually collected, organized, or purchased outright by companies or organizations. By contrast, open-source datasets are mostly maintained jointly by the community or academia; their volume and variety are richer, but they may suffer from data-quality problems and differences in applicability.

4.2.1 A small number of pre-training datasets are commercially usable

**Open-sourcing pre-training datasets matters greatly for the commercial use of models.** In the post-LLaMA era, open-source large models sprang up like mushrooms after rain, but it soon became clear that, because of LLaMA and OpenAI restrictions, the models built on them (Alpaca, Koala, GPT4All, Vicuna) were not commercially usable. To break this deadlock, Dolly 2.0 took the lead — "To solve this problem, we began to find ways to create a new, uncontaminated dataset for commercial use" — followed by RedPajama and MOSS.

4.2.2 Some instruction datasets are commercially usable

**Build an open-source ecosystem in which everyone takes what they need.** In early open-source projects, the instruction data mostly came from ChatGPT generations or conversations, so under OpenAI's restrictions it could not be used commercially — only for research-purpose fine-tuning. More and more models now choose to build their own instruction datasets to get around this limitation.

**Instruction datasets are diversifying, and some models' instruction datasets are commercially usable.** Going by the classification of this batch of large models above, apart from LLaMA, the models developed from LLaMA, and StableLM — which use OpenAI instruction datasets — the instruction datasets of the other large models are not based on OpenAI. The commercial usability of those instruction datasets will accelerate the evolution of large models trained with the RLHF (reinforcement learning from human feedback) paradigm.

5. Outlook

We note that open-source large models are heading toward a similar intersection.

5.1 Multimodality: Boosting the Development of General Artificial Intelligence (AGI)

**Multimodal open-source large models have begun to appear, pushing large models to a new climax and helping humanity move toward artificial general intelligence.** Multimodality means integrating modes such as images, sound, and text. Multimodal models are built on machine-learning techniques that can process and analyze multiple input types, making large models more versatile. Building unified, cross-scenario, multi-task models on knowledge from many domains pushes humanity toward the era of artificial general intelligence (AGI).

5.1.1 ImageBind debuts, using images to link six modalities

**ImageBind, an open-source large model, goes beyond a single sensory experience and gives machines the ability to "associate".** On May 9, Meta open-sourced the multimodal large model ImageBind. With images at its core, the model links six modalities: images (pictures/video), temperature (infrared images), text, audio, depth information (3D), and motion-capture sensors (IMU). The source code is hosted on GitHub. The team says modalities such as touch, smell, and brain MRI signals will be added in the future.

Technically, ImageBind leverages web data (e.g., image-text pairs) and combines it with naturally occurring paired data (e.g., audio, depth) to learn a single joint embedding space. This implicitly aligns text embeddings with the other modalities, enabling zero-shot recognition on those modalities without explicit semantic or text pairing.

Typical ImageBind use cases today include: feed the model the sound of a dog barking and it outputs a picture of a dog, and vice versa; feed it a picture of a bird plus the sound of ocean waves and it outputs a picture of a bird on a beach, and vice versa.
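
The sketch below shows, in the abstract, how a joint embedding space enables this kind of cross-modal "association": every modality is mapped into the same vector space, and retrieval is nearest-neighbor search by cosine similarity. The random linear projections are placeholders standing in for ImageBind's real encoders.

```python
# Cross-modal retrieval in a shared embedding space (toy stand-in encoders).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
DIM = 512
audio_encoder = torch.nn.Linear(128, DIM)   # placeholder for an audio encoder
image_encoder = torch.nn.Linear(2048, DIM)  # placeholder for an image encoder

def embed(encoder, x):
    return F.normalize(encoder(x), dim=-1)  # unit-norm joint-space embeddings

# One query (e.g. a dog bark) against a small gallery of candidate images.
audio_query = embed(audio_encoder, torch.randn(1, 128))
image_gallery = embed(image_encoder, torch.randn(5, 2048))

# Cosine similarity = dot product of normalized vectors; the highest wins.
scores = audio_query @ image_gallery.T
best = scores.argmax(dim=-1).item()
print(f"retrieved image index: {best}, similarities: {scores.squeeze().tolist()}")
```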

5.1.2 Multimodal exploration among open-source large models centers on images, but progress is rapid

At present, multimodal exploration among open-source large models is still in its infancy. Apart from ImageBind, which links six modalities, most projects are still exploring the fusion of text and images — but the pace is fast. We summarize several of them below.

VisualGLM-6B: Deployable locally on consumer graphics cards

  • Team: VisualGLM-6B is a multimodal upgrade of the open-source large language model ChatGLM-6B. It supports images, Chinese, and English, and was released by the Knowledge Engineering and Data Mining group at Tsinghua University.
  • Technology: VisualGLM-6B combines the language model ChatGLM-6B with the image model BLIP2-Qformer; the combined model has 7.8 billion parameters (6.2 billion + 1.6 billion). Pre-training uses 30 million high-quality Chinese image-text pairs and 300 million English image-text pairs from the CogView dataset. In the fine-tuning stage, the model is trained on long visual question-answering data to generate answers aligned with human preferences.
  • Performance: According to DataLearner, VisualGLM-6B integrates model quantization, so users can deploy it locally on consumer-grade graphics cards — the INT4 quantization level needs only 8.7 GB of video memory. Even users with gaming laptops can therefore deploy the model quickly and privately, a first for a ChatGPT-like model of this size (a minimal deployment sketch follows below).
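
The sketch below shows what such a local deployment might look like. The model id, the quantize() helper, and the chat() method are assumptions drawn from the usage pattern published for the ChatGLM/VisualGLM family and are not verified here; the project README is the authoritative reference.

```python
# Hedged local-deployment sketch for an INT4-quantized VisualGLM-6B.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
model = (
    AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
    .quantize(4)   # INT4 quantization: ~8.7 GB of VRAM per the report
    .half()
    .cuda()
    .eval()
)

# Ask a question about a local image; chat() returns the answer plus history.
response, history = model.chat(tokenizer, "photo.jpg", "Describe this image.", history=[])
print(response)
```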

UniDiffuser: a probabilistic modeling framework designed for multimodality

  • Team: The TSAIL team led by Professor Zhu Jun of Tsinghua University's Department of Computer Science published the paper "One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale" on March 12, an exploration of multimodality.
  • Technology: UniDiffuser adopts U-ViT, the Transformer-based network architecture proposed by the team, and trains a one-billion-parameter model on the open-source large-scale image-text dataset LAION-5B (about five billion pairs), enabling it to complete a variety of generation tasks with high quality.
  • Function: Simply put, besides one-directional text-to-image generation, the model can also do image-to-text, joint image-text generation, unconditional image and text generation, and image-text rewriting, realizing conversion between arbitrary modalities.

LLaVA: The performance of some instructions is comparable to GPT-4

  • Team: LLaVA, a joint effort of the University of Wisconsin-Madison, Microsoft Research, and Columbia University, has open-sourced its code, model, and dataset on GitHub.
  • Technology: LLaVA is an end-to-end multimodal large model that connects a vision encoder with a large language model for general-purpose visual and language understanding.
  • Function:
    • Text-based tasks: LLaVA can process and analyze text, answer users' questions, chat with users, or complete tasks such as document summarization, sentiment analysis, and entity recognition.
    • Image-based tasks: LLaVA can analyze and describe images, perform object recognition, and analyze and understand scenes.
  • Performance: Early experiments show that LLaVA's multimodal chat can sometimes produce output comparable to GPT-4 on unseen images/instructions, and on a synthetic multimodal instruction-following dataset it obtains a relative score of 85.1% versus GPT-4.

MiniGPT-4: A multi-modal open source large model born out of LLaMA, the GPT-4 "replacement" for individual users

  • Team: The release of the multimodal GPT-4 pushed public enthusiasm for large models to a new peak, but GPT-4 is not entirely free for individuals: using it requires either an official invitation or an upgrade to a paid account, and even paying users in some regions cannot purchase the service. Against this backdrop, Deyao Zhu, Jun Chen, and colleagues at King Abdullah University of Science and Technology released MiniGPT-4 on April 23, aiming to combine visual information from a pre-trained vision encoder with an advanced large language model.
  • Technology: Specifically, MiniGPT-4 uses the same pre-trained vision components as BLIP-2 — EVA-CLIP's ViT-G/14 plus a Q-Former — and aligns them with the large language model Vicuna, enabling a variety of complex language tasks.
  • Function: MiniGPT-4 supports many uses: upload a photo of a seafood feast and get a recipe; upload a product rendering and get sales copy; upload a hand-drawn web-page sketch and get HTML code. According to user feedback, MiniGPT-4's overall quality is good, but its Chinese support still needs improvement.

mPLUG-Owl: Modular Multimodal Large Model

  • Team: mPLUG-Owl is the latest work in the mPLUG series from Alibaba DAMO Academy. It continues the series' modular training approach and migrates a large language model into a multimodal large model.
  • Technology: mPLUG-Owl uses CLIP ViT-L/14 as the basic visual module, a LLaMA-initialized structure as the text decoder, and a Perceiver Resampler-like structure (as in Flamingo) to reorganize visual features. In addition, mPLUG-Owl is the first to propose Owl, a comprehensive test set for vision-related instruction evaluation.
  • Function: mPLUG-Owl shows strong multi-turn dialogue, reasoning, and joke-interpretation abilities. The research team also observed unexpected emerging capabilities such as multi-image association, multilinguality, text recognition, and document understanding.
  • Performance: Experiments show that mPLUG-Owl outperforms BLIP-2, LLaVA, and MiniGPT-4 on vision-related instruction-response tasks.

5.2 Specialization: downstream ecosystem momentum, fine-tuning models for specific tasks

The open-sourcing of large models provides an excellent opportunity for a vigorous downstream ecosystem. As segmented industries develop, large models are being further developed for specific tasks and are changing daily life. Since the open-source LLaMA was released, specialized downstream models fine-tuned from the LLaMA pre-trained model have begun to emerge, such as HuaTuo in medical consultation.

  • Team: HuaTuo is a LLaMA model instruction-fine-tuned with Chinese medical knowledge. It performs well in intelligent consultation and can generate fairly reliable, medically grounded answers. Published large language models perform poorly in the biomedical domain for lack of domain-specific corpora. On April 14, a team from Harbin Institute of Technology released HuaTuo, an open-source intelligent-consultation model for the medical field obtained by fine-tuning LLaMA.
  • Technology: LLaMA comes in versions from 7 billion to 65 billion parameters. To train faster and more efficiently and save cost, HuaTuo uses the 7-billion-parameter LLaMA as its base model. To ensure factual accuracy when answering medical questions, the researchers extracted relevant medical knowledge from the Chinese medical knowledge graph CMeKG, generated diverse instruction data, and collected more than 8,000 instruction examples for supervised fine-tuning.

  • Performance: HuaTuo was compared with three baseline models. To evaluate performance, the researchers recruited five physicians with medical backgrounds to rate responses along three dimensions: safety, usability, and smoothness (SUS). The SUS scale runs from 1 (unacceptable) to 3 (good), with 2 indicating an acceptable response. The average SUS scores show that HuaTuo significantly improves knowledge usability without sacrificing too much safety.

HuaTuo may be the paradigm for future task-specific models downstream of open-source large models: take a small, low-parameter open-source large model as the base and train it with data from a specific professional field to obtain a better-performing domain model.
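
A hedged sketch of that recipe follows, using parameter-efficient LoRA fine-tuning with the Hugging Face peft library. The base-model id, target modules, and hyperparameters are illustrative assumptions, not HuaTuo's actual training configuration.

```python
# "Small open base model + domain instruction data" via LoRA (illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "decapoda-research/llama-7b-hf"  # assumed LLaMA-7B checkpoint id
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# LoRA injects small trainable low-rank matrices into the attention projections,
# so only a tiny fraction of the weights is updated during domain fine-tuning.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # e.g. a few million trainable vs. ~6.7B total

# From here, a standard supervised fine-tuning loop (e.g. transformers.Trainer)
# over the ~8,000 medical instruction examples would complete the recipe.
```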

6. Investment Advice

The development of open-source large models has far-reaching implications. This report selects some directions that may benefit, for the market's attention.

6.1 Microsoft: In-depth cooperation with OpenAI

We believe that in the short term the ChatGPT family remains the most capable large model, and Microsoft will benefit from its deep cooperation with OpenAI.

  • On equity: according to Fortune, after OpenAI's first investors recoup their initial capital, Microsoft is entitled to 75% of OpenAI's profits until it recovers its $13 billion investment; after OpenAI's profits reach $92 billion, Microsoft's share drops to 49%. Meanwhile, other venture investors and OpenAI employees are entitled to 49% of OpenAI's profits until they reach about $150 billion. If those caps are reached, the shares of Microsoft and the investors revert to OpenAI's nonprofit foundation.
  • On products: besides integrating ChatGPT into the Bing search engine, Microsoft announced the Azure OpenAI Service in January 2023, letting Azure global enterprise customers call OpenAI models — including GPT-3.5, Codex, and DALL-E — directly on the cloud platform. Shortly thereafter, Microsoft announced the integration of GPT-4 into the new Bing and into Copilot, the upgraded Office.

6.2 Nvidia: Open-source large models drive the popularity of applications, and the demand for computing power is soaring

Computing-power services are a direction with strong, high-certainty benefits in the open-source large-model wave. Nvidia has a clear lead in integrating software and hardware and is the current leader in AI computing power.

6.2.1 The demand for computing power of super-large models will maintain high growth

Super-large models have outstanding quality advantages, the market will keep pursuing them, and their demand for computing power will keep growing. Super-large models are highly expressive and accurate, giving them a quality edge; as their scale, their datasets, and their daily active usage keep expanding, the computing power they require will keep increasing.

6.2.2 The rapid catch-up of open source large models will also benefit computing power

In the short term, the market will take a wait-and-see attitude toward open-source large models: their versatility is weaker and they cannot compete with the super-large models in the near term, and it is currently difficult to evaluate their specific performance systematically. The market is waiting for them to prove their performance and advantages.

**In the medium to long term, open-source large models are expected to improve further and take a larger share of the market.** Compared with super-large models, open-source large models demand less computing power and are easier to deploy, and they can be optimized for particular professional fields through rapid fine-tuning, which makes them attractive and practical. In the medium to long term, if an open-source large model approaches or surpasses ChatGPT in quality, market demand for such models may rise rapidly — and with it, the corresponding demand for computing power.

6.2.3 Catalyst: Development of Open Source Large Model Licenses, Standards and Capability Evaluation System

  • Licenses: we believe the license system long cultivated in the open-source community enriches developers' choices and helps large models pick suitable licenses, thereby promoting commercial application. The prosperity of large models will clearly drive market demand for computing power.
  • Standards: we expect the large-model community may also produce standards similar to the Linux development standard LSB. Appropriate standardization will keep the large-model ecosystem from fragmenting. We are optimistic that the continued vitality of the open-source community will lift the performance of computing-power providers such as Nvidia.
  • Large-model capability-evaluation systems: a credible capability-evaluation system will help the market quickly tell large models apart and contribute to the development of the large-model track.

6.3 Meta: Open source "vanguard", benefiting from the open source ecology

Looking back at Android's development, we are optimistic about the Google-like role in a "Google-Android"-style system: Google, as developer of the open-source Android operating system, used open source as a tool to stimulate the ecosystem upstream and downstream and to increase the exposure of its proprietary services to end customers.

Mapped onto large models, we believe Meta, having open-sourced LLaMA, may use LLaMA to deepen cooperation with downstream large-model developers and sell the proprietary products in its own ecosystem to those customers.

6.4 Other

6.4.1 Edge Computing Power + Open Source Model: Landing Accelerator for AI Applications

Edge computing power places inference on users' own devices. This not only improves the speed and efficiency of data processing, thereby reducing inference cost, but also protects user privacy and security.

  • Smart modules: the best vehicle for edge computing power, and the most certain, most flexible beneficiary as embodied-AI products ship in volume in the future. Suggested attention: MeiG Smart, Fibocom.
  • Edge IDC: with its latency and cost advantages, an effective supplement for meeting a "tiered" distribution of computing power. Suggested attention: Longyu shares, Wangsu Technology.
  • Optical modules: Zhongji InnoLight, Xinyisheng, Tianfu Communication, Yuanjie Technology.
  • Traditional IoT communication chip and equipment manufacturers: expected to benefit as the industry cycle turns upward. Suggested attention: ZTE, Fii, Tsinghua Unigroup, Ruijie Networks, Feiling Kesi, Aojie Technology, Chuling Information.

6.4.2 Big data companies: Optimistic about the combination of "open source large model + self-owned massive data"

For enterprises that "have a lot of data but insufficient computing power", using their own data to fully pre-train and fine-tune open-source commercial models is more cost-effective. This can improve the accuracy and applicability of the model, and can also greatly reduce the model training time and cost. In addition, the fine-tuned model can better meet the specific needs and business scenarios of the enterprise, thereby enhancing the competitiveness and innovation capabilities of the enterprise. With the continuous development and popularization of technology, independent fine-tuning models have become an important means for enterprises to use their own data to quickly realize intelligent applications.

6.4.3 Open Source Large Model Service Provider: Service First

Looking back at Red Hat's history, we believe that even as large models enter the open-source era, 24x7 customer-facing service remains essential, especially for enterprises. We are optimistic about open-source large-model service providers.

6.4.4 Apple: Get ChatGPT App Revenue Share

ChatGPT is now listed on the App Store, and under the App Store's usual practice Apple will take a share of its revenue.
