For answering questions, a LLaMA-based model comparable to GPT-4, such as Guanaco+QLoRA, is typically selected. For summarizing content, a comprehensive BERT-based model, like RoBERTa, is generally chosen.

What are LLMs?


Large Language Models (LLMs) are foundation models that apply deep learning to natural language processing (NLP) and natural language generation (NLG) tasks. They are designed to learn the complexity and relationships of language by being pre-trained on vast amounts of data. After pre-training, these models can be adapted to specific tasks through techniques such as fine-tuning, in-context learning, and zero/one/few-shot learning. They have the ability to understand context, generate human-like text, and perform a wide range of tasks.
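As a minimal illustration of in-context (few-shot) learning, the sketch below describes the task entirely in the prompt, with no weight updates; the checkpoint name is a placeholder and can be swapped for any capable causal language model.

# A minimal sketch of in-context (few-shot) learning: the task is described
# entirely in the prompt and no weights are updated. "gpt2" is a placeholder
# checkpoint; substitute any instruction-capable causal LM.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder model

few_shot_prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "peppermint => "
)

print(generator(few_shot_prompt, max_new_tokens=5)[0]["generated_text"])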



Why do LLMs matter to Enterprises?


LLMs are powerful tools that help organizations enhance customer experiences and extract valuable insights from unstructured data. Key benefits include:

  • Enhanced customer experiences
  • Improved decision-making
  • Increased operational efficiency
  • Enhanced innovation
  • Competitive advantage

Developing An LLM Strategy


Key steps in developing an LLM strategy include:

  • Identifying use cases
  • Allocating resources
  • Ensuring ethical and responsible AI
  • Partnering with AI experts
  • Continual evaluation and improvement

Most popular Enterprise LLMs


LLM | Company | Description
BERT | Google | Bidirectional Encoder Representations from Transformers (BERT) is a family of language models introduced in 2018 by researchers at Google. BERT is based on the transformer architecture.
RoBERTa | Facebook | Robustly Optimized BERT approach (RoBERTa) was introduced by Facebook and is based on Google's BERT. RoBERTa adds pre-training improvements that achieve state-of-the-art results on several benchmarks, using only unlabeled text from the world-wide web, with minimal fine-tuning and no data augmentation.
LLaMA | Meta | LLaMA (Large Language Model Meta AI) is a large language model (LLM) released by Meta AI in February 2023. A variety of model sizes were trained, ranging from 7 billion to 65 billion parameters.

There are many other models as well, such as ALBERT, ELECTRA, DistilBERT, XLNet, T5, and ChatGLM.
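As a quick orientation, the hedged sketch below shows how the BERT-family checkpoints above can be loaded through the Hugging Face Transformers Auto* classes; the checkpoint names are public Hub identifiers, and official LLaMA weights are gated and require separate access, so they are not loaded here.

# A hedged sketch of loading two of the models above via Hugging Face
# Transformers; the checkpoint names are public Hub identifiers.
from transformers import AutoModel, AutoTokenizer

for checkpoint in ["bert-base-uncased", "roberta-base"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    print(checkpoint, "hidden size:", model.config.hidden_size)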

Criteria for choosing an LLM


The following criteria can help an enterprise choose a suitable LLM:

  • Publishing company: a large company with a strong team is more likely to produce a stable, well-maintained model.
  • Technology: models built on popular, widely adopted technologies are easier to troubleshoot and extend, thanks to their larger ecosystems.
  • Complexity: choose the simplest LLM that meets the business needs.
  • Cost and hardware: an LLM that can run on a consumer GPU such as an NVIDIA RTX 4090 is preferable to one that requires an A100.
  • License: models under permissive open-source licenses such as Apache and MIT take priority over those under commercial licenses.

Steps for training an LLM


Training a large language model (LLM) like LLaMA involves several key steps:

  • Data Collection: The first step in training an AI model is gathering a large and diverse dataset. For language models, this typically includes a wide range of text. Note that the trained model does not retain knowledge of which documents were part of its training set, nor can it access any specific documents or sources.
  • Tokenization: This involves breaking the input text into smaller pieces. The choice of tokenization can affect the performance of the model and is often language-dependent (see the sketch after this list).
  • Pre-training: This is the initial phase of training a large language model. Pre-training helps the model learn the basic grammar and facts about the world, as well as some amount of reasoning ability. However, because this stage is unsupervised, the model might not be particularly good at any one specific task after pre-training alone.
  • Fine-tuning: After pre-training, the model is further fine-tuned on a narrower dataset, often with human oversight. This process helps the model generalize from the broad patterns it learned during pre-training to specific tasks or domains.
  • Evaluation and Testing: The trained model is evaluated on various metrics to ensure it meets the required performance standards. Testing is often performed on a holdout set of data that the model has never seen during training to check for overfitting.
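The tokenization step mentioned above can be sketched as follows, assuming a BERT-style WordPiece tokenizer from Hugging Face Transformers; other models ship their own tokenizers (LLaMA, for example, uses a SentencePiece-based BPE tokenizer).

# A minimal tokenization sketch, assuming the bert-base-uncased WordPiece
# tokenizer; the sentence is illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Large Language Models learn from vast amounts of text."
tokens = tokenizer.tokenize(text)   # sub-word pieces
ids = tokenizer.encode(text)        # pieces mapped to vocabulary ids
print(tokens)
print(ids)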

Training large language models requires significant computational resources, as well as careful monitoring to ensure they learn in a way that is ethical and aligned with human values. If you do not have enough resources or experience to train your own LLMs, consider using pre-trained models or third-party APIs.

General datasets and formats


1. SQuAD (Stanford Question Answering Dataset)

SQuAD is a widely used deep learning question-answering dataset. SQuAD focuses on the task of question answering: it tests a model's ability to read a passage of text and then answer questions about it. The SQuAD dataset is typically in JSON format and includes the following fields:

  • version: the dataset version.
  • data: a set of titles and paragraphs.
  • title: the article title.
  • paragraphs: a set of contexts and question/answer pairs.
    - context: the paragraph content.
    - qas: a set of questions and answers.
      - question: the text of the question.
      - id: the identifier of the question.
      - answers: a list of answers.
        - text: the text of the answer.
        - answer_start: the character offset at which the answer starts within the context.

The following example shows the JSON data format based on SQuAD data.

{
    "version": "1.1",
    "data": [{
        "title": "Super Bowl 50",
        "paragraphs": [{
            "context": "Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24-10 to earn their third Super Bowl title.",
            "qas": [{
                "question": "Which NFL team won Super Bowl 50?",
                "id": "56be4db0acb8001400a502ec",
                "answers": [{
                    "text": "Denver Broncos",
                    "answer_start": 163
                }]
            },
            {
                "question": "What was the score of Super Bowl 50?",
                "id": "56be4db0acb8001400a502ed",
                "answers": [{
                    "text": "24-10",
                    "answer_start": 189
                }]
            }]
        }]
    }]
}       
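A short sketch of reading this format is shown below; the file name is a placeholder for your own SQuAD-style dataset.

# Walk a SQuAD-style JSON file using the fields described above.
# "squad_like.json" is a placeholder path.
import json

with open("squad_like.json", encoding="utf-8") as f:
    dataset = json.load(f)

for article in dataset["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            answer = qa["answers"][0]
            # answer_start is a character offset into the context
            assert context[answer["answer_start"]:].startswith(answer["text"])
            print(qa["question"], "->", answer["text"])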
            

2. Stanford Alpaca: alpaca_data.json

It contains the 52K instruction-following examples used for fine-tuning the Alpaca model. The JSON file is a list of dictionaries, each of which contains the following fields:

  • instruction: it describes the task the model should perform. Each of the 52K instructions is unique.
  • input: optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article.
  • output: the answer to the instruction as generated by text-davinci-003.

Here is an example of the JSON data:

[{
    "instruction": "Rewrite the following sentence in the third person",
    "input": "I am anxious",
    "output": "She is anxious."
}]
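The sketch below turns one such record into a training prompt, using a template similar to the one published by the Stanford Alpaca project; the exact wording of the template is illustrative.

# Build a training prompt from one alpaca_data.json record.
def build_prompt(example: dict) -> str:
    if example.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )

print(build_prompt({
    "instruction": "Rewrite the following sentence in the third person",
    "input": "I am anxious",
    "output": "She is anxious.",
}))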
            

There are other datasets and formats as well; you only need to find the format required by the model you choose, or you can build your own proprietary dataset according to the format required by the model and your business.

Fine-tuning


Please refer to the relevant resources such as Huggingface Models, Huggingface Transformers Examples and the corresponding GitHub repository of the model, as well as documentation for PyTorch and TensorFlow to train a small model for your specific business needs.
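As a starting point, the hedged sketch below fine-tunes a small BERT-family model for text classification with the Transformers Trainer API; the checkpoint, dataset, and hyperparameters are illustrative placeholders rather than a production recipe.

# A hedged fine-tuning sketch using the Hugging Face Trainer; checkpoint,
# dataset, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"            # placeholder small model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

dataset = load_dataset("imdb")                    # placeholder labeled dataset

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="out", per_device_train_batch_size=8,
                         num_train_epochs=1)

trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)))
trainer.train()
trainer.save_model("out")   # writes model and tokenizer files to ./out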

Local Deployment


If you need to deploy and use your own trained small model locally, please refer to the relevant documentation on Huggingface Models and Huggingface Transformers, the corresponding GitHub repository of the model, as well as the documentation for PyTorch and TensorFlow.
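A minimal local-inference sketch is shown below; it assumes the fine-tuned checkpoint from the previous section was saved to a local directory (the path is a placeholder).

# Serve predictions from a locally saved checkpoint with a Transformers
# pipeline; "./out" is a placeholder for wherever the model was saved.
from transformers import pipeline

classifier = pipeline("text-classification", model="./out")
print(classifier("The onboarding process was quick and painless."))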


BERT


BERT (Bidirectional Encoder Representations from Transformers) is a well-known pre-trained language model developed by researchers at Google in 2018. BERT is an open source machine learning framework for natural language processing (NLP). BERT is designed to help computers understand the meaning of ambiguous language in text by using surrounding text to establish context.

BERT is based on Transformers, a deep learning model in which every output element is connected to every input element, and the weightings between them are dynamically calculated based upon their connection. Using this bidirectional capability, BERT is pre-trained on two different, but related, NLP tasks: Masked Language Modeling and Next Sentence Prediction.

BERT is pre-trained on a massive amount of text data from sources such as English Wikipedia and BooksCorpus. During pre-training, BERT learns to predict masked words in sentences and to determine whether one sentence follows another in a pair. This process helps BERT learn contextual representations of words.
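The masked-word prediction described above can be tried directly with the fill-mask pipeline and the public bert-base-uncased checkpoint, as in this small sketch.

# Masked Language Modeling in action: BERT predicts the [MASK] token from
# its surrounding context.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("BERT uses the surrounding text to establish [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))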

After pre-training, BERT is fine-tuned on specific NLP tasks such as sentiment analysis, abstract summarization, question answering, etc., to adapt to specific tasks. Fine-tuning involves training BERT on labeled data specific to the task, allowing it to specialize for the task. This transfer learning approach enables BERT to leverage its general language understanding capability and apply it to various NLP tasks without requiring extensive task-specific training.

Many organizations fine-tune the BERT model architecture with supervised training, either to optimize it for efficiency or to specialize it for certain tasks by further pre-training it on domain-specific text.


RoBERTa


RoBERTa (Robustly Optimized BERT approach) is an improved and optimized version of BERT, proposed by Facebook AI in 2019. The goal of RoBERTa is to enhance the performance of BERT by using a larger model size, longer training time, and richer data.

Some of the improvements in RoBERTa compared to BERT include:

  • Larger-scale training: RoBERTa keeps the BERT architecture but is trained with much larger batches and drops the next-sentence-prediction objective, improving its representation power.
  • Longer training time: RoBERTa utilizes longer training time during the pre-training phase, allowing the model to learn language representations more thoroughly.
  • Dynamic masking: RoBERTa employs a dynamic masking strategy during pre-training, in which the masking pattern is regenerated each time a sequence is fed to the model rather than being fixed during preprocessing, forcing the model to better understand the context (see the sketch below).
  • Richer data: RoBERTa is trained on a much larger amount of text data than BERT (roughly 160 GB versus 16 GB), enabling the model to learn a wider range of language features and patterns.

These improvements in RoBERTa result in significant performance gains across various natural language processing tasks. It surpasses BERT in many benchmark tests and achieves state-of-the-art results in several NLP competitions.
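The dynamic masking improvement can be reproduced with the Transformers masked-language-modeling data collator, as sketched below; the roberta-base checkpoint and the sentence are illustrative.

# Dynamic masking sketch: the mask positions are re-sampled every time a
# batch is built, so the same sentence is masked differently across passes.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

encoded = tokenizer("RoBERTa re-samples the masking pattern for every batch.")
for _ in range(2):   # two passes over the same example yield different masks
    batch = collator([encoded])
    print(tokenizer.decode(batch["input_ids"][0]))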


LLaMA


LLaMA (Large Language Model Meta AI) is a collection of state-of-the-art foundation language models ranging from 7B to 65B parameters. These models are smaller in size while delivering exceptional performance, significantly reducing the computational power and resources needed to experiment with novel methodologies, validate the work of others, and explore innovative use cases.

LLaMA, an auto-regressive language model, is built on the transformer architecture. Like other prominent language models, LLaMA functions by taking a sequence of words as input and predicting the next word, recursively generating text.
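This next-word loop can be sketched with the generic causal-LM interface in Transformers; the checkpoint name is a placeholder, since official LLaMA weights are gated and require separate access.

# Auto-regressive generation sketch; the checkpoint is a placeholder and any
# causal LM on the Hub can be substituted.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "meta-llama/Llama-2-7b-hf"   # placeholder; access to LLaMA weights is gated
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("Large language models matter to enterprises because", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)   # predicts one token at a time
print(tokenizer.decode(outputs[0], skip_special_tokens=True))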

In terms of performance, LLaMA shows impressive capabilities. The LLaMA model with 13 billion parameters outperforms GPT-3 (which has 175 billion parameters) on most benchmarks, and it can run on a single V100 GPU. Furthermore, the largest LLaMA model, with 65 billion parameters, can rival Google's Chinchilla-70B and PaLM-540B models.

The following table lists various improved models based on LLaMA that have been fine-tuned for specific tasks or applications.

LLM | Company | Description
Alpaca+LoRA | Stanford University | Alpaca is an AI language model developed by a team of researchers from Stanford University. It builds on LLaMA, Meta's large-scale language model, and uses outputs from OpenAI's GPT model (text-davinci-003) to fine-tune the 7B-parameter LLaMA model.
Vicuna | UC Berkeley, CMU, Stanford | Vicuna is an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. It is an auto-regressive language model based on the transformer architecture.
Koala | Berkeley AI Research (BAIR) | Koala is a version of LLaMA fine-tuned on dialogue data scraped from the web and public datasets, including high-quality responses to user queries from other large language models, as well as question-answering and human-feedback datasets.
Guanaco+QLoRA | University of Washington | Guanaco is an advanced instruction-following language model built on Meta's LLaMA 7B model. QLoRA enables the fine-tuning of large language models on a single GPU.
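Several of the models above rely on LoRA or QLoRA adapters to make single-GPU fine-tuning feasible. The hedged sketch below shows the general pattern with the peft library; the base checkpoint and target modules are illustrative and should be adjusted to the model you actually use.

# LoRA sketch with peft: only small adapter matrices are trained, which is
# what makes single-GPU fine-tuning of LLaMA-sized models practical.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")   # placeholder base model
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")

model = get_peft_model(base, lora)
model.print_trainable_parameters()   # only the LoRA adapters are trainable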