The global market for generative AI is nearing a significant turning point: it is currently valued at USD 8 billion and projected to grow at a CAGR of 34.6% through 2030. With over 85 million jobs anticipated to remain vacant by that year, enhancing operations through AI and automation is critical to achieving the efficiency, effectiveness, and experiences that business leaders and stakeholders demand. Alongside ChatGPT, some of the notable names in the AI revolution are BERT, GPT-3, DALL-E 2, LLaMA, and BLOOM, and all of them are foundation models. Read on to learn what foundation models are in generative AI.
What is a Foundation Model in Generative AI?
Foundation models are extensive AI systems trained on vast amounts of unlabeled data through self-supervised learning. This training process creates versatile models that can execute a variety of tasks with impressive accuracy, such as image classification, natural language processing, and question answering. A foundation model is essentially a neural network pre-trained on extensive datasets. 

Example of Foundation Models
A notable example of a foundation model is Florence, developed by Microsoft. This model powers Azure AI Vision’s production-ready computer vision services, enabling capabilities such as image analysis, text recognition, and face identification using pre-built image tagging.
Foundation Model Layers
Unlike traditional models that are built from scratch for particular tasks, foundation models utilise a layered training approach:
 
- Base Layer: This involves generic pre-training on extensive data, enabling the model to learn from diverse content, including text and images.
- Middle Layer: This stage involves domain-specific refinement, honing the model’s focus on particular areas.
- Top Layer: This final layer fine-tunes the model’s performance for specific applications, such as text generation, image recognition, or other AI tasks.
 
Prior to the advent of these large foundation models, the common practice was to develop models for specific tasks from scratch. This method was resource-intensive, time-consuming, and depended heavily on large labelled datasets.
Importance of Foundation Models
There are four main reasons why foundation models are essential:
 
- Unified Solution: Foundation models are incredibly powerful, eliminating the need to train separate models for different tasks. One single model can now address multiple problems.
- Simplified Training: Training foundation models is straightforward because it does not rely on labelled data. Minimal effort is needed to adapt them to specific tasks.
- Task Agnosticism: Without foundation models, achieving high performance for specific tasks would require vast amounts of labelled data. However, foundation models only need a few examples to be tailored to a given task. We will soon delve into the details of how to utilise foundation models for our purposes.
- High Performance: Foundation models enable the creation of high-performance models for various tasks. The leading architectures in Natural Language Processing and Computer Vision are built upon foundation models.
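One common way to tailor a model with only a few examples is few-shot prompting, where a handful of worked examples are placed in the prompt ahead of the new input. The sketch below is a minimal illustration of that idea, not a method prescribed by any particular model. The function name `build_few_shot_prompt`, the "Input:"/"Output:" layout, and the sample reviews are all hypothetical choices made for illustration.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: instruction, worked examples, then the new input."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    # The final block leaves "Output:" open for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Hypothetical labelled examples; two are enough to show the pattern.
examples = [
    ("The battery lasts all day.", "positive"),
    ("The screen cracked within a week.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    examples,
    "Setup was quick and painless.",
)
print(prompt)
```

The resulting string would then be sent to whichever text-generation model is in use. No model call is shown here because that step depends on the provider.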
Types of Foundation Models
There are two categories of foundation models:
 
- Large Language Models (LLMs).
- Diffusion Models.
Large Language Models (LLMs)
Large Language Models (LLMs) are intricate machine learning models or systems crafted to comprehend and generate text resembling human language using deep learning methodologies. These systems undergo intensive training on extensive datasets of text, empowering them to accomplish diverse language-centric tasks such as translation, summarization, and question-answering. Transformer-based LLMs such as GPT-3, BERT, and RoBERTa have attracted considerable interest owing to their outstanding proficiency in natural language processing tasks.
Diffusion Models
Diffusion Models, on the other hand, specialise in generating data resembling their training input. These models operate by injecting Gaussian noise into the training data and subsequently learning to reverse this noise process to reconstruct the original data. Diffusion models have showcased promising outcomes across numerous applications, such as image and speech synthesis. They are renowned for their ability to produce high-fidelity samples with intricate details; notable diffusion models include DALL-E 2, Imagen, and GLIDE.
How are Foundation Models Trained?
Foundation models are trained on unlabeled datasets using a self-supervised approach. In self-supervised learning, there are no explicitly labelled datasets; instead, labels are generated automatically from the dataset itself, and the model is trained in a supervised manner. This is the key difference between supervised learning and self-supervised learning.
 
Various foundation models exist in NLP and computer vision, but the core training principle remains similar. Let’s delve into the training process.
Large Language Models (LLMs) serve as the foundation models for NLP. These models, though differing in architecture, share a common learning objective: predicting missing tokens in a sentence. The missing token can either be the next token or be located anywhere within the text.
 
Based on their learning objectives, LLMs can be categorised into two types: Causal LLMs and Masked LLMs. For instance, GPT is a Causal LLM trained to predict the next token in the text, while BERT is a Masked LLM designed to predict missing tokens scattered throughout the text.
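The two objectives can be sketched on a toy sentence to show how the labels come from the text itself. This is a simplified illustration under stated assumptions: whitespace splitting stands in for the subword tokenizers real models use, the `[MASK]` symbol and the function names are hypothetical, and the masked positions are chosen by hand rather than sampled at random.

```python
MASK = "[MASK]"  # placeholder token; the exact symbol is an assumption


def causal_examples(tokens):
    """Next-token prediction: each prefix is an input, the following token its label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_example(tokens, positions):
    """Masked prediction: hide the chosen positions and keep the originals as labels."""
    hidden = set(positions)
    masked = [MASK if i in hidden else tok for i, tok in enumerate(tokens)]
    labels = {i: tokens[i] for i in sorted(hidden)}
    return masked, labels


# Whitespace tokenization is a simplification for this sketch.
tokens = "the cat sat on the mat".split()

# Causal objective: the label is always the token that comes next.
for context, target in causal_examples(tokens):
    print(context, "->", target)

# Masked objective: the labels are the hidden tokens, wherever they sit.
masked, labels = masked_example(tokens, positions=[1, 4])
print(masked, labels)
```

In both cases no human annotation is involved: the training targets are simply tokens that were already present in the unlabeled text.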
Diffusion models are trained on a different principle. Imagine having an image of a dog and applying Gaussian noise to it, resulting in a noisy image. Repeating this process multiple times eventually produces an image that is completely noise-filled and unrecognisable. Diffusion models specialise in reversing this process, transforming the noisy image back into its original form.
 
In simple terms, diffusion models learn to denoise images through a two-step process: Forward Diffusion and Reverse Diffusion.
 
- Forward Diffusion: During this step, the training image is progressively turned into an unrecognisable state through a fixed process that doesn’t require a learning network. This differs from Variational Autoencoders (VAEs), which jointly train an encoder and decoder to convert an image to a latent space and back.
 
- Reverse Diffusion: This is where the actual learning occurs. The unrecognisable image is converted back to its original form through a single network trained to reverse the noise. This process is akin to reversing the diffusion of ink in water to its original state.
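The forward step can be sketched in a few lines, since it involves no learning. The example below is a minimal illustration, not a full diffusion implementation. The update rule (scale the previous image by sqrt(1 - beta), then add Gaussian noise scaled by sqrt(beta)) follows the commonly used DDPM-style formulation, which is an assumption here because the text above does not specify one. The noise schedule, the five-value "image", and the function name are hypothetical. The reverse step, which would require training a neural network, is not shown.

```python
import math
import random


def forward_diffusion(pixels, betas, rng):
    """Return the list of progressively noisier copies of `pixels`.

    Each step applies x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * noise,
    with noise drawn from a standard Gaussian. No parameters are learned here.
    """
    states = [list(pixels)]
    for beta in betas:
        previous = states[-1]
        noisy = [
            math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
            for x in previous
        ]
        states.append(noisy)
    return states


rng = random.Random(0)                        # seeded for repeatable runs
image = [0.0, 0.25, 0.5, 0.75, 1.0]           # stand-in for a row of pixel values
betas = [0.02 * (t + 1) for t in range(10)]   # hypothetical noise schedule
states = forward_diffusion(image, betas, rng)
print(len(states), states[-1])
```

During training, a reverse-diffusion network would be shown noisy states like these and asked to recover the less noisy version, which is where the learning described above takes place.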
Challenges with Foundation Models
Foundation models possess the capability to intelligently respond to prompts even on topics they haven’t explicitly been trained on. However, they encounter several challenges:
 
- Infrastructure requirements: Creating a foundation model from scratch demands substantial financial investment and extensive resources, with training periods stretching over months.
 
- Front-end development: Developers aiming for practical application must integrate foundation models into a software stack, incorporating tools for prompt construction, fine-tuning, and pipeline development.
 
- Lack of comprehension: Despite their ability to furnish grammatically and factually accurate responses, foundation models struggle to grasp the contextual nuances of a prompt. Moreover, they lack social and psychological awareness.
 
- Unreliable responses: Answers provided by foundation models regarding certain subjects can be unreliable, occasionally veering towards inappropriate, toxic, or erroneous responses.
 
- Bias: Foundation models are susceptible to bias, potentially absorbing hate speech and inappropriate connotations from their training data. To mitigate this risk, developers should meticulously curate training datasets and embed specific norms into their models.
Wrapping Things Up
Foundation models are fuelling the current generative AI boom. The potential applications are so vast that every sector and industry, including data science, is likely to be affected by the adoption of AI in the near future. For a comprehensive and accessible resource that keeps you updated on AI developments, join the Integrated Program in Data Science, Artificial Intelligence & Machine Learning.
FAQs
Foundation models utilise self-supervised learning to generate labels directly from input data. This approach means the model hasn't been explicitly trained with labelled datasets. This characteristic distinguishes foundation models from earlier machine learning architectures that rely on supervised or unsupervised learning.
To select a suitable model for your project, identify one that supports your specific task and the language of the text you need to process. Additionally, attributes such as licensing, pre-training data, model size, and the fine-tuning process should be considered.
Generative AI (GenAI) is a type of artificial intelligence that can produce content, such as text, images, music, or videos. A large language model (LLM) is a specific type of AI designed to comprehend, generate, or complete text sequences, making it a generative model specifically tailored for language-related tasks.
Updated on October 1, 2024