Sarvam-1: India's First LLM Optimized for Indian Languages Achieves Breakthroughs in AI

Sarvam AI has made a significant mark on the AI landscape with the launch of Sarvam-1, India's first Large Language Model (LLM) tailored specifically for Indian languages. Despite its compact 2-billion-parameter size, Sarvam-1 delivers strong performance on Indic language benchmarks, surpassing many larger models in accuracy and efficiency. With support for 10 major Indian languages and strong cross-lingual capabilities, Sarvam-1 is designed for India's linguistic diversity, and its computational efficiency and ability to run on edge devices position it for broad AI deployment across the country.

Sarvam-1: Bridging the Gap for Indian Languages in AI

India has taken a notable stride in artificial intelligence with the launch of Sarvam-1, a Large Language Model (LLM) optimized specifically for Indian languages. Developed by Sarvam AI with 2 billion parameters, Sarvam-1 supports 10 Indian languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, and Telugu) in addition to English, marking a significant advance in natural language processing (NLP) for India's multilingual population.


Superior Performance in Indic Language Tasks

Despite having a smaller parameter count than many contemporaries, Sarvam-1 has demonstrated strong performance in benchmark evaluations such as MMLU, ARC-Challenge, and IndicGenBench, even outperforming larger models such as Gemma-2-9B and Llama-3.1-8B on Indic language tasks.

  1. On the TriviaQA benchmark, Sarvam-1 achieved an accuracy of 86.11% across Indic languages, significantly surpassing Llama-3.1-8B's score of 61.47%.
  2. Its performance on IndicGenBench, which measures cross-lingual capabilities such as summarization, translation, and question answering, was equally strong: it recorded an average chrF++ score of 46.81 on the Flores English-to-Indic translation set, again outperforming larger models (a short sketch of how chrF++ is computed follows this list).
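
For readers unfamiliar with the metric, chrF++ is a character n-gram F-score that also counts word n-grams, and it can be computed with the open-source sacrebleu library. The snippet below is a minimal, illustrative sketch on a toy hypothesis/reference pair; it is not a reproduction of the IndicGenBench or Flores evaluation setup.

```python
# Minimal illustration of computing a chrF++ score with the sacrebleu library.
# This is a toy example, not a reproduction of the IndicGenBench / Flores setup.
from sacrebleu.metrics import CHRF

# word_order=2 turns plain chrF into chrF++ (character n-grams plus word bigrams)
chrf_pp = CHRF(word_order=2)

hypothesis = "भारत एक विविधतापूर्ण देश है।"       # a model's Hindi output
reference = "भारत विविधताओं से भरा एक देश है।"     # the reference translation

score = chrf_pp.sentence_score(hypothesis, [reference])
print(score)  # prints a line such as "chrF2++ = <value>" for this toy pair
```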


Enhanced Tokenization for Indian Languages

One of Sarvam-1's standout features is its efficient handling of Indic scripts, addressing a challenge that has historically limited multilingual LLMs. Most existing models suffer from high token fertility (the average number of tokens needed per word), requiring far more tokens for Indian languages than for English. This inefficiency leads to slower processing and weaker model performance.

  1. Sarvam-1’s tokenizer significantly reduces this inefficiency, achieving fertility rates of 1.4 to 2.1 tokens per word on the supported Indic languages, close to the roughly 1.4 tokens per word typical of English text.
  2. This streamlined tokenization enables more efficient training and better performance across diverse Indian languages, making Sarvam-1 a valuable tool for applications ranging from translation to conversational AI (a small measurement sketch follows this list).
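
As a rough illustration, token fertility can be measured directly with a Hugging Face tokenizer. The sketch below assumes the Sarvam-1 tokenizer is published on the Hugging Face hub under the repository name sarvamai/sarvam-1; any other checkpoint can be swapped in to compare fertility on the same text.

```python
# Sketch of measuring token fertility (tokens per word) for a tokenizer.
# Assumes the Sarvam-1 tokenizer is available on the Hugging Face hub as
# "sarvamai/sarvam-1"; substitute any other checkpoint to compare.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sarvamai/sarvam-1")

def token_fertility(text: str) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    tokens = tokenizer.tokenize(text)
    return len(tokens) / max(len(words), 1)

hindi = "भारत एक बहुभाषी देश है जहाँ कई भाषाएँ बोली जाती हैं।"
english = "India is a multilingual country where many languages are spoken."

print(f"Hindi fertility:   {token_fertility(hindi):.2f}")
print(f"English fertility: {token_fertility(english):.2f}")
```

Running the same function with another model's tokenizer makes the fertility gap on Indic text easy to see.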


A Rich and Diverse Training Corpus: Sarvam-2T

Sarvam-1's training is based on the Sarvam-2T corpus, which consists of approximately 2 trillion tokens. The dataset is meticulously curated to include a balanced representation of the 10 supported Indian languages, with Hindi comprising around 20% of the dataset. It also includes a significant portion of English and programming languages, allowing the model to excel in both monolingual and multilingual tasks.

  1. The Sarvam-2T dataset emphasizes high-quality and diverse data, addressing the gaps found in existing Indic datasets like Sangraha. Unlike other datasets that often rely on web-crawled content, Sarvam-2T incorporates longer documents and scientific and technical content, enhancing the model’s capacity for complex reasoning tasks.
  2. This focus on quality helps Sarvam-1 deliver more accurate and nuanced responses, making it a reliable choice for business applications, academic research, and government services that require advanced language processing.


Unmatched Computational Efficiency

Beyond its linguistic capabilities, Sarvam-1 is designed for computational efficiency, offering 4 to 6 times faster inference than larger models such as Gemma-2-9B and Llama-3.1-8B. This makes it particularly suitable for production deployments, including edge devices where computational resources are limited (a rough throughput-measurement sketch appears after the list below).

  1. Sarvam-1's efficiency ensures faster response times in real-world applications, enabling seamless user experiences in areas like voice assistants, customer support automation, and real-time translation.
  2. Its compact architecture allows businesses and developers to integrate advanced NLP capabilities into their products without the need for extensive computational resources.
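
As a sanity check rather than a formal benchmark, decode throughput (tokens generated per second) can be measured in a few lines. The model ID, precision, and generation settings below are illustrative assumptions, not the configuration behind the 4-to-6-times comparison above.

```python
# Rough sketch for measuring decode throughput (tokens/second) of a causal LM.
# Runs on CPU by default; move the model and inputs to a GPU for realistic numbers.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sarvamai/sarvam-1"  # assumed Hugging Face repo name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

prompt = "भारत की प्रमुख भाषाएँ"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} new tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tokens/s")
```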


Powering AI Advancements with Cutting-Edge Infrastructure

The training of Sarvam-1 was completed in five days using 1,024 GPUs on Yotta’s Shakti cluster, leveraging NVIDIA’s NeMo framework for training optimizations. This infrastructure allowed Sarvam AI to keep training times short while making full use of the 2-trillion-token Sarvam-2T corpus, and it positions the model as a scalable solution for a range of industry applications.


Available for Open-Source Development

In a move towards fostering an open-source AI ecosystem, Sarvam-1 is available for download on Hugging Face’s model hub. Developers can explore and leverage its capabilities for a range of Indic language applications, from chatbots and translation services to sentiment analysis and content generation.
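
For developers who want to try it, a minimal sketch of pulling the model from the hub and generating a continuation is shown below; the repository name sarvamai/sarvam-1 and the generation settings are assumptions for illustration, so the model card should be consulted for official usage guidance.

```python
# Minimal sketch: download Sarvam-1 from the Hugging Face hub and generate text.
# The repo name "sarvamai/sarvam-1" and the generation settings are illustrative
# assumptions; check the model card for recommended usage.
from transformers import pipeline

generator = pipeline("text-generation", model="sarvamai/sarvam-1")

prompt = "कर्नाटक की राजधानी"   # "The capital of Karnataka" (Hindi)
result = generator(prompt, max_new_tokens=40, do_sample=False)
print(result[0]["generated_text"])
```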

This open availability encourages the AI developer community to build upon Sarvam-1’s capabilities, fostering innovation in Indian AI applications and helping address the digital divide for non-English speakers.


A New Era for AI in India

With Sarvam-1, India has taken a significant step forward in democratizing access to advanced AI for its diverse linguistic landscape. The model addresses the long-standing challenges of developing AI solutions for low-resource languages, creating opportunities for businesses, startups, and public institutions to engage with Indian communities in their native languages.

As Sarvam-1 gains traction in the AI landscape, it highlights the potential for AI-driven transformation in India, enabling inclusive digital growth and language preservation through cutting-edge NLP. This development is poised to transform how Indian businesses and institutions leverage AI, making advanced language understanding and cross-lingual communication more accessible than ever before.


With its focus on performance, efficiency, and linguistic inclusivity, Sarvam-1 is not just another LLM—it is a breakthrough for India’s AI ecosystem. The model represents a promising future where AI can truly understand and engage with India's diverse linguistic heritage, offering smarter solutions that cater to the needs of millions of Indian language speakers.


Source: Analytics India Magazine / ChatGPT