Author: Kunal Kejriwal

  • SHOW-O: A Single Transformer Uniting Multimodal Understanding and Generation

    Significant advancements in large language models (LLMs) have inspired the development of multimodal large language models (MLLMs). Early MLLM efforts, such as LLaVA, MiniGPT-4, and InstructBLIP, demonstrate notable multimodal understanding capabilities. To integrate LLMs into multimodal domains, these studies explored projecting features from a pre-trained modality-specific encoder, such as CLIP, into the input space of
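    A minimal sketch of the projection idea described in this excerpt: per-patch features from a frozen vision encoder (such as CLIP) are mapped into the language model's token-embedding space by a small learned projector and prepended to the text embeddings. The dimensions and module structure below are illustrative assumptions, not the SHOW-O implementation.

    ```python
    import torch
    import torch.nn as nn

    class VisualProjector(nn.Module):
        """Maps frozen vision-encoder features into the LLM embedding space."""
        def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
            super().__init__()
            # A small MLP projector, in the style of LLaVA-like models.
            self.proj = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
            # vision_feats: (batch, num_patches, vision_dim) from the frozen encoder
            return self.proj(vision_feats)  # (batch, num_patches, llm_dim)

    # Usage: concatenate projected visual tokens with text token embeddings
    # before feeding the joint sequence to the language model.
    projector = VisualProjector()
    vision_feats = torch.randn(1, 256, 1024)   # placeholder CLIP-style patch features
    text_embeds = torch.randn(1, 32, 4096)     # placeholder text token embeddings
    llm_inputs = torch.cat([projector(vision_feats), text_embeds], dim=1)
    ```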

  • EAGLE: Exploring the Design Space for Multimodal Large Language Models with a Mixture of Encoders

    The ability to accurately interpret complex visual information is a crucial focus of multimodal large language models (MLLMs). Recent work shows that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. Several recent MLLMs achieve this by utilizing a mixture of vision encoders. Despite
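    An illustrative sketch of the "mixture of vision encoders" idea: per-patch features from several encoders are fused and projected to a common dimension before reaching the language model. The specific encoders, resolutions, and the channel-wise concatenation used here are assumptions for illustration; EAGLE's point is precisely to compare such fusion choices.

    ```python
    import torch
    import torch.nn as nn

    class MixtureOfEncoders(nn.Module):
        """Fuses features from multiple vision encoders by channel concatenation."""
        def __init__(self, encoder_dims=(1024, 768), out_dim: int = 4096):
            super().__init__()
            self.fuse = nn.Linear(sum(encoder_dims), out_dim)

        def forward(self, feats_per_encoder):
            # Each element: (batch, num_patches, dim_i), aligned to the same patch grid.
            fused = torch.cat(feats_per_encoder, dim=-1)  # channel-wise concatenation
            return self.fuse(fused)                       # (batch, num_patches, out_dim)

    mixer = MixtureOfEncoders()
    clip_like = torch.randn(1, 576, 1024)       # e.g. a CLIP-style ViT encoder
    convnext_like = torch.randn(1, 576, 768)    # e.g. a ConvNeXt-style encoder
    visual_tokens = mixer([clip_like, convnext_like])  # tokens passed on to the LLM
    ```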

  • Sapiens: Foundation for Human Vision Models

    The remarkable success of large-scale pretraining followed by task-specific fine-tuning has established this approach as standard practice in language modeling. Similarly, computer vision methods are progressively embracing extensive data scales for pretraining. The emergence of large datasets, such as LAION-5B, Instagram-3.5B, JFT-300M, LVD-142M, Visual Genome, and YFCC100M, has enabled the exploration of a data

  • LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs

    Current long-context large language models (LLMs) can process inputs up to 100,000 tokens, yet they struggle to generate outputs exceeding even a modest length of 2,000 words. Controlled experiments reveal that the model’s effective generation length is inherently limited by the examples seen during supervised fine-tuning (SFT). In other words, this output limitation stems from

  • SGLang: Efficient Execution of Structured Language Model Programs

    Large language models (LLMs) are increasingly utilized for complex tasks requiring multiple generation calls, advanced prompting techniques, control flow, and structured inputs/outputs. However, efficient systems for programming and executing these applications are lacking. SGLang, a newly introduced system, aims to address this by providing efficient execution of complex language model programs. SGLang comprises a frontend
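    A small program in the style of SGLang's frontend language, modeled on the project's published examples (exact API details may differ across versions, and the endpoint address is an assumption). It chains two dependent generation calls inside one structured program, which is the kind of workload the runtime is built to execute efficiently.

    ```python
    import sglang as sgl

    @sgl.function
    def qa_with_followup(s, question):
        s += sgl.system("You are a helpful assistant.")
        s += sgl.user(question)
        s += sgl.assistant(sgl.gen("answer", max_tokens=128))
        # A second, dependent generation call within the same program.
        s += sgl.user("Summarize your answer in one sentence.")
        s += sgl.assistant(sgl.gen("summary", max_tokens=32))

    # Point the frontend at a running SGLang server (address is illustrative).
    sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
    state = qa_with_followup.run(question="What does KV-cache reuse speed up?")
    print(state["answer"], state["summary"])
    ```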

  • DIAMOND: Visual Details Matter in Atari and Diffusion for World Modeling

    The idea of performing reinforcement learning inside a neural network world model was first introduced in 2018, and this fundamental principle was soon applied to world models more broadly. One of the most prominent models to implement this approach is the Dreamer framework, which introduced reinforcement learning from the latent space of a

  • In-Paint3D: Image Generation using Lighting-Less Diffusion Models

    The advent of deep generative AI models has significantly accelerated the development of systems with remarkable capabilities in natural language generation, 3D generation, image generation, and speech synthesis. 3D generative models have transformed numerous industries and applications, revolutionizing the current 3D production landscape. However, many current deep generative models encounter a common roadblock: complex wiring

  • MARKLLM: An Open-Source Toolkit for LLM Watermarking

    LLM watermarking, which integrates imperceptible yet detectable signals within model outputs to identify text generated by LLMs, is vital for preventing the misuse of large language models. These watermarking techniques are mainly divided into two categories: the KGW Family and the Christ Family. The KGW Family modifies the logits produced by the LLM to create
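    A simplified sketch of the KGW-style logit modification mentioned above: the vocabulary is pseudo-randomly split into a "green" and a "red" list seeded by the previous token, and a small bias is added to the green-token logits before sampling, so that watermarked text contains a detectably high fraction of green tokens. Parameter values are illustrative, and this is a sketch of the underlying scheme, not the MARKLLM toolkit's own API.

    ```python
    import torch

    def kgw_bias_logits(logits: torch.Tensor, prev_token: int,
                        gamma: float = 0.25, delta: float = 2.0) -> torch.Tensor:
        """logits: (vocab_size,) next-token scores produced by the LLM."""
        vocab_size = logits.shape[0]
        # Seed a generator with the previous token so detection can replay the split.
        gen = torch.Generator().manual_seed(hash(prev_token) % (2**31))
        # Pseudo-random permutation of the vocabulary; the first gamma*|V| tokens are "green".
        perm = torch.randperm(vocab_size, generator=gen)
        green = perm[: int(gamma * vocab_size)]
        biased = logits.clone()
        biased[green] += delta  # favor green tokens so the output carries the watermark
        return biased

    # During detection, the same seeding rule recovers the green list at each position,
    # and an unusually high count of green tokens signals LLM-generated, watermarked text.
    ```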