IBM Releases Granite 3.3 8B Speech Recognition Model
IBM has released Granite Speech 3.3 8B, a speech-to-text (STT) model for automatic speech recognition (ASR) and automatic speech translation (AST) built on top of Granite 3.3 8B Instruct, the latest update to IBM's enterprise large language model (LLM).
Alongside enhanced reasoning capabilities, the Granite 3.3 Instruct models now offer fill-in-the-middle (FIM) capabilities in addition to standard next-token prediction.
To enhance Granite-driven applications, IBM is also releasing a suite of retrieval augmented generation (RAG)-focused LoRA adapters for Granite 3.2. IBM Research has also developed a series of activated LoRAs (aLoRAs), an experimental form of low-rank adaptation (LoRA) that cuts inference costs and memory requirements while enabling seamless switching between adapters.
All Granite models and tools are released open source under a standard Apache 2.0 license. The models and associated tools are available on Hugging Face. Granite 3.3 Instruct is also available on IBM watsonx.ai, as well as through platform partners including LM Studio, Ollama, and Replicate.
Joining Granite Speech 3.3 8B are Granite 3.3 8B Instruct, the LLM that serves as its foundation, and its smaller 2B counterpart. The text models' more sophisticated reasoning and new fill-in-the-middle (FIM) capabilities facilitate a wider array of use cases.
IBM is also releasing an updated and expanded series of LoRA adapters for the previously released Granite 3.2 8B Instruct model through Granite Experiments, an IBM Research playground for testing open-source ideas. Further LoRA innovations, including a suite of adapters for Granite 3.3 Instruct, will be launched in the coming weeks.
Granite Speech 3.3 provides automated translation from English to languages that include French, Spanish, Italian, German, Portuguese, Japanese, and Mandarin Chinese.
Architecturally, Granite Speech 3.3 consists of the following components (a toy sketch follows the list):
- A speech encoder, comprising 10 conformer blocks trained with Connectionist Temporal Classification (CTC) on ASR-focused datasets.
- A speech projector, in this case a 2-layer query transformer (Q-Former), that projects audio embeddings into a space where they can be interpreted by an LLM.
- The Granite 3.3 8B Instruct LLM with 128K context length.
- LoRA adapters, applied to the LLM's query and value projection matrices when audio data is present.
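The data flow from audio features to LLM-ready embeddings can be illustrated with a toy PyTorch sketch. Every module and dimension below is a placeholder rather than IBM's actual hyperparameters: a single linear layer stands in for the conformer stack, and one cross-attention layer for the two-layer Q-Former. Only the shape of the pipeline mirrors the description above.

```python
# Shape-level sketch of the Granite Speech 3.3 pipeline (all sizes illustrative).
import torch
import torch.nn as nn

class ToySpeechPipeline(nn.Module):
    def __init__(self, n_mels=80, d_audio=512, d_llm=4096, n_queries=64):
        super().__init__()
        # Stand-in for the 10-block conformer encoder trained with CTC.
        self.encoder = nn.Sequential(nn.Linear(n_mels, d_audio), nn.GELU())
        # Stand-in for the Q-Former projector: learned queries cross-attend
        # over audio embeddings and emit a fixed number of LLM-space tokens.
        self.queries = nn.Parameter(torch.randn(n_queries, d_audio))
        self.attn = nn.MultiheadAttention(d_audio, num_heads=8, batch_first=True)
        self.to_llm = nn.Linear(d_audio, d_llm)

    def forward(self, features):                        # (batch, frames, n_mels)
        audio = self.encoder(features)                  # (batch, frames, d_audio)
        q = self.queries.expand(audio.size(0), -1, -1)  # (batch, n_queries, d_audio)
        pooled, _ = self.attn(q, audio, audio)          # attend over audio frames
        return self.to_llm(pooled)    # soft prompts for the LoRA-adapted LLM

embeds = ToySpeechPipeline()(torch.randn(1, 3000, 80))
print(embeds.shape)  # torch.Size([1, 64, 4096])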
In contrast to directly integrated models that combine speech and text in a single pass, Granite Speech 3.3 uses a two-pass design: asking the model questions about an audio file requires an initial call to transcribe the audio and a second prompt to query the model about that transcribed text. If a prompt contains the model's audio placeholder token and a corresponding .wav file, Granite Speech will engage the audio encoder, projector, and LoRA adapter. If not, the model simply runs in text mode as Granite 3.3 8B Instruct.
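In code, the two passes look roughly like the following, patterned on the usage shown on the model's Hugging Face card; the audio placeholder token and exact processor call signature are assumptions to verify against that card.

```python
# Minimal two-pass sketch: transcribe, then query the transcript in text mode.
import torch
import torchaudio
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_id = "ibm-granite/granite-speech-3.3-8b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = processor.tokenizer

wav, sr = torchaudio.load("meeting.wav")  # mono 16 kHz input expected

# Pass 1: an audio placeholder token in the prompt engages the speech path.
chat = [{"role": "user", "content": "<|audio|>Transcribe this recording."}]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, wav, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=512)
transcript = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Pass 2: no audio token, so the model behaves as plain Granite 3.3 8B Instruct.
chat = [{"role": "user", "content": f"Summarize this transcript:\n{transcript}"}]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
text_inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**text_inputs, max_new_tokens=256)
print(tokenizer.decode(out[0, text_inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```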
This two-pass approach ensures that Granite Speech 3.3 8B's performance on text queries mirrors that of its underlying Granite 3.3 8B Instruct LLM.
Granite Speech 3.3 can accept inputs of arbitrary length.
The latest versions of IBM's text-only models, Granite 3.3 8B Instruct and Granite 3.3 2B Instruct, add FIM capabilities. IBM is also releasing their base model counterparts, Granite 3.3 8B Base and Granite 3.3 2B Base, for developers to fine-tune for their own projects.
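FIM lets a model complete a missing span given both the text before and after it, which is useful for editor-style code infilling. Below is a minimal sketch; the <fim_prefix>/<fim_suffix>/<fim_middle> control tokens are an assumption carried over from IBM's Granite Code models, so verify the exact tokens against the Granite 3.3 model cards.

```python
# Sketch of a fill-in-the-middle prompt against a Granite 3.3 base model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-3.3-8b-base"  # base models suit raw FIM prompts
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prefix = "def mean(xs):\n    total = "
suffix = "\n    return total / len(xs)\n"
# Assumed token convention; confirm against the model's special-token list.
prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=16)
# The generated span is the "middle" that completes the function body.
print(tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```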
IBM says its focus for Granite 3.2 was enriching the Instruct models' ability to reason through complex instructions via Thought Preference Optimization (TPO), without sacrificing general performance. Built on an updated Granite 3.3 base model and fine-tuned through multi-stage reinforcement learning using TPO and Group Relative Policy Optimization (GRPO), both Granite 3.3 Instruct models demonstrated significant improvement on benchmarks conventionally associated with reasoning, according to IBM.
As with the Granite 3.2 Instruct models, thinking can be toggled on and off, allowing developers to prioritize enhanced chain-of-thought (CoT) reasoning when they need it and prioritize cost-efficiency and low latency when they don't.
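The toggle is exposed through the chat template. The sketch below assumes the "thinking" flag documented on the Granite Instruct model cards; treat the exact kwarg name as an assumption to verify.

```python
# Sketch of toggling chain-of-thought reasoning on and off per request.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-3.3-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

chat = [{"role": "user", "content": "A bat and a ball cost $1.10 in total..."}]

# Thinking on: the template instructs the model to emit explicit reasoning
# before its final answer (kwarg name per the Granite model cards).
slow = tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True, thinking=True)

# Thinking off: skip the chain of thought for lower latency and cost.
fast = tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True)

out = model.generate(**tokenizer(slow, return_tensors="pt"), max_new_tokens=512)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```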
To enhance existing Granite-based applications and inform development of the next generation of performance-enhancing LoRA adapters, IBM is also releasing a collection of five (mostly) RAG-specific LoRA adapters for Granite 3.2 8B Instruct through Granite Experiments. Each of these LoRA adapters leverages the model's intrinsic knowledge to enable a specific task, such as rewriting retrieval queries or detecting hallucinations.
IBM Research developed these conventional LoRA adapters alongside counterparts for each that use a new kind of low-rank adaptation IBM calls activated LoRAs (aLoRAs). These aLoRAs reuse key-value (KV) caches, avoiding the need to recompute the context (or prefill). Activated LoRAs match the generation quality of standard LoRAs while providing significant runtime and compute advantages, IBM contends.
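The saving is easiest to see in terms of ordinary KV caching. The sketch below is a conceptual stand-in, not the aLoRA API: it shows plain prefix-cache reuse with the base model in Hugging Face transformers, which is the reuse an aLoRA preserves because its adapter weights apply only to tokens generated after its invocation sequence. A conventional LoRA alters every layer's projections, so loading it would invalidate the cached prefill and force a full recompute.

```python
# Conceptual stand-in for aLoRA prefill reuse: pay for the long RAG
# context once, then run several follow-up tasks from the cached state.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-3.2-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

context = tokenizer("<long retrieved documents + generated answer>", return_tensors="pt")
with torch.no_grad():
    prefill = model(**context, use_cache=True).past_key_values  # prefill once

# Each follow-up task restarts from a copy of the cached prefill instead of
# re-encoding the context; adapters that only touch newly generated tokens
# (the aLoRA idea) keep this cache valid, unlike a standard LoRA.
for task in ["Is the answer faithful to the documents?", "Cite your sources."]:
    task_ids = tokenizer(task, return_tensors="pt")["input_ids"]
    ids = torch.cat([context["input_ids"], task_ids], dim=-1)
    out = model.generate(ids, past_key_values=copy.deepcopy(prefill), max_new_tokens=64)
```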
When equipped with the RAG Hallucination Detection LoRA, the model will provide a faithfulness score between 0 and 1 (in increments of 0.1), indicating how closely its output reflects the information contained within the retrieved documents.
With the Query Rewrite LoRA equipped, the model will automatically rewrite non-standalone user queries into fully self-contained queries. A standalone first query passes through as is, while a follow-up that leans on earlier turns is rewritten: for example, "Does it support speech?" asked after a question about Granite 3.3 would become "Does Granite 3.3 support speech?". In testing, this rewriting increased the relevance of model responses by as much as 21 percent.
When equipped with the RAG Citation Generation LoRA, the model will generate a citation for each sentence of its output (if that sentence was informed by external sources). Each sentence-level citation not only notes the sources referenced but also contains a set of sentences from the cited sources that support the model's corresponding output sentence.
When equipped with the RAG Answerability Prediction LoRA, the model will determine whether the user's query can be answered using the information available in connected documents.
For each model output, the Uncertainty LoRA draws on AI model calibration research to generate a quantized certainty score ranging from 0 to 9, representing 5 percent to 95 percent certainty in steps of 10 percent (a score of k corresponds to roughly 10k + 5 percent). The score essentially reflects the extent to which the model's response is supported by information contained within its training data.
IBM has also proposed using these LoRAs in workflows that leverage multiple adapters across multiple inference calls. Users can first implement Query Rewrite to quickly rewrite initial prompts for optimal retriever accuracy. Once the model's retrieval-augmented response has been generated using the rewritten prompt, they might then implement RAG Hallucination Detection to verify an appropriate level of faithfulness to the information in the retrieved documents. If the faithfulness score falls beneath an acceptable threshold, the workflow could direct the model to resample the response until the faithfulness score exceeds that threshold. Once hallucinations are no longer detected, users could then engage RAG Citation Generation for the final response provided to the user, as sketched below.
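A compact sketch of that flow, with hypothetical helper functions (rewrite_query, generate_answer, faithfulness_score, add_citations) standing in for inference calls made with the respective LoRA adapters loaded:

```python
# Hypothetical multi-adapter RAG workflow; helper names are illustrative,
# not part of IBM's released tooling.
def answer_with_guardrails(query: str, retriever, threshold: float = 0.8,
                           max_tries: int = 3) -> str:
    standalone = rewrite_query(query)               # Query Rewrite LoRA
    docs = retriever(standalone)                    # retrieval on the rewritten query
    for _ in range(max_tries):
        answer = generate_answer(standalone, docs)  # retrieval-augmented generation
        score = faithfulness_score(answer, docs)    # Hallucination Detection LoRA, 0.0-1.0
        if score >= threshold:                      # resample until grounded enough
            return add_citations(answer, docs)      # Citation Generation LoRA
    return "No sufficiently grounded answer could be produced."
```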