Input & Output Modalities

Modalities allow your experts to interact through multiple channels beyond text. Configure voice input (speech-to-text) and audio output (text-to-speech) to create rich, multimodal conversational experiences.

Overview

B-Bot Hub supports:
  • Input Modalities: How users communicate with your expert (voice, text, files)
  • Output Modalities: How your expert responds (text, voice, images)

Accessing Modality Settings

Configure modalities when:
  • Creating an expert (Step 5: Models)
  • Creating an assistant (Model configuration step)
  • In chat settings (during conversations)

(Screenshot: Modalities Configuration)

Input Modalities

Voice Input (Speech-to-Text)

Enable voice input to allow users to speak to your expert instead of typing.
1. Enable Voice Input

Click Configure Input Modalities in the model selection step
2. Select Provider

Choose your speech-to-text provider:
  • Browser Native: Uses the browser’s built-in speech recognition (free)
  • OpenAI Whisper: High-accuracy transcription
  • Google Speech-to-Text: Multi-language support
  • Azure Speech: Enterprise-grade recognition
3. Configure Settings

  • Language: Select primary language for recognition
  • Continuous: Enable continuous listening mode
  • Interim Results: Show transcription as user speaks
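
If you pick the Browser Native provider, these three settings map directly onto the Web Speech API. A minimal sketch, assuming a Chromium browser (which exposes the interface as webkitSpeechRecognition):

```typescript
// Minimal sketch: browser-native speech recognition via the Web Speech API.
// Chromium browsers expose the interface as webkitSpeechRecognition.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionImpl();
recognition.lang = "en-US";          // Language: primary recognition language
recognition.continuous = true;       // Continuous: keep listening across pauses
recognition.interimResults = true;   // Interim Results: emit partial transcripts

recognition.onresult = (event: any) => {
  const result = event.results[event.results.length - 1];
  console.log(result.isFinal ? "final:" : "interim:", result[0].transcript);
};

recognition.start();
```
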
4. Test

Use the test button to verify voice input is working

Supported Input Types

  • Voice: Real-time voice recording and transcription
  • Files: Upload documents, images, and audio files
  • Text: Traditional text input (always available)

Output Modalities

Text-to-Speech (Audio Output)

Enable audio output to have your expert speak responses aloud.
1. Open Output Modalities

Click Configure Output Modalities in the model settings
2. Choose TTS Provider

Select from available providers:
  • OpenAI TTS: Natural-sounding voices with emotion
  • ElevenLabs: Ultra-realistic voice synthesis
  • Google TTS: WaveNet voices, multi-language
  • Azure Speech: Neural voices with customization
  • Browser Native: Built-in browser synthesis (free, basic)
3. Select Voice

Each provider offers different voices:
  • OpenAI: Alloy, Echo, Fable, Onyx, Nova, Shimmer
  • ElevenLabs: 100+ premium voices
  • Google: Standard and WaveNet voices
  • Azure: Neural voices in 100+ languages
4. Configure Playback

  • Auto-play: Automatically play audio when response completes
  • Streaming TTS: Stream audio as text generates (where supported)
  • Speed: Adjust playback speed (0.5x to 2.0x)
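
For the browser-native provider, the Speed setting corresponds to the rate property of the SpeechSynthesis API. A minimal sketch of buffered playback:

```typescript
// Minimal sketch: browser-native playback using the SpeechSynthesis API.
// rate maps to the Speed setting (0.5x to 2.0x in the UI; the API allows more).
function speak(text: string, speed = 1.0): void {
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.rate = Math.min(Math.max(speed, 0.5), 2.0); // clamp to the UI range
  window.speechSynthesis.speak(utterance);
}

// With Auto-play enabled, the app would call speak() as soon as the
// response completes; otherwise it waits for the user to press play.
speak("Hello! Here is your answer.", 1.25);
```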

Voice Configuration

OpenAI TTS

Models:
  • tts-1: Standard quality, fast
  • tts-1-hd: High definition, slower
Voices:
  • Alloy: Neutral, balanced
  • Echo: Male, clear
  • Fable: British accent, expressive
  • Onyx: Deep, authoritative
  • Nova: Female, energetic
  • Shimmer: Soft, warm
Features:
  • Real-time streaming
  • Multiple languages
  • Emotion in voice
  • Fast generation
Best for: General use, conversational AI
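
For reference, a raw request to OpenAI's /v1/audio/speech endpoint with these model and voice options looks roughly like this (sketch only; error handling omitted):

```typescript
// Minimal sketch: generate speech with OpenAI's /v1/audio/speech endpoint.
async function openaiTts(text: string, apiKey: string): Promise<Blob> {
  const response = await fetch("https://api.openai.com/v1/audio/speech", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "tts-1",   // or "tts-1-hd" for higher quality
      voice: "nova",    // alloy | echo | fable | onyx | nova | shimmer
      input: text,
    }),
  });
  return response.blob(); // audio/mpeg by default
}
```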

Advanced Configuration

API Key Selection

You can use a different API key for TTS than for your main model. For example:
  • Main Model: GPT-4 (production OpenAI key)
  • Voice Output: ElevenLabs (personal ElevenLabs key)
This allows you to:
  • Separate billing for different services
  • Use specialized accounts for voice
  • Manage rate limits independently
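
A hypothetical configuration shape illustrating the split; the actual B-Bot Hub settings schema may differ:

```typescript
// Hypothetical configuration shape (illustrative only): one key for the
// main model, a separate key for the voice provider.
const expertConfig = {
  model: {
    provider: "openai",
    name: "gpt-4",
    apiKey: process.env.OPENAI_PROD_KEY, // production key, main billing account
  },
  outputModalities: {
    tts: {
      provider: "elevenlabs",
      apiKey: process.env.ELEVENLABS_PERSONAL_KEY, // separate billing and rate limits
    },
  },
};
```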

Streaming TTS

Streaming TTS generates and plays audio as the text is being generated, rather than waiting for the complete response.
Benefits:
  • Faster time-to-first-audio
  • More natural conversation flow
  • Better user experience
  • Reduced perceived latency
Streaming TTS is currently supported by:
  • ✅ OpenAI TTS
  • ✅ ElevenLabs (with turbo models)
  • ⚠️ Google TTS (partial support)
  • ❌ Azure Speech (coming soon)
  • ❌ Browser Native (not supported)

To enable streaming TTS:
  1. Open the Output Modalities configuration
  2. Select a supported provider
  3. Toggle “Stream audio as text generates”
  4. Save configuration
The audio will now start playing before the full response is complete.
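
Under the hood, streaming playback amounts to appending audio chunks to a buffer while the download is still in progress. A minimal sketch, assuming OpenAI's speech endpoint and a Chromium browser whose MediaSource implementation accepts audio/mpeg:

```typescript
// Minimal sketch: start playback while the TTS audio is still downloading.
async function streamSpeech(text: string, apiKey: string): Promise<void> {
  const mediaSource = new MediaSource();
  const audio = new Audio(URL.createObjectURL(mediaSource));

  mediaSource.addEventListener("sourceopen", async () => {
    const buffer = mediaSource.addSourceBuffer("audio/mpeg");
    const waitIdle = () =>
      new Promise<void>((resolve) =>
        buffer.updating
          ? buffer.addEventListener("updateend", () => resolve(), { once: true })
          : resolve()
      );

    const response = await fetch("https://api.openai.com/v1/audio/speech", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model: "tts-1", voice: "alloy", input: text }),
    });

    const reader = response.body!.getReader();
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      await waitIdle();            // appendBuffer throws while an append is pending
      buffer.appendBuffer(value);
    }
    await waitIdle();
    mediaSource.endOfStream();
  });

  await audio.play(); // may require a prior user gesture under autoplay policies
}
```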

Auto-Play Settings

When auto-play is enabled, audio plays automatically for every response.
Best for:
  • Voice-first applications
  • Accessibility features
  • Hands-free use cases
  • Customer service bots

Using Voice in Chat

Voice Input

1. Enable Voice Mode

Click the microphone icon in the chat input area to enable voice mode
2. Hold to Record

Press and hold the microphone button while speaking
3. Release to Send

Release the button when done. Your speech will be transcribed and sent automatically.
Tips for better recognition:
  • Speak clearly and at a normal pace
  • Minimize background noise
  • Use a good microphone
  • Wait for the transcription to complete
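
In browser terms, hold-to-record maps naturally onto the MediaRecorder API. A minimal sketch; the transcribe() helper is hypothetical and stands in for whichever STT provider you configured:

```typescript
// Minimal sketch of hold-to-record: capture audio while the mic button is
// held, then hand the clip to a transcription call on release.
async function setupPushToTalk(button: HTMLButtonElement): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks: Blob[] = [];

  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.onstop = async () => {
    const clip = new Blob(chunks, { type: recorder.mimeType });
    chunks.length = 0;
    await transcribe(clip); // hypothetical: send to the configured STT provider
  };

  button.addEventListener("mousedown", () => recorder.start()); // hold to record
  button.addEventListener("mouseup", () => recorder.stop());    // release to send
}

declare function transcribe(clip: Blob): Promise<string>; // assumed elsewhere
```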

Audio Output

When TTS is enabled:
  1. Expert’s text response appears as normal
  2. Audio player appears below the message
  3. If auto-play is on, audio starts automatically
  4. Controls available: play/pause, speed, volume
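
These controls map onto standard HTMLAudioElement properties, so a sketch of the player logic is short (the clip URL is hypothetical):

```typescript
// Minimal sketch: chat playback controls via standard HTMLAudioElement
// properties; no provider-specific player is needed.
const audioUrl = "/audio/response.mp3"; // hypothetical URL of the TTS clip
const player = new Audio(audioUrl);
player.playbackRate = 1.5; // Speed (the UI exposes 0.5x to 2.0x)
player.volume = 0.8;       // Volume (0.0 to 1.0)
player.play();             // pair with player.pause() for the toggle
```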

Multimodal Content

Your experts can handle multiple content types in a single message:

  • Voice + Text: User speaks a question; the expert responds with text and audio
  • Image + Voice: User uploads an image and asks about it via voice
  • File + Text: User uploads a document and types a question
  • Mixed Media: Combine any input types in a single interaction
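
As an illustration of what a mixed-media message can carry, here is a user turn in OpenAI's chat-completions content-part format; B-Bot Hub's internal message schema may differ:

```typescript
// Illustrative only: one user message combining transcribed text and an image.
const message = {
  role: "user",
  content: [
    { type: "text", text: "What is shown in this image?" }, // typed or transcribed
    {
      type: "image_url",
      image_url: { url: "https://example.com/chart.png" }, // hypothetical upload URL
    },
  ],
};
```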

Best Practices

Voice Selection

Choose voices that align with your expert’s role:
  • Professional: Clear, authoritative (Onyx, Echo)
  • Friendly: Warm, approachable (Nova, Shimmer)
  • Technical: Neutral, precise (Alloy)
  • Customer Service: Empathetic, patient (Fable)
Match the provider to your use case:
  • Global: Use multi-language TTS providers
  • Accessibility: Enable voice input and output by default
  • Professional: Use high-quality voices (ElevenLabs, Azure Neural)
  • Cost-Conscious: Browser native or OpenAI standard
Voice performance varies by platform:
  • Desktop browsers have better native support
  • Mobile may have data usage considerations
  • Test auto-play on mobile (it may be blocked)
  • Consider bandwidth limitations

Performance Optimization

Speed vs Quality

Fast (tts-1, turbo):
  • Lower latency
  • Good for chat
  • Less compute intensive
High Quality (tts-1-hd, neural):
  • Better sound
  • More natural
  • Slightly slower

Streaming vs Buffered

Streaming:
  • Faster start
  • Better UX
  • More complex
Buffered:
  • Complete audio
  • Simpler
  • Small delay

Troubleshooting

Voice input not working

Check:
  1. Browser permissions granted?
  2. Microphone connected and working?
  3. Try browser native option first
  4. Check browser console for errors
Common fixes:
  • Reload page and grant permissions
  • Check system microphone settings
  • Try a different browser
  • Use HTTPS (required for mic access)
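
If permissions are the suspect, Chromium browsers let you inspect the microphone permission state programmatically (the "microphone" name needs a cast in TypeScript):

```typescript
// Quick diagnostic: inspect the microphone permission state.
async function checkMicPermission(): Promise<void> {
  const status = await navigator.permissions.query({
    name: "microphone" as PermissionName,
  });
  console.log(`Microphone permission: ${status.state}`); // granted | denied | prompt
}
```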

No audio output

Check:
  1. Volume not muted?
  2. Auto-play enabled?
  3. Provider API key valid?
  4. Browser allows audio playback?
Common fixes:
  • Click play button manually
  • Check provider key in settings
  • Try different TTS provider
  • Check browser audio settings

Poor audio quality

Solutions:
  • Switch to HD model (tts-1-hd)
  • Try ElevenLabs for premium quality
  • Use neural voices (Azure, Google)
  • Check internet connection speed
  • Reduce playback speed if garbled

High TTS costs

Cost-saving tips:
  • Use browser native for testing
  • Choose standard models over HD
  • Disable auto-play (user controlled)
  • Use OpenAI over ElevenLabs for lower cost
  • Monitor usage in provider dashboard

Cost Comparison

| Provider | Quality | Speed | Cost (per 1M chars) | Best For |
| --- | --- | --- | --- | --- |
| Browser | ⭐⭐ | ⚡⚡⚡ | Free | Testing, demos |
| OpenAI tts-1 | ⭐⭐⭐ | ⚡⚡⚡ | $15 | General use |
| OpenAI tts-1-hd | ⭐⭐⭐⭐ | ⚡⚡ | $30 | High quality |
| ElevenLabs | ⭐⭐⭐⭐⭐ | ⚡⚡ | $30–120 | Premium |
| Google Standard | ⭐⭐⭐ | ⚡⚡ | $4 | Budget |
| Google WaveNet | ⭐⭐⭐⭐ | ⚡⚡ | $16 | Value |
| Azure Neural | ⭐⭐⭐⭐ | ⚡⚡ | $15 | Enterprise |
Prices are approximate and may vary. Check provider websites for current pricing.
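
To turn the table into a monthly estimate, divide your character volume by one million and multiply by the listed price; a quick helper:

```typescript
// Back-of-the-envelope cost estimate from the table above.
// Prices are per 1M characters and approximate.
function estimateTtsCost(chars: number, pricePerMillion: number): number {
  return (chars / 1_000_000) * pricePerMillion;
}

// e.g. 50k characters/month on OpenAI tts-1 at ~$15 per 1M chars:
console.log(estimateTtsCost(50_000, 15).toFixed(2)); // "0.75" dollars
```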

Next Steps

  • Provider Keys: Set up API keys for voice providers
  • Create Expert: Create an expert with voice capabilities
  • Chat Features: Learn about using voice in conversations
  • Custom Models: Configure custom voice models