Voice-AI
One of KISSKI’s standout offerings is Voice-AI, its AI-based transcription and captioning service. Running on High-Performance Computing (HPC) infrastructure, Voice-AI uses the Whisper (large-v2) model to transcribe audio and generate video captions quickly. Whisper was trained on 680,000 hours of labeled data and approaches the performance of professional human transcribers, offering reliable automatic speech recognition (ASR) and speech translation across a range of datasets and domains. Users can choose between transcription and translation to suit their needs, and this KISSKI service will be available free of charge.
You need an Academic Cloud account to access the AI Services. Use the federated login or create a new account. Details are on this page.
Service Components
This service is composed of two main parts:
- Handling Uploaded Audio: Processes audio files uploaded by users (<500 MB).
- Handling Streaming Audio: Captures and processes streaming audio from the browser (this part will be available in the future).
Audio File Transcription/Translation Service:
If you have an Academic Cloud account, you can reach the web interface directly here.
The platform offers intuitive, built-in features designed for seamless audio processing:
- Input language: Choose the language of the uploaded audio for transcription.
- Text format: Choose the format of the output, which can be text, SRT, or VTT.
- Choose file: Upload your audio file, for example in wav, mp3, or mp4 format.
- Delete Output: Instantly and permanently remove the transcription result.
- Light/Dark mode (sun/moon button): Toggle between light and dark mode.
- Footer: Includes “Privacy Policy”, “Terms of use”, “FAQ”, Contact, and the option to switch between English and German.
Core Capabilities
- English audio transcription with attached timestamps (depending on the chosen output format).
- Non-English audio transcription, supporting multiple languages, including German.
- Audio translation from various languages to English.
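To illustrate how the timestamped output formats differ, here is a minimal sketch that renders the same segments as SRT and as VTT. It is not the service's own code: the function names and the (start, end, text) segment tuples are hypothetical and chosen only for illustration. The format details themselves are standard: SRT numbers each cue and uses a comma before the milliseconds, while VTT starts with a WEBVTT header and uses a period.

```python
def format_timestamp(seconds: float, separator: str) -> str:
    """Render a time in seconds as HH:MM:SS<separator>mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{ms:03d}"


def to_srt(segments) -> str:
    """SRT: numbered cues, comma before the milliseconds."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{format_timestamp(start, ',')} --> {format_timestamp(end, ',')}"
        blocks.append(f"{index}\n{timing}\n{text}")
    return "\n\n".join(blocks) + "\n"


def to_vtt(segments) -> str:
    """VTT: 'WEBVTT' header, period before the milliseconds."""
    blocks = ["WEBVTT"]
    for start, end, text in segments:
        timing = f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}"
        blocks.append(f"{timing}\n{text}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Hypothetical segments: (start in seconds, end in seconds, text)
    example = [(0.0, 2.5, "Welcome to the lecture."), (2.5, 6.0, "Let us begin.")]
    print(to_srt(example))
    print(to_vtt(example))
```

The plain-text output format carries no timing information, so it has no counterpart in this sketch.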
How to Use the Service
- Choose the input audio language.
- Select the output format (Normal text/SRT/VTT).
- Upload audio files in various formats such as mp3, mp4, flac, or wav.
- Choose the action: transcription or translation.
- The output can be downloaded when the transcription is ready.
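Before uploading, it can help to check a file locally against the limits mentioned on this page. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the extension list is taken from the formats named above and may not be exhaustive, and 1 MB is assumed to mean 10^6 bytes. The service may apply its own, different checks.

```python
from pathlib import Path

# Assumption: formats named on this page; the service may accept others.
ALLOWED_EXTENSIONS = {".mp3", ".mp4", ".flac", ".wav"}
# Assumption: 1 MB = 10^6 bytes; uploads must be smaller than 500 MB.
MAX_BYTES = 500 * 1000 * 1000


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    if size_bytes >= MAX_BYTES:
        problems.append(f"file too large: {size_bytes} bytes (limit is below {MAX_BYTES})")
    return problems


if __name__ == "__main__":
    # Hypothetical file names and sizes, used only to demonstrate the check.
    for name, size in [("lecture.mp3", 42_000_000), ("notes.txt", 1_000)]:
        issues = check_upload(name, size)
        print(name, "OK" if not issues else issues)
```

For a real file, the size can be read with `Path(filename).stat().st_size` and passed as `size_bytes`.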
Streaming Audio Transcription Service (This part will be available in the future):
This browser-based tool provides real-time transcription or English translation during meetings and lectures, enhancing clarity, accessibility, and engagement—especially in noisy or multilingual environments. It supports deaf and hard-of-hearing participants, language learners, and anyone needing better note-taking or content review. No installation is required, and it works in any browser. Transcriptions and summaries are written to a shared Etherpad Lite URL generated at session start, enabling collaborative editing and review. Etherpad Lite is an open-source editor that allows multiple users to work on the same document simultaneously, making communication more inclusive and efficient.
Built-in features include:
- Start Session: Begins transcription and generates a pad URL.
- Stop Session: Ends the current session.
- Mode: Choose between transcription or translation to English.
- Spoken Language: Defaults to auto-detect, or manually select a language.
- Subtitle Overlay: Opens a detached window to display subtitles over any webpage.
- Finalize & Summarize: Generates a summary directly in the pad.
- Light/Dark Mode: Toggle between light and dark themes (sun/moon icon).
- Footer: Includes links to Privacy Policy, Terms of Use, Imprint, FAQ, Help, Contact, and language switch (English/German).
Ensuring Privacy and Flexibility
We prioritize security to ensure reliability, regulatory compliance, and business continuity. User privacy is central to our design. Audio and conversation inputs are deleted immediately after transcription or translation.
Exceptions:
- Audio transcription results: stored on our data mover node and erased after 30 days. Voice-AI outputs can also be deleted instantly and permanently via a dedicated delete button.
- Live transcription results (future feature): stored in MySQL, auto-deleted after 24 hours.
- Usage Logging: We record request counts and timestamps per user for system monitoring and accounting.
Acknowledgement
Thanks to Jakob Hördt for writing the proxy, Marcel Hellkamp for writing the bbb audio captioning code, Ali Doost Hosseini for the Kong gateway, and Johannes Biermann for technical support.
Author
Narges Lux
Further services
If you have questions, please browse the FAQ first. If you have more specific questions, feel free to contact us at support@gwdg.de.