Voice-AI

One of KISSKI’s standout offerings is its AI-based transcription and captioning service, Voice-AI. Running on High-Performance Computing (HPC) infrastructure, Voice-AI uses the Whisper (large-v2) model to transcribe audio and generate video captions quickly. Trained on 680,000 hours of labeled data, Whisper rivals professional human transcribers in performance, offering reliable automatic speech recognition (ASR) and speech translation across various datasets and domains. Users can choose between transcription and translation to suit their needs, and this KISSKI service will be available free of charge.

Service Components

This service is composed of two main parts:

Handling Uploaded Audio: Processes audio files uploaded by users (files must be smaller than 1500 MB).
Handling Streaming Audio from BBB: Captures and processes streaming audio from BigBlueButton (BBB). This part will be available in the future.

Key Features

Uploaded Audio Transcription/Translation Service:

English audio transcription with attached timestamps (depending on the chosen output format).
Non-English audio transcription, supporting multiple languages, including German.
Audio translation from various languages to English.
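
The difference between the transcription and translation tasks can be illustrated with the open-source openai-whisper Python package. This is only a sketch of how Whisper itself is commonly called, not the Voice-AI service code. The helper names build_options and run_whisper are hypothetical, and running run_whisper assumes the openai-whisper package and the large-v2 model weights are available locally.

```python
from typing import Optional


def build_options(task: str, language: Optional[str] = None) -> dict:
    """Build keyword arguments for Whisper's transcribe() call.

    task: "transcribe" keeps the spoken language, "translate" outputs English.
    language: a language code such as "de"; None lets Whisper detect it.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")
    options = {"task": task}
    if language is not None:
        options["language"] = language
    return options


def run_whisper(audio_path: str, task: str, language: Optional[str] = None) -> dict:
    """Run Whisper large-v2 on one file (hypothetical helper, not service code)."""
    import whisper  # imported lazily so build_options works without the package

    model = whisper.load_model("large-v2")
    # The result dictionary holds the full "text" and timestamped "segments".
    return model.transcribe(audio_path, **build_options(task, language))
```

Under these assumptions, calling run_whisper("lecture.mp3", "translate", "de") would produce English text from German speech, while task="transcribe" would keep the text in German.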

For audio file transcription or translation, users need to:

Choose the input audio language.
Select the output format (Normal text/SRT/VTT).
Upload audio files in formats such as mp3, mp4, or wav.
Choose between transcription or translation action.
The output can be downloaded when the transcription is ready.
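
The choices above can be pictured as a small request check. The following is a minimal sketch, not the service's actual validation: the UploadRequest type, the validate_request function, and the constant names are all hypothetical. It also assumes that 1 MB means 1024 * 1024 bytes and that mp3, mp4, and wav are the only accepted file types, whereas the service may accept others.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List

MAX_UPLOAD_BYTES = 1500 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
ALLOWED_EXTENSIONS = {".mp3", ".mp4", ".wav"}  # assumption: treated as exhaustive
ALLOWED_FORMATS = {"text", "srt", "vtt"}
ALLOWED_ACTIONS = {"transcribe", "translate"}


@dataclass
class UploadRequest:
    """Hypothetical container for the values a user selects in the form."""
    filename: str
    size_bytes: int
    language: str
    output_format: str
    action: str


def validate_request(req: UploadRequest) -> List[str]:
    """Return a list of problems; an empty list means the request looks valid."""
    problems = []
    if Path(req.filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        problems.append("unsupported file type")
    if req.size_bytes >= MAX_UPLOAD_BYTES:
        problems.append("file must be smaller than 1500 MB")
    if not req.language:
        problems.append("input language is required")
    if req.output_format.lower() not in ALLOWED_FORMATS:
        problems.append("unsupported output format")
    if req.action not in ALLOWED_ACTIONS:
        problems.append("unsupported action")
    return problems
```

Collecting all problems in a list, rather than stopping at the first one, lets a form report every issue to the user in a single pass.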

Streaming Audio Transcription Service (This part will be available in the future):

Provides transcription in the automatically detected language in BBB meetings.
Enhances communication and inclusivity in meetings, especially in noisy environments.
Makes hybrid lectures more accessible, supports individuals who are deaf or hard of hearing, assists in learning new languages, facilitates note-taking and content review, and keeps participants attentive and engaged.

This service aims to improve accessibility and communication, making meetings, lectures, and other interactions more efficient and inclusive. Because most meetings are held in BigBlueButton (BBB), the Voice-AI service is integrated into these meetings on demand: it captions the meeting audio and provides the captions in the shared notes.

The BBB service provided by GWDG is a high-capacity video conferencing tool that is also suitable for e-learning scenarios. It runs on HTML5 in the browser and requires no software installation, which makes it well suited to collaborative work, data exchange, and live-streaming.

To transcribe a BBB meeting, users need to:

Provide the BBB room address.
Enter the access key if there is any.
Optionally, add terms or words that need to be written correctly in the summary output.
The meeting transcript is written to an Etherpad Lite pad whose URL is created when the transcription starts. The meeting summary appears both in that pad and in the summary box.

Etherpad Lite is an open-source, real-time collaborative editor that allows multiple users to edit documents simultaneously in their browsers.

Ensuring Privacy and Flexibility

Security is essential when dealing with highly important and confidential information, as it provides reliability, demonstrates compliance during audits or regulatory inspections, and ensures business continuity in the event of a system failure or attack. User privacy is a cornerstone of this service. For this reason, the service does not keep your audio or conversation input on persistent storage: inputs are deleted immediately after the transcription or translation is done. The only data kept are BBB transcriptions, which are stored in a local MySQL database (future service), and audio file transcription results, which are stored on our data mover node. Both are erased after 30 days. The number of requests per user for each service and the corresponding timestamps are recorded so that we can monitor system usage and perform accounting.

Web interface and usage

If you have an AcademicCloud account, the web interface can also easily be reached here.

The web interface provides the following fields and controls. The input fields must be filled in before the service can run:

Audio File Transcription:

Input language: Choose the language of the uploaded audio for transcription.
Text format: Choose the format of the output, which can be text, SRT, or VTT.
Choose file: Upload your audio file, which can be wav, mp3, or mp4.
Light/Dark mode (sun/moon button): Toggle between light and dark mode.
Footer: Includes “Privacy Policy”, “Terms of use”, “FAQ”, Contact, and the option to switch between English and German.
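
The three text formats differ mainly in how timestamps are attached to the text. The sketch below renders the same timestamped segments as plain text, SRT, and VTT. It assumes segments arrive as (start seconds, end seconds, text) tuples. The function names are hypothetical and this is not the service's own formatter.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start seconds, end seconds, text)


def format_timestamp(seconds: float, separator: str) -> str:
    """Format seconds as HH:MM:SS<sep>mmm (',' for SRT, '.' for VTT)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{millis:03d}"


def to_text(segments: List[Segment]) -> str:
    """Plain text: segment texts joined by spaces, no timestamps."""
    return " ".join(text.strip() for _, _, text in segments)


def to_srt(segments: List[Segment]) -> str:
    """SRT: numbered cues, comma before the milliseconds, blank line between cues."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start, ',')} --> {format_timestamp(end, ',')}\n"
            f"{text.strip()}\n"
        )
    return "\n".join(blocks)


def to_vtt(segments: List[Segment]) -> str:
    """VTT: 'WEBVTT' header, dot before the milliseconds, blank line between cues."""
    cues = ["WEBVTT\n"]
    for start, end, text in segments:
        cues.append(
            f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}\n"
            f"{text.strip()}\n"
        )
    return "\n".join(cues)
```

SRT and VTT are the formats to choose when the output is meant to be loaded as captions alongside a video, since only they carry timing information.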

(Future service)

Streaming Audio Transcription:

Room address: Add your BBB room address.
Access key: Add your access key or moderation key if available.
Correction: Add words or terms that are likely to be transcribed incorrectly; these will be written correctly in the transcription summary.
Start transcription: Upon starting transcription, you will receive a notification in the BBB room and also get a pad URL in the summary box.
Stop transcription: Upon clicking this, you will receive the meeting summary.
Light/Dark mode (sun/moon button): Toggle between light and dark mode.
Footer: Includes “Privacy Policy”, “Terms of use”, “FAQ”, Contact, and the option to switch between English and German.

Acknowledgement

Jakob Hördt for writing the proxy.

Marcel Hellkamp for writing the BBB audio captioning code.

Ali Doost Hosseini for the Kong gateway.

Johannes Biermann for technical support.

Author

Narges Lux

Further services

If you have questions, please browse the FAQ first. If you have more specific questions, feel free to contact us at support@gwdg.de.