Voice-AI

One of KISSKI’s standout offerings is Voice-AI, its AI-based transcription and captioning service. Running on High-Performance Computing (HPC) infrastructure, Voice-AI uses the Whisper (large-v3) or faster-whisper model to transcribe audio and generate video captions quickly. Whisper was trained on 680,000 hours of labeled data and offers reliable automatic speech recognition (ASR) and speech translation across various datasets and domains, with performance that rivals professional human transcribers. Users can choose between transcription and translation to suit their needs, and this KISSKI service will be available free of charge.

Service Components

This service is composed of two main parts:

  1. Handling Uploaded Audio: Processes audio files uploaded by users (both smaller and larger than 500 MB).
  2. Handling Streaming Audio from BBB: Captures and processes streaming audio from BigBlueButton (BBB).

Key Features

Uploaded Audio Transcription/Translation Service:

  • English audio transcription with attached timestamps (depending on the chosen output format).
  • Non-English audio transcription, supporting multiple languages, including German.
  • Audio translation from various languages to English.
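
The difference between the transcription and translation tasks can be illustrated with a minimal sketch. It assumes the open-source faster-whisper Python package, whose transcribe() method accepts a task and a language argument. The helper names run_task, load_model, and Line are hypothetical, and this is not necessarily how Voice-AI itself is implemented.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Line:
    start: float  # segment start in seconds
    end: float    # segment end in seconds
    text: str


def run_task(model: Any, audio_path: str, task: str = "transcribe",
             language: Optional[str] = None) -> List[Line]:
    """Run a Whisper task on one file.

    `model` is any object with a faster-whisper style transcribe() method.
    task="transcribe" keeps the spoken language; task="translate" outputs English.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")
    # faster-whisper returns (segments, info); segments are produced lazily.
    segments, _info = model.transcribe(audio_path, task=task, language=language)
    return [Line(s.start, s.end, s.text.strip()) for s in segments]


def load_model(size: str = "large-v3") -> Any:
    """Load a faster-whisper model (assumes `pip install faster-whisper`)."""
    # Imported lazily so run_task() can be used and tested without the dependency.
    from faster_whisper import WhisperModel
    return WhisperModel(size)
```

With the package installed, run_task(load_model(), "lecture.mp3", task="translate", language="de") would return English lines for a German recording; the file name is only a placeholder.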

For audio file transcription or translation, users need to:

  • Choose the input audio language.
  • Select the output format (Normal text/SRT/VTT).
  • Upload audio files in formats such as mp3, mp4, or wav.
  • Choose between transcription or translation action.

The output is displayed in the result box. If the audio file is larger than 500 MB, the result can be downloaded once the transcription is ready.
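
The choices above can be pictured as a small request check. The following is only a sketch: the class and function names are hypothetical, the extension list covers just the formats named above (the service may accept more), and treating exactly 500 MB as a small file and using decimal megabytes are assumptions.

```python
import os
from dataclasses import dataclass
from typing import List

MB = 1000 * 1000                      # assumption: decimal megabytes
LARGE_FILE_THRESHOLD = 500 * MB       # size limit mentioned in the text
OUTPUT_FORMATS = {"text", "srt", "vtt"}
ACTIONS = {"transcribe", "translate"}
AUDIO_EXTENSIONS = {".mp3", ".mp4", ".wav"}  # only the formats named above


@dataclass(frozen=True)
class UploadRequest:
    filename: str
    size_bytes: int
    language: str
    output_format: str
    action: str


def validate(req: UploadRequest) -> List[str]:
    """Return a list of problems; an empty list means the request is usable."""
    problems = []
    if os.path.splitext(req.filename)[1].lower() not in AUDIO_EXTENSIONS:
        problems.append("unsupported file type")
    if req.size_bytes <= 0:
        problems.append("empty file")
    if not req.language:
        problems.append("no input language chosen")
    if req.output_format not in OUTPUT_FORMATS:
        problems.append("unknown output format")
    if req.action not in ACTIONS:
        problems.append("unknown action")
    return problems


def delivery_mode(req: UploadRequest) -> str:
    """Large files are offered as a download, smaller ones go to the result box."""
    return "download" if req.size_bytes > LARGE_FILE_THRESHOLD else "result_box"
```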

Streaming Audio Transcription Service:

  • Provides transcription in the automatically detected language in BBB meetings.
  • Enhances communication and inclusivity in meetings, especially in noisy environments.
  • Makes hybrid lectures more accessible, supports individuals who are deaf or hard of hearing, assists in learning new languages, facilitates note-taking and content review, and keeps participants attentive and engaged.

This service aims to improve accessibility and communication, making meetings, lectures, and various other interactions more efficient and inclusive. Because most meetings are held in BigBlueButton (BBB), Voice-AI can be enabled in these meetings on demand to caption the meeting audio and provide the captions in the shared notes.

The BBB service provided by GWDG is a productive and high-capacity video conferencing offering. It is not limited to video conferencing; it is also suitable for e-learning scenarios. The tool runs on HTML5, operates in the browser, and doesn’t require any software installation. It is well suited to collaborative work, data exchange, and live-streaming.

For this, users need to:

  • Provide the BBB room address.
  • Enter the access key, if there is one.
  • Optionally, add terms or words that need to be written correctly in the summary output.

The meeting transcription is written to an Etherpad Lite pad whose URL is created when the transcription starts. The meeting summary appears both in that pad and in the summary box.

Etherpad Lite is an open-source, real-time collaborative editor that allows multiple users to edit documents simultaneously in their browsers.

Ensuring Privacy and Flexibility

Security is essential when dealing with highly important and confidential information: it provides reliability, demonstrates compliance during audits or regulatory inspections, and ensures business continuity in the event of a system failure or attack. User privacy is a cornerstone of this service. Our service therefore does not save your audio, conversations, or results on persistent storage at any point, with two exceptions: BBB transcriptions, which are stored in a local MySQL database, and transcription results for audio files larger than 500 MB, which are stored in our S3 storage. Both are erased after 30 days. The number of requests per user for each service and the respective timestamps are recorded so that we can monitor the system’s usage and perform accounting.
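
The 30-day retention rule can be expressed as a simple expiry check. This is an illustrative sketch only: the function names and the dictionary of stored results are hypothetical and do not describe the actual cleanup job.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

RETENTION = timedelta(days=30)  # retention period stated above


def is_expired(stored_at: datetime, now: Optional[datetime] = None) -> bool:
    """True once a stored result has reached the end of the retention period."""
    now = now or datetime.now(timezone.utc)
    return now - stored_at >= RETENTION


def purge(records: Dict[str, datetime],
          now: Optional[datetime] = None) -> Dict[str, datetime]:
    """Return only the records that are still within the retention period."""
    return {key: ts for key, ts in records.items() if not is_expired(ts, now)}
```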

Web interface and usage

If you have an AcademicCloud account, the web interface can also easily be reached here.

The web interface provides built-in fields and actions that need to be filled in for the service to work. These include:

Audio File Transcription:

  • Input language: Choose the language of the uploaded audio for transcription.
  • Text format: Choose the format of the output, which can be text, SRT, or VTT.
  • Choose file: Upload your audio file, which can be wav, mp3, or mp4.
  • Light/Dark mode (sun/moon button): Toggle between light and dark mode.
  • Footer: Includes “Privacy Policy”, “Terms of use”, “FAQ”, Contact, and the option to switch between English and German.

Streaming Audio Transcription:

  • Room address: Add your BBB room address.
  • Access key: Add your access key or moderation key if available.
  • Correction: Add words or terms that may be written incorrectly; these will be written correctly in the transcription summary.
  • Start transcription: Upon starting transcription, you will receive a notification in the BBB room and also get a pad URL in the summary box.
  • Stop transcription: Upon clicking this, you will receive the meeting summary.
  • Light/Dark mode (sun/moon button): Toggle between light and dark mode.
  • Footer: Includes “Privacy Policy”, “Terms of use”, “FAQ”, Contact, and the option to switch between English and German.

Acknowledgement

Jakob Hördt for writing the proxy.

Marcel Hellkamp for writing the BBB audio captioning code.

Ali Doost Hosseini for the Kong gateway.

Johannes Biermann for technical support.

Author

Narges Lux

Further services

If you have questions, please browse the FAQ first. If you have more specific questions, feel free to contact us at support@gwdg.de.