
Create transcription

Transcribes audio files into text using relaxAI’s advanced speech-to-text models. This endpoint supports various audio formats and provides options for customization, including language selection and speaker diarization.

POST https://api.relax.ai/v1/audio/transcriptions

Example Request

Python

from openai import OpenAI

client = OpenAI(
    api_key=RELAX_API_KEY,
    base_url="https://api.relax.ai/v1/",
)

audio_file = open("speech.mp3", "rb")
transcript = client.audio.transcriptions.create(
    model="Voxtral-Small-24B-2507",
    file=audio_file,
)
print(transcript.text)
JavaScript

import fs from "node:fs";
import { OpenAI } from "openai";

const openai = new OpenAI({
  apiKey: RELAX_API_KEY,
  baseURL: "https://api.relax.ai/v1/",
});

async function main() {
  const transcription = await openai.audio.transcriptions.create({
    file: fs.createReadStream("audio.mp3"),
    model: "Voxtral-Small-24B-2507",
  });
  console.log(transcription.text);
}

main();
cURL

curl https://api.relax.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $RELAX_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file="@/path/to/file/audio.mp3" \
  -F model="Voxtral-Small-24B-2507"

Response

Returns a 200 OK response code with a JSON object containing the transcription details.


Transcription Response
{
  "text": "In this video, we will explore conversation dialogues between two friends...",
  "logprobs": null,
  "usage": {
    "input_tokens": 3000,
    "output_tokens": 111,
    "total_tokens": 3111,
    "type": "duration",
    "input_token_details": {
      "audio_tokens": 3000,
      "text_tokens": 111
    },
    "seconds": 60
  },
  "duration": 0,
  "language": "",
  "segments": null,
  "words": null
}

Request Body

The following parameters can be included in the request body:


Transcription Request Body

file

  • Type: file
  • Required: Yes
  • Description: The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

model

  • Type: string
  • Required: Yes
  • Description: ID of the model to use. Currently the only supported option is Voxtral-Small-24B-2507.

known_speaker_names

  • Type: array
  • Required: No
  • Description: Optional list of speaker names that correspond to the audio samples provided in known_speaker_references[]. Each entry should be a short identifier (for example customer or agent). Up to 4 speakers are supported.

known_speaker_references

  • Type: array
  • Required: No
  • Description: Optional list of audio samples (as data URLs) that contain known speaker references matching known_speaker_names[]. Each sample must be between 2 and 10 seconds, and can use any of the same input audio formats supported by file.
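
The OpenAI SDKs may not expose these diarization fields directly, so one option is to post the multipart form with a plain HTTP client. Below is a minimal Python sketch, assuming two hypothetical 2-10 second reference clips (agent_sample.wav and customer_sample.wav), a placeholder meeting.mp3 input, and a RELAX_API_KEY environment variable; it encodes each clip as a data URL for known_speaker_references[].

import base64
import os

import requests

def to_data_url(path: str, mime: str = "audio/wav") -> str:
    """Encode a short (2-10 s) speaker reference clip as a data URL."""
    with open(path, "rb") as f:
        return f"data:{mime};base64," + base64.b64encode(f.read()).decode("ascii")

response = requests.post(
    "https://api.relax.ai/v1/audio/transcriptions",
    headers={"Authorization": f"Bearer {os.environ['RELAX_API_KEY']}"},
    files={"file": open("meeting.mp3", "rb")},
    data={
        "model": "Voxtral-Small-24B-2507",
        "response_format": "diarized_json",
        # Names and reference clips are parallel arrays; up to 4 speakers.
        "known_speaker_names[]": ["agent", "customer"],
        "known_speaker_references[]": [
            to_data_url("agent_sample.wav"),
            to_data_url("customer_sample.wav"),
        ],
    },
)
response.raise_for_status()
print(response.json())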

language

  • Type: string
  • Required: No
  • Description: The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.

prompt

  • Type: string
  • Required: No
  • Description: An optional text to guide the model’s style or continue a previous audio segment. The prompt should match the audio language.
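
Both language and prompt are plain request fields, so they pass straight through the SDK. A minimal Python sketch, assuming a placeholder interview.mp3 file and the same client setup as the examples above:

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["RELAX_API_KEY"],
    base_url="https://api.relax.ai/v1/",
)

with open("interview.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="Voxtral-Small-24B-2507",
        file=audio_file,
        language="en",  # ISO-639-1 code; improves accuracy and latency
        prompt="Transcript of a product interview between two engineers.",
    )
print(transcript.text)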

response_format

  • Type: string
  • Required: No
  • Default: json
  • Description: The format of the output, in one of these formats: json, text, srt, verbose_json, vtt, or diarized_json.
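
For the non-JSON formats, the response body is the document itself rather than an object. A minimal Python sketch, assuming the same client setup as above, that writes SRT subtitles to disk:

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["RELAX_API_KEY"],
    base_url="https://api.relax.ai/v1/",
)

with open("speech.mp3", "rb") as audio_file:
    srt_doc = client.audio.transcriptions.create(
        model="Voxtral-Small-24B-2507",
        file=audio_file,
        response_format="srt",  # returned as a plain string, not JSON
    )

with open("speech.srt", "w") as out:
    out.write(srt_doc)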

stream

  • Type: boolean
  • Required: No
  • Default: false
  • Description: If set to true, the model response data will be streamed to the client as it is generated using server-sent events. See the Streaming section of the Speech-to-Text guide for more information.
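
A minimal Python sketch of consuming the stream, assuming the OpenAI SDK's transcription streaming events (objects with a type field carrying incremental text deltas); the exact event names here are an assumption, so check the Streaming guide:

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["RELAX_API_KEY"],
    base_url="https://api.relax.ai/v1/",
)

with open("speech.mp3", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        model="Voxtral-Small-24B-2507",
        file=audio_file,
        stream=True,
    )
    for event in stream:
        # Assumed event shape: "transcript.text.delta" carries partial text.
        if event.type == "transcript.text.delta":
            print(event.delta, end="", flush=True)
print()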

temperature

  • Type: number
  • Required: No
  • Default: 0
  • Description: The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.

timestamp_granularities

  • Type: array
  • Required: No
  • Description: The timestamp granularities to populate for this transcription. response_format must be set to verbose_json to use timestamp granularities. Currently only word is supported.
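
A minimal Python sketch requesting word-level timestamps, assuming the same client setup as the earlier examples; note that response_format must be verbose_json:

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["RELAX_API_KEY"],
    base_url="https://api.relax.ai/v1/",
)

with open("speech.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="Voxtral-Small-24B-2507",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"],
    )

# Each entry carries the word plus its start/end offsets in seconds.
for word in transcript.words:
    print(f"{word.start:6.2f}-{word.end:6.2f}s  {word.word}")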