ictnlp/StreamUni

StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model

Shoutao Guo, Xiang Li, Mengge Liu, Wei Chen, Yang Feng*

StreamUni is a framework that efficiently enables unified Large Speech-Language Models to accomplish streaming speech translation in a cohesive manner. Experimental results demonstrate that StreamUni efficiently achieves state-of-the-art performance on streaming speech translation tasks across multiple directions.

Our method achieves state-of-the-art performance on the streaming En-De task and the simultaneous En-Zh task.

🔥 Quick Start

Requirements

  • Install packages:

    pip install -r requirements.txt
    # For Stream Evaluation
    wget https://www-i6.informatik.rwth-aachen.de/web/Software/mwerSegmenter.tar.gz
    tar -zxvf mwerSegmenter.tar.gz

Fine-tuning Model

  • You can fine-tune Phi-4-Multimodal by running `bash fintune/finetune.sh`:
    MODEL_NAME=model_dir
    VOICE_DIR=train_json_dir
    OUTPUT_DIR=StreamUni_model
    BATCH_SIZE=32
    BATCH_SIZE_PER_GPU=2
    NUM_EPOCHS=1
    LEARNING_RATE=4e-5
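    WEIGHT_DECAY=0.01
    # MASTER_PORT is used below but was not set here; 29500 is DeepSpeed's default, any free port works
    MASTER_PORT=29500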
    deepspeed \
     --include localhost:0,1,2,3,4,5,6,7 \
     --master_port $MASTER_PORT \
     speech_finetune.py \
     --deepspeed zero2.json \
     --model_name_or_path $MODEL_NAME \
     --voice_dir $VOICE_DIR \
     --output_dir $OUTPUT_DIR \
     --batch_size $BATCH_SIZE \
     --batch_size_per_gpu $BATCH_SIZE_PER_GPU \
     --learning_rate $LEARNING_RATE \
     --wd $WEIGHT_DECAY \
     --use_flash_attention

    We provide an example train_json_dir in fintune/train_example.json.
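The script sets a global BATCH_SIZE of 32 and a BATCH_SIZE_PER_GPU of 2 across the 8 GPUs listed in --include. One common way to reconcile these values is gradient accumulation. The sketch below shows that arithmetic only; it assumes the training script derives the accumulation steps this way, which this README does not confirm, and the function name grad_accum_steps is hypothetical.

```python
def grad_accum_steps(global_batch: int, per_gpu_batch: int, num_gpus: int) -> int:
    """Number of micro-batches to accumulate before one optimizer step.

    Assumption: global_batch = per_gpu_batch * num_gpus * accumulation_steps.
    """
    per_step = per_gpu_batch * num_gpus  # samples processed in one forward/backward pass
    if per_step <= 0:
        raise ValueError("per_gpu_batch and num_gpus must be positive")
    if global_batch % per_step != 0:
        raise ValueError(
            f"global batch {global_batch} is not a multiple of {per_step}"
        )
    return global_batch // per_step


if __name__ == "__main__":
    # Values from the fine-tuning script: BATCH_SIZE=32, BATCH_SIZE_PER_GPU=2, 8 GPUs
    print(grad_accum_steps(32, 2, 8))
```

If you train on fewer GPUs, keeping BATCH_SIZE fixed and letting the accumulation steps grow keeps the effective batch size unchanged under this assumption.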

Inference and Evaluation

  • You can run streaming speech translation inference by running `bash inference/infer.sh`:
    MODEL_DIR="model_dir"
    CHUNK_LENGTH=640
    QUEUE_SIZE=3
    WAIT_K=5
    INSTRUCTION='Transcribe the audio to text, and then translate the audio to German. Use <sep> as a separator between the original transcript and the translation.'
    JSON_DIR="json_dir"
    OUTPUT_DIR="output_dir"
    LANG_PAIR="en_de"
    python stream_st_infer.py \
     --model_path "$MODEL_DIR" \
     --chunk_length $CHUNK_LENGTH \
     --queue_size $QUEUE_SIZE \
     --wait_k $WAIT_K \
     --cot_instruction "$INSTRUCTION" \
     --infer_json "$JSON_DIR" \
     --output_dir "$OUTPUT_DIR" \
     --lang_pair "$LANG_PAIR"
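WAIT_K=5 in the script above suggests a wait-k style read/write policy. As a rough illustration, the sketch below generates the generic wait-k schedule: read k source units first, then alternate one write per read until the source is exhausted, then write the rest. The function name wait_k_actions and the notion of a "unit" are hypothetical; how stream_st_infer.py actually combines --wait_k with --chunk_length and --queue_size is not described in this README.

```python
from typing import List


def wait_k_actions(source_len: int, target_len: int, k: int) -> List[str]:
    """Return the READ/WRITE sequence of a generic wait-k policy.

    Before writing target unit t (0-indexed), the policy requires
    min(k + t, source_len) source units to have been read.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    actions: List[str] = []
    read = written = 0
    while written < target_len:
        needed = min(k + written, source_len)
        if read < needed:
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions


if __name__ == "__main__":
    # 4 source units, 3 target units, k=2
    print(wait_k_actions(4, 3, 2))
```

A larger k delays the first output and usually improves quality at the cost of latency, which is the trade-off the latency and quality metrics below measure.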

We provide an example json_dir in inference/example_infer.json. Running the inference script produces the output results; an example is in inference/example_infer.json. You can then evaluate the results.

  • You can compute Stream LAAL, Stream SacreBLEU, Stream COMET, Document-level SacreBLEU and Document-level COMET by running `bash inference/eval.sh`:

     OUTPUT_DIR=output_dir
     OUTPUT_FILE=output_file
     SEG_SOURCE_FILE=seg_source_file
     SEG_TARGET_FILE=seg_target_file
     STREAM_SOURCE_FILE=stream_source_file
     STREAM_TARGET_FILE=stream_target_file
     
     python latency_cal.py --directory $OUTPUT_DIR --file_name $OUTPUT_FILE
     
     cd $OUTPUT_DIR
     
     echo "Stream BLEU: "
     sacrebleu $SEG_TARGET_FILE -i segment_translation.txt -m bleu -b -w 4 -lc
     
     echo "Stream COMET: "
     comet-score -s $SEG_SOURCE_FILE -t segment_translation.txt -r $SEG_TARGET_FILE --model comet-22/model.ckpt
     
     echo "Document BLEU: "
     sacrebleu $STREAM_TARGET_FILE -i stream_translation.txt -m bleu -b -w 4 -lc
     
     echo "Document COMET: "
     comet-score -s $STREAM_SOURCE_FILE -t stream_translation.txt -r $STREAM_TARGET_FILE --model comet-22/model.ckpt
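For readers unfamiliar with the latency metric, here is a minimal Python sketch of Length-Adaptive Average Lagging (LAAL) for a single segment, following the commonly used definition in which the ideal delay of target token i is i · |X| / max(|Y|, |Y*|). It is an illustration under stated assumptions, not the code in latency_cal.py: the function name laal, the argument names, and the use of milliseconds are hypothetical, and the Stream LAAL reported by the script may involve additional re-segmentation.

```python
from typing import Sequence


def laal(delays_ms: Sequence[float], source_ms: float, reference_len: int) -> float:
    """Length-Adaptive Average Lagging for one segment, in milliseconds.

    delays_ms[i] is the amount of source audio (ms) that had been consumed
    when target token i was emitted. reference_len is the reference length
    in tokens.
    """
    if not delays_ms:
        raise ValueError("need at least one emitted token")
    if source_ms <= 0:
        raise ValueError("source duration must be positive")
    # Normalise by the longer of hypothesis and reference so that
    # over-generation is not rewarded with a lower lag.
    denom = max(len(delays_ms), reference_len)
    total = 0.0
    tau = 0
    for i, delay in enumerate(delays_ms):
        total += delay - i * source_ms / denom
        tau = i + 1
        if delay >= source_ms:
            # Stop at the first token emitted after the full source was read.
            break
    return total / tau


if __name__ == "__main__":
    # Three tokens emitted after 1 s, 2 s and 3 s of a 3 s utterance
    print(laal([1000, 2000, 3000], 3000, 3))
```

Lower values mean the translation lags less behind the speaker; the score is usually read alongside BLEU and COMET at the same setting.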

🖋 Citation

If you have any questions, please feel free to submit an issue or contact guoshoutao22z@ict.ac.cn.

If our work is useful for you, please cite as:

@misc{guo2025streamuniachievingstreamingspeech,
 title={StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model}, 
 author={Shoutao Guo and Xiang Li and Mengge Liu and Wei Chen and Yang Feng},
 year={2025},
 eprint={2507.07803},
 archivePrefix={arXiv},
 primaryClass={cs.CL},
 url={https://arxiv.org/abs/2507.07803}, 
}
