How to run inference with streaming responses from Amazon Bedrock LLMs

In this tutorial, let's look at how to run inference against Amazon Bedrock LLMs and stream the response back as it is generated.

Amazon Bedrock offers a range of LLMs (along with the Stable Diffusion image model) that you can interact with through a single API. To get streaming output from the LLMs on Amazon Bedrock, we use the invoke_model_with_response_stream API.

First, let's cover the prerequisites. You need AWS Console access and model access enabled in Amazon Bedrock. Learn how to set up your environment and request model access in Getting Started with Amazon Bedrock.

Set up the environment

You need to configure an AWS profile, which requires the AWS CLI. Install or update to the latest version of the AWS CLI from the official documentation here, then refer to Configure the AWS CLI.

aws configure

AWS Access Key ID [None]: <insert_access_key>
AWS Secret Access Key [None]: <insert_secret_key>
Default region name [None]: <insert_aws_region>
Default output format [json]: json

Your credentials and configuration will be stored in ~/.aws/ on *nix-based systems and %UserProfile%\.aws\ on Windows.

To verify your profile, run aws sts get-caller-identity:

{
    "UserId": "AIDA9EUKQEJHEWE27S1RA",
    "Account": "012345678901",
    "Arn": "arn:aws:iam::012345678901:user/dev-admin"
}
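
You can also do a quick sanity check that Bedrock is reachable from your profile by listing the foundation models in your region. Note that this only lists the models available in the region; model access itself is still granted in the Bedrock console.

aws bedrock list-foundation-models --region us-east-1 --query "modelSummaries[].modelId"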

Now create a directory for our project.

mkdir bedrock-streaming-demo
cd bedrock-streaming-demo

Let's create a virtual environment and install the necessary packages.

python3 -m venv env
source env/bin/activate

Windows users:

python -m venv env
env\Scripts\activate

With the virtual environment activated, upgrade pip and install boto3. Amazon Bedrock is an AWS service and its API is accessed through boto3, so we install the latest version.

pip install -U pip
pip install boto3
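
If boto3 was already installed, it's worth checking that the version is recent enough to include the Bedrock clients (support landed in the boto3 1.28.x releases around Bedrock's launch; upgrade if the bedrock-runtime client is missing):

python -c "import boto3; print(boto3.__version__)"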

Finally, record the installed packages in requirements.txt.

pip freeze > requirements.txt

Let's create a main.py file and start implementing the logic for interacting with Amazon Bedrock's Llama 2 Chat 13B LLM from the provider Meta. Its base model ID is meta.llama2-13b-chat-v1.

touch main.py

Now open the file main.py and copy the following code.

import boto3
import json

llamaModelId = 'meta.llama2-13b-chat-v1' 
prompt = "What is the difference between a llama and an alpaca?"

We import the necessary packages and define the model ID and the prompt.
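
Optionally, you can wrap the prompt in Llama 2's chat instruction tags. This follows the Llama 2 chat prompt convention rather than any Bedrock requirement, so treat it as an optional sketch:

# Optional: Llama 2 chat models are commonly prompted with [INST] ... [/INST] tags.
# This is a convention of the model family, not something the Bedrock API requires.
prompt = f"[INST] {prompt} [/INST]"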

Before we run inference, let's implement a function to print the response stream from Amazon Bedrock to the console.

def print_response_stream(response_stream):
    # The response body is an event stream; each event carries a JSON-encoded chunk.
    event_stream = response_stream.get('body')
    for event in event_stream:
        chunk_bytes = event['chunk']['bytes']
        chunk = json.loads(chunk_bytes.decode())
        # For the Meta Llama models, the generated text lives in the 'generation' field.
        text = chunk.get('generation', '')
        print(text, end='', flush=True)
    print()

invoke_model_with_response_stream returns a dict whose body is an event stream. Each event carries a chunk of JSON-encoded bytes, so we iterate over the body, decode each chunk into text, and print it to the console as it arrives.
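
For reference, a single decoded chunk from the Meta Llama models looks roughly like this (an illustrative sketch of the shape, not captured output; exact fields can vary by model):

{
    "generation": " Llamas and alpacas are",
    "prompt_token_count": null,
    "generation_token_count": 12,
    "stop_reason": null
}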

Now let's create the payload for the inference request to the LLM on Amazon Bedrock.

llamaPayload = json.dumps({
    'prompt': prompt,
    'max_gen_len': 512,
    'top_p': 0.9,
    'temperature': 0.2
})

Our payload is JSON with the inference parameters: max_gen_len caps the number of generated tokens, temperature controls randomness, and top_p sets the nucleus-sampling threshold. Using boto3 we create the bedrock-runtime client and then invoke the model with invoke_model_with_response_stream as follows.

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime', 
    region_name='us-east-1'
)

response = bedrock_runtime.invoke_model_with_response_stream(
    body=llamaPayload, 
    modelId=llamaModelId, 
    accept='application/json', 
    contentType='application/json'
)

The modelId is meta.llama2-13b-chat-v1, and you can choose from the range of LLMs listed here (note that Stable Diffusion is a diffusion model, not a language model).
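
If you'd like to look up model IDs programmatically instead, the Bedrock control-plane client exposes list_foundation_models. A minimal sketch, assuming the byProvider filter matches the provider name shown in the console:

import boto3

# 'bedrock' is the control-plane client (model catalogue); 'bedrock-runtime' is for inference.
bedrock = boto3.client(service_name='bedrock', region_name='us-east-1')

models = bedrock.list_foundation_models(byProvider='Meta')
for summary in models['modelSummaries']:
    print(summary['modelId'])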

Finally, we pass the response from invoke_model_with_response_stream to the print_response_stream function we defined earlier, which prints the streamed output to the console as follows.

print_response_stream(response)

The output streams to the console token by token as the model generates it.
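
To try it, run the script from the project directory:

python main.py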

Full code

import boto3
import json

llamaModelId = 'meta.llama2-13b-chat-v1' 
prompt = "What is the difference between a llama and an alpaca?"

def print_response_stream(response_stream):
    # The response body is an event stream; each event carries a JSON-encoded chunk.
    event_stream = response_stream.get('body')
    for event in event_stream:
        chunk_bytes = event['chunk']['bytes']
        chunk = json.loads(chunk_bytes.decode())
        # For the Meta Llama models, the generated text lives in the 'generation' field.
        text = chunk.get('generation', '')
        print(text, end='', flush=True)
    print()

llamaPayload = json.dumps({
    'prompt': prompt,
    'max_gen_len': 512,
    'top_p': 0.9,
    'temperature': 0.2
})

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime', 
    region_name='us-east-1'
)
response = bedrock_runtime.invoke_model_with_response_stream(
    body=llamaPayload, 
    modelId=llamaModelId, 
    accept='application/json', 
    contentType='application/json'
)
print_response_stream(response)

Summary

In this tutorial, we learnt how to run inference against Amazon Bedrock LLMs with a streaming response. We saw how straightforward streaming is with invoke_model_with_response_stream, and how to parse the stream and print it to the console with a small helper function.

Please bookmark 🔖 this post and please share this with your friends and colleagues.