How to run inference with streaming responses from Amazon Bedrock LLMs
In this tutorial, let's uncover how to run inference with streaming responses from Amazon Bedrock LLMs.
Amazon Bedrock offers a range of LLMs, along with the Stable Diffusion image model, to interact with and run inference on. We run inference on the LLMs available on Amazon Bedrock using the invoke_model_with_response_stream API.
First, let's cover the prerequisites. You need AWS Console access and model access granted in Amazon Bedrock. Learn how to set up your environment and model access in Getting Started with Amazon Bedrock.
Setup Environment
You need to configure an AWS profile, which requires the AWS CLI. Install or update to the latest version of the AWS CLI from the official documentation here, then refer to Configure the AWS CLI.
aws configure
AWS Access Key ID [None]: <insert_access_key>
AWS Secret Access Key [None]: <insert_secret_key>
Default region name [None]: <insert_aws_region>
Default output format [json]: json
Your credentials and configuration will be stored in ~/.aws/ on *nix-based OSes and in %UserProfile%\.aws\ on Windows.
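For reference, the stored profile looks roughly like this (the values below are placeholders):
# ~/.aws/credentials
[default]
aws_access_key_id = <insert_access_key>
aws_secret_access_key = <insert_secret_key>

# ~/.aws/config
[default]
region = <insert_aws_region>
output = json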
To verify your profile, run aws sts get-caller-identity:
{
    "UserId": "AIDA9EUKQEJHEWE27S1RA",
    "Account": "012345678901",
    "Arn": "arn:aws:iam::012345678901:user/dev-admin"
}
Now create a directory for our project.
mkdir bedrock-streaming-demo
cd bedrock-streaming-demo
Let's create a virtual environment and install necessary packages.
python3 -m venv env
source env/bin/activate
Windows users:
python -m venv env
env\Scripts\activate
pip install -U pip
pip install boto3
We install boto3 because we will be accessing the Amazon Bedrock API, which is a service on AWS; the latest version of boto3 includes the clients we need to access Amazon Bedrock.
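Bedrock support only exists in relatively recent boto3 releases, so as a quick sanity check you can print the installed version and confirm that boto3 knows about the bedrock-runtime service:
python -c "import boto3; print(boto3.__version__)"
python -c "import boto3; boto3.client('bedrock-runtime', region_name='us-east-1')"
If the second command raises an UnknownServiceError, upgrade boto3 with pip install -U boto3.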
Finally, we store the package info in requirements.txt.
pip freeze > requirements.txt
Let's create a main.py file to start implementing the logic for interacting with Amazon Bedrock's Llama 2 Chat 13B LLM from the provider Meta. Its base model ID is meta.llama2-13b-chat-v1.
touch main.py
Now open main.py and copy the following code.
import boto3
import json
llamaModelId = 'meta.llama2-13b-chat-v1'
prompt = "What is the difference between a llama and an alpaca?"
We import the necessary packages and define the model ID and the prompt.
Before we run inference, let's implement a function that prints the response stream from Amazon Bedrock to the console.
def print_response_stream(response_stream):
    # The streaming response body is an event stream that yields events
    # as the model generates tokens.
    event_stream = response_stream.get('body')
    for event in event_stream:
        chunk_bytes = event['chunk']['bytes']
        # Each chunk is a small JSON document; decode it and pull out
        # the generated text fragment.
        generation = json.loads(chunk_bytes.decode())
        line = generation.get('generation')
        if line == '\n':
            print('')
            continue
        print(line, end='', flush=True)
invoke_model_with_response_stream returns a dict containing various fields, including a body that yields chunks of JSON-encoded bytes. So we iterate over the response body, decode each chunk into human-readable text, and print it to the console.
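For context, each decoded chunk from the Llama 2 chat model is a small JSON document; a single chunk looks roughly like the following (the values here are illustrative, and the exact fields can vary by model):
{
    "generation": " Llamas",
    "prompt_token_count": null,
    "generation_token_count": 2,
    "stop_reason": null
}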
Now let's create the payload to inference LLM on Amazon Bedrock.
llamaPayload = json.dumps({
    'prompt': prompt,
    'max_gen_len': 512,
    'top_p': 0.9,
    'temperature': 0.2
})
Our payload is JSON with various inference parameters. Using boto3, we obtain the Bedrock runtime client and then invoke the model with invoke_model_with_response_stream as follows.
bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-east-1'
)

response = bedrock_runtime.invoke_model_with_response_stream(
    body=llamaPayload,
    modelId=llamaModelId,
    accept='application/json',
    contentType='application/json'
)
The modelId is meta.llama2-13b-chat-v1, and you can choose from the range of LLMs listed here (note that Stable Diffusion is a diffusion model, not a language model).
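If you want to discover model IDs programmatically instead, the Bedrock control-plane client exposes list_foundation_models; here is a minimal sketch, filtering to text-output models:
import boto3

# The 'bedrock' client (as opposed to 'bedrock-runtime') exposes
# control-plane operations such as listing foundation models.
bedrock = boto3.client(service_name='bedrock', region_name='us-east-1')

# Restrict the listing to models that produce text output, i.e. the LLMs.
models = bedrock.list_foundation_models(byOutputModality='TEXT')
for summary in models['modelSummaries']:
    print(summary['modelId'])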
Finally, we pass the response returned by invoke_model_with_response_stream to the function we defined earlier, which prints the streamed output to the console.
print_response_stream(response)
The output streams to the console token by token as the model generates it.
Full code
import boto3
import json

llamaModelId = 'meta.llama2-13b-chat-v1'
prompt = "What is the difference between a llama and an alpaca?"


def print_response_stream(response_stream):
    # The streaming response body is an event stream that yields events
    # as the model generates tokens.
    event_stream = response_stream.get('body')
    for event in event_stream:
        chunk_bytes = event['chunk']['bytes']
        # Each chunk is a small JSON document; decode it and pull out
        # the generated text fragment.
        generation = json.loads(chunk_bytes.decode())
        line = generation.get('generation')
        if line == '\n':
            print('')
            continue
        print(line, end='', flush=True)


llamaPayload = json.dumps({
    'prompt': prompt,
    'max_gen_len': 512,
    'top_p': 0.9,
    'temperature': 0.2
})

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-east-1'
)

response = bedrock_runtime.invoke_model_with_response_stream(
    body=llamaPayload,
    modelId=llamaModelId,
    accept='application/json',
    contentType='application/json'
)

print_response_stream(response)
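Run the script and you should see the response streaming into your terminal.
python main.py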
Summary
In this tutorial, we learnt how to run inference on Amazon Bedrock LLMs with streaming responses. We saw how easy it is to implement streaming with invoke_model_with_response_stream, and how to parse and print its streaming response to the console using our own helper function.
Please bookmark 🔖 this post and share it with your friends and colleagues.