A financial services company is developing a real-time generative AI (GenAI) assistant to support human call center agents. The GenAI assistant must transcribe live customer speech, analyze context, and provide incremental suggestions to call center agents while a customer is still speaking. To preserve responsiveness, the GenAI assistant must maintain end-to-end latency under 1 second from speech to initial response display. The architecture must use only managed AWS services and must support bidirectional streaming to ensure that call center agents receive updates in real time. Which solution will meet these requirements?
Answer · B
Option B is the only solution that satisfies all strict real-time, streaming, and latency requirements. Amazon Transcribe streaming with partial results allows transcription fragments to be delivered before the speaker finishes a sentence. This significantly reduces perceived latency and enables downstream processing to begin immediately, which is essential for maintaining sub-1-second end-to-end response times. Using Amazon Bedrock’s InvokeModelWithResponseStream API enables token-level or chunk-level streaming responses from the foundation model. This allows the GenAI assistant to begin delivering suggestions to call center agents incrementally instead of waiting for a full model response. This streaming inference capability is critical for interactive, real-time agent assistance use cases. Amazon API Gateway WebSocket APIs provide fully managed, bidirectional communication between backend services and agent dashboards. This ensures that updates flow continuously to agents as new transcription fragments and model outputs become available, preserving real-time responsiveness without requiring custom socket infrastructure. Option A introduces additional synchronous processing layers and storage writes that increase latency. Option C uses batch transcription and post-call processing, which cannot meet real-time requirements. Option D uses embeddings and asynchronous messaging, which are not suitable for live incremental suggestions and bidirectional streaming. Therefore, Option B best aligns with AWS real-time GenAI architecture patterns by combining streaming transcription, streaming model inference, and managed bidirectional communication while maintaining low latency and operational simplicity.
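To make the incremental delivery step concrete, the following is a minimal Python sketch of relaying streamed model output to an agent dashboard one chunk at a time. It is an illustration under stated assumptions, not a reference implementation: the helper names `iter_text_chunks` and `relay_suggestion` are hypothetical, the `extract_text` callback is an assumption because the JSON layout inside each chunk depends on the chosen foundation model, and `send` stands in for whatever pushes bytes to the WebSocket connection. The only shape taken from the service is that each streamed event may carry a `chunk` entry whose `bytes` field holds a JSON payload.

```python
import json
from typing import Callable, Iterable, Iterator


def iter_text_chunks(
    events: Iterable[dict],
    extract_text: Callable[[dict], str],
) -> Iterator[str]:
    """Yield text fragments from a stream of response events.

    Events without a "chunk" entry (for example, error or metadata
    events) are skipped. `extract_text` is a hypothetical callback that
    knows the model-specific payload layout.
    """
    for event in events:
        chunk = event.get("chunk")
        if not chunk:
            continue
        payload = json.loads(chunk["bytes"])
        text = extract_text(payload)
        if text:  # skip empty fragments so the dashboard is not spammed
            yield text


def relay_suggestion(
    events: Iterable[dict],
    send: Callable[[bytes], None],
    extract_text: Callable[[dict], str],
) -> int:
    """Forward each text fragment as soon as it arrives.

    Returns the number of fragments sent. Nothing is buffered until the
    model finishes, which is the point of streaming inference here.
    """
    sent = 0
    for text in iter_text_chunks(events, extract_text):
        send(text.encode("utf-8"))
        sent += 1
    return sent
```

In a deployment, `events` would plausibly be the `body` of a boto3 `bedrock-runtime` `invoke_model_with_response_stream` response, and `send` could wrap the `apigatewaymanagementapi` client's `post_to_connection(ConnectionId=..., Data=...)` call for the agent's WebSocket connection. The connection ID, model ID, and request body are deployment-specific and are deliberately left out of the sketch.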