021Tech - 24/7 Multimodal WhatsApp Customer Support Agent

Faris Sharafli - CEO

Project Overview

021Tech designed and implemented a 24/7 Multimodal WhatsApp Customer Support Agent, a flexible solution for handling diverse customer interactions. The agent converts raw incoming WhatsApp messages (text, audio, video, and images) into text the system can act on, uses an AI agent to generate context-aware responses, and replies to the user automatically.

The system is built on a modular workflow that dynamically processes different message types. Key components include a Media Router for classifying content, specialized processors for handling Audio, Video, and Images, an AI Agent for complex natural language understanding and response generation, and a RAG (Retrieval-Augmented Generation) system leveraging buffered memory and external knowledge sources like Wikipedia. This architecture allows the agent to deliver fast, accurate, and contextually rich support, dramatically reducing the load on human support teams and ensuring customers receive instant assistance 24 hours a day.

Phase 1: WhatsApp Trigger & Media Router

Trigger: USER WhatsApp Message
The workflow is activated instantly upon receipt of any new message sent to the designated WhatsApp business number.

Key Capabilities: Split Out Message Parts
The initial step parses the incoming WhatsApp message, extracts the message body and metadata, and, crucially, determines whether the message contains text, an audio file, a video, or an image.
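As a rough illustration of this parsing step, the sketch below pulls the sender, message type, and content out of an incoming payload. It assumes the payload follows the WhatsApp Cloud API webhook layout (entry, changes, value, messages); the write-up does not state which integration is used, and the names ParsedMessage and split_message_parts are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

# Media types the workflow routes to dedicated processors.
MEDIA_TYPES = {"audio", "video", "image"}


@dataclass
class ParsedMessage:
    sender: str
    kind: str                 # "text", "audio", "video", "image", or "unsupported"
    text: Optional[str]       # message body, or media caption if one was sent
    media_id: Optional[str]   # identifier used later to download the media


def split_message_parts(payload: Dict[str, Any]) -> ParsedMessage:
    """Extract sender, type, and content from a webhook payload.

    Assumption: the payload uses the WhatsApp Cloud API webhook shape.
    """
    message = payload["entry"][0]["changes"][0]["value"]["messages"][0]
    sender = message["from"]
    kind = message.get("type", "unsupported")

    if kind == "text":
        return ParsedMessage(sender, "text", message["text"]["body"], None)
    if kind in MEDIA_TYPES:
        media = message[kind]
        return ParsedMessage(sender, kind, media.get("caption"), media["id"])
    # Anything else (stickers, locations, ...) is flagged rather than dropped.
    return ParsedMessage(sender, "unsupported", None, None)
```

The returned kind field is what a downstream router would branch on.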

Redirect Message Types
This acts as the central router, directing the workflow to the correct processing branch based on the message content (e.g., text messages go to "Text Summarizer," images go to "Analyze Image Message").
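The routing idea can be sketched as a simple lookup table from message type to processing branch. This is a minimal sketch, not the workflow's actual implementation: the handler functions are placeholders that only tag their input, and every name other than the "Text Summarizer" and "Analyze Image Message" branches is an assumption.

```python
from typing import Callable, Dict


# Placeholder branch handlers. In the real workflow each would call an AI
# service; here they only tag the input so the routing itself is visible.
def text_summarizer(content: str) -> str:
    return f"[text] {content}"


def transcribe_audio_message(content: str) -> str:
    return f"[audio] {content}"


def analyze_video_message(content: str) -> str:
    return f"[video] {content}"


def analyze_image_message(content: str) -> str:
    return f"[image] {content}"


# One entry per supported message type.
ROUTES: Dict[str, Callable[[str], str]] = {
    "text": text_summarizer,
    "audio": transcribe_audio_message,
    "video": analyze_video_message,
    "image": analyze_image_message,
}


def redirect_message(kind: str, content: str) -> str:
    """Send the content to the branch registered for its message type."""
    handler = ROUTES.get(kind)
    if handler is None:
        return "[unsupported] message type not handled"
    return handler(content)
```

With a table like this, supporting a new message type means registering one more handler rather than editing the router logic.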

Phase 2: Multimodal Content Processing

1. Audio Messages
Transcribes user voice notes into text using an AI speech-to-text service.

2. Video Messages
Downloads the video and uses an AI model for video analysis and content summarization (e.g., describing the issue shown in a video).

3. Image Messages
Downloads the image and uses an AI model's visual capabilities to describe the image content (e.g., reading text from a screenshot or describing a damaged product).

4. Text Messages
Extracts the text and uses an AI Model for initial summarization or intent classification before sending it to the core AI agent.

Phase 3: AI Response Generation

1. Get User’s Message
A consolidation point that receives the processed text output from any of the multimodal branches (transcribed audio, summarized video, described image, or raw text).
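One way to picture this consolidation point is a function that turns every branch's output into a single, uniformly formatted string for the agent. The labels and the function name get_user_message below are illustrative assumptions; the write-up does not describe the exact format passed to the agent.

```python
# Hypothetical labels telling the agent where the text came from.
SOURCE_LABELS = {
    "text": "Customer wrote",
    "audio": "Customer said (transcribed voice note)",
    "video": "Customer sent a video showing",
    "image": "Customer sent an image showing",
}


def get_user_message(kind: str, processed_text: str) -> str:
    """Merge any branch's output into one normalized message for the agent."""
    label = SOURCE_LABELS.get(kind, "Customer sent")
    # Collapse stray whitespace and line breaks left by upstream processors.
    cleaned = " ".join(processed_text.split())
    return f"{label}: {cleaned}"
```

Keeping the modality in the label lets the agent phrase its reply appropriately, for example by acknowledging a photo rather than treating its description as typed text.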


2. AI Agent
The central brain of the system, this node leverages a powerful OpenAI model to interpret the user's need, access knowledge, and formulate the final response.

3. AI Chat
Maintains a conversational history (Chat Buffer Memory), allowing the agent to remember past turns and provide contextually relevant responses throughout a dialogue.
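A buffer memory of this kind can be approximated with a fixed-size queue per user, as in the sketch below. The class name, the per-user keying, and the default window of 10 messages are assumptions for illustration; the actual memory component and its window size are not specified in this write-up.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple

Message = Tuple[str, str]  # (role, content)


class ChatBufferMemory:
    """Keeps only the most recent messages for each user."""

    def __init__(self, max_messages: int = 10) -> None:
        # A deque with maxlen silently drops the oldest entry when full.
        self._history: Dict[str, Deque[Message]] = defaultdict(
            lambda: deque(maxlen=max_messages)
        )

    def add(self, user_id: str, role: str, content: str) -> None:
        self._history[user_id].append((role, content))

    def recent(self, user_id: str) -> List[Message]:
        # .get avoids creating an empty buffer for users never seen before.
        return list(self._history.get(user_id, ()))
```

Bounding the buffer keeps the prompt size predictable; the trade-off is that details older than the window are forgotten unless they are stored elsewhere.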

4. RAG/Knowledge Retrieval
Integrates external knowledge sources such as a Vector Database (for custom SOPs and knowledge base articles) and Wikipedia (for general facts), ensuring the agent’s response is factually accurate and aligned with the company’s policies.
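The retrieval step can be illustrated with a toy example. The sketch below ranks knowledge base snippets by cosine similarity over simple word counts, which stands in for the embedding search a real vector database would perform; it is not the retrieval method used in the deployed system, and all names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def _vectorize(text: str) -> Counter:
    # Bag-of-words stand-in for a learned embedding.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: List[str], top_k: int = 2) -> List[Tuple[float, str]]:
    """Return up to top_k (score, document) pairs with a non-zero match."""
    q = _vectorize(query)
    scored = [(_cosine(q, _vectorize(d)), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [pair for pair in scored[:top_k] if pair[0] > 0]


def build_prompt(query: str, documents: List[str]) -> str:
    """Prepend retrieved snippets so the agent answers from company content."""
    context = "\n".join(f"- {doc}" for _, doc in retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Grounding the prompt in retrieved snippets is what lets the agent's answers reflect company policy rather than only the model's general knowledge.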

Phase 4: Respond to WhatsApp User

1. Respond to WhatsApp User
This final action node sends the text response, along with any necessary media (documents, videos, etc.), back to the customer's WhatsApp number, completing the interaction loop.


2. Notify customer support
If the AI cannot give the user a satisfactory answer, it notifies the customer support team so that a human agent can contact the user.
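The escalation path might look like the sketch below. How the system decides that an answer is not good enough is not described here, so the confidence score, the 0.6 threshold, and the callback names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AgentReply:
    text: str
    confidence: float  # hypothetical score in [0, 1]; not defined in the source


def respond_or_escalate(
    user_id: str,
    reply: AgentReply,
    send_reply: Callable[[str, str], None],
    notify_support: Callable[[str, str], None],
    threshold: float = 0.6,  # assumed cut-off for a "good" answer
) -> bool:
    """Send the reply, or hand off to human support. Returns True if escalated."""
    if reply.confidence >= threshold and reply.text.strip():
        send_reply(user_id, reply.text)
        return False
    # Tell the customer what is happening, then alert the support team.
    send_reply(user_id, "I'm connecting you with our support team.")
    notify_support(user_id, reply.text)
    return True
```

Passing the sender and notifier in as callbacks keeps the decision logic independent of the messaging and alerting channels.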

Return on Investment

By implementing this Multimodal WhatsApp Customer Support Agent, 021Tech delivers a customer service solution that:

Offers True 24/7 Availability:
Responds to customers instantly at any hour of the day, with no dependence on support staff working hours.

Achieves Full Multimodal Support:
Handles all popular WhatsApp message types (text, audio, video, image), catering to every customer preference.

Reduces Operational Costs:
Automates up to 80% of routine and complex support inquiries, significantly lowering the reliance on human agents.

Enhances Customer Satisfaction:
Delivers immediate, accurate, and context-aware responses, leading to higher customer retention and satisfaction scores.

Scales Seamlessly:
The modular, cloud-native architecture ensures the solution can handle vast message volumes and easily integrate new AI models or knowledge sources as business needs evolve.