Processing Audio with Whisper and Managing State

You've already built a skill to process text messages from Telegram. Now, let's enhance it to handle audio messages and prevent duplicate entries. By the end of this lesson, you'll be able to integrate Whisper for audio transcription into your AI skills and manage processing state to avoid duplicate work.

Core idea

When building AI agents that interact with external services, you often encounter two challenges: processing different data types (like audio) and ensuring you only process new data. For audio, you need a tool to convert speech to text. For avoiding duplicates, you need a way to track what's already been processed.

Whisper is an open-source speech-to-text model developed by OpenAI. It transcribes audio in many languages and can run locally on your machine, so your audio data doesn't leave your computer. When you instruct Claude Code to "transcribe audio using Whisper," it will typically install Whisper if needed and run it for you. On first use, Whisper downloads the model weights, which might take a few minutes.
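As a rough illustration of what running Whisper locally involves, the sketch below uses the open-source `openai-whisper` Python package, which also needs `ffmpeg` available on your system. It is not necessarily what Claude Code will generate for you; the helper name `load_whisper_transcriber` and the choice of the `base` model are assumptions made for this example.

```python
from typing import Callable


def load_whisper_transcriber(model_name: str = "base") -> Callable[[str], str]:
    """Load a local Whisper model and return a function mapping audio path -> text."""
    # Imported lazily so this module can be loaded even before Whisper is installed
    # (install with: pip install openai-whisper).
    import whisper

    # The first call downloads the model weights; later calls reuse the local copy.
    model = whisper.load_model(model_name)

    def transcribe(audio_path: str) -> str:
        # transcribe() returns a dict; the full transcript is under the "text" key.
        result = model.transcribe(audio_path)
        return result["text"].strip()

    return transcribe


# Example usage (hypothetical file name):
#   transcribe = load_whisper_transcriber("base")
#   print(transcribe("voice_message.ogg"))
```

Smaller models such as `base` download and run faster, while larger ones are generally more accurate but slower.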

To prevent reprocessing old data, you can leverage unique identifiers provided by the external service. Telegram, for example, assigns a unique update_id to every incoming update, and each new message your bot receives arrives as one update. These IDs increase sequentially. By storing the update_id of the last processed message, your skill can pick up exactly where it left off and process only messages with a higher update_id. This keeps each run efficient and keeps redundant entries out of your output files.
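The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, updates are assumed to be dictionaries with an `update_id` key (the shape Telegram's Bot API returns), and the state file path matches the one used in the walkthrough.

```python
from pathlib import Path
from typing import Dict, List, Optional

# Assumed location of the state file, relative to the project root.
STATE_FILE = Path("telegram-notes/.last_update_id")


def load_last_update_id(state_file: Path = STATE_FILE) -> Optional[int]:
    """Return the saved update_id, or None on the first run."""
    if not state_file.exists():
        return None
    text = state_file.read_text().strip()
    return int(text) if text else None


def select_new_updates(updates: List[Dict], last_update_id: Optional[int]) -> List[Dict]:
    """Keep only updates newer than the saved ID, oldest first."""
    new = [
        u for u in updates
        if last_update_id is None or u["update_id"] > last_update_id
    ]
    return sorted(new, key=lambda u: u["update_id"])


def save_last_update_id(processed: List[Dict], state_file: Path = STATE_FILE) -> None:
    """Persist the highest update_id among the updates that were processed."""
    if not processed:
        return  # nothing new, so keep the previous state untouched
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(str(max(u["update_id"] for u in processed)))
```

Saving the ID only after processing finishes means a run that fails midway will retry the same messages next time instead of silently skipping them.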

Walkthrough

Let's enhance your existing /telegram-notes skill to handle audio messages using Whisper and manage the processing state with update_id.

Task: Modify your /telegram-notes skill to include Whisper transcription and state management.

  1. Open your SKILL.md file: Navigate to .claude/skills/telegram-notes/SKILL.md in your project's file explorer (e.g., VS Code).

  2. Add Whisper transcription: Locate the part of your skill that processes messages. You'll need to add an instruction for Claude Code to transcribe audio messages using Whisper before it attempts to classify them.

    # Example instruction to add to your SKILL.md
    - For any audio or voice messages, transcribe them into text using local Whisper. If Whisper is not installed, install it first.

    This instruction tells Claude Code to identify audio messages and use Whisper to convert them into text, making them available for subsequent classification.

  3. Implement state management: You need to instruct Claude Code to save the update_id of the last processed message and then use this information on subsequent runs.

    # Example instructions to add to your SKILL.md
    - After processing all messages, save the `update_id` of the last processed message to the file `telegram-notes/.last_update_id`.
    - When starting, if the file `telegram-notes/.last_update_id` exists, read the last `update_id` from it and only process messages with an `update_id` greater than the saved one.

    This ensures that your skill will only fetch and process new messages, avoiding duplicates.

  4. Save and restart Claude Code: Save the SKILL.md file. Then, exit Claude Code (by typing /quit or pressing Ctrl+C) and restart it by typing claude in your terminal. This reloads your updated skill.

  5. Test your updated skill: Send a mix of 5-10 new text and voice messages to your Telegram bot. Then, run your skill:

    /telegram-notes

    Verify that both text and voice messages are processed and classified correctly in your ideas.md and tasks.md files.

  6. Test duplicate prevention: Send a couple more new messages (text or voice) to your bot. Then, run /telegram-notes again. Check your ideas.md and tasks.md files. The new messages should be added, but the previously processed messages should not be duplicated.
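If you would rather not check for duplicates by eye, a small script can do it. The sketch below assumes each note occupies its own line in ideas.md and tasks.md, which may not match how your skill formats its output; `find_duplicate_lines` is a hypothetical helper name.

```python
from collections import Counter
from pathlib import Path
from typing import List


def find_duplicate_lines(path: str) -> List[str]:
    """Return the lines that appear more than once in a notes file."""
    text = Path(path).read_text(encoding="utf-8")
    # Ignore blank lines and Markdown headings; compare the remaining lines exactly.
    entries = [
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    counts = Counter(entries)
    return [entry for entry, count in counts.items() if count > 1]


# Example usage after running /telegram-notes twice:
#   for name in ("ideas.md", "tasks.md"):
#       print(name, find_duplicate_lines(name))
```

An empty list for both files suggests the state file is doing its job. Keep in mind that two genuinely different messages with identical wording would also be reported.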

Common mistakes

  • Forgetting to restart Claude Code: If you modify a skill's SKILL.md file, Claude Code needs to be restarted for the changes to take effect.
  • Not specifying "local Whisper": If you don't specify "local Whisper," Claude Code might try to use a cloud-based transcription service, which could have privacy or cost implications.
  • Incorrectly handling update_id: Ensure the logic for saving and reading the .last_update_id file is clear and correctly implemented in your skill's instructions to avoid processing old messages or missing new ones.
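One way to reduce off-by-one errors with update_id is to let Telegram do the filtering. Assuming your skill fetches messages with the Bot API's getUpdates method, its `offset` parameter names the first update to return, so it should be the saved update_id plus one. The sketch below only builds the request URL; `build_get_updates_url` is a hypothetical helper and `token` stands in for your bot token.

```python
from typing import Optional
from urllib.parse import urlencode

API_BASE = "https://api.telegram.org"


def build_get_updates_url(token: str, last_update_id: Optional[int]) -> str:
    """Build a getUpdates URL that asks only for updates after the saved ID."""
    params = {}
    if last_update_id is not None:
        # offset is the first update to return, so it must be one greater
        # than the highest update_id that has already been processed.
        params["offset"] = last_update_id + 1
    url = f"{API_BASE}/bot{token}/getUpdates"
    query = urlencode(params)
    return f"{url}?{query}" if query else url
```

Calling getUpdates with an offset also marks earlier updates as confirmed on Telegram's side, so they are no longer returned. That is another reason to save the ID only after the messages have been written to your notes.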

Key takeaways

  • Whisper is an OpenAI model for local, multi-language speech-to-text transcription.
  • You can instruct Claude Code to use Whisper by simply asking it to "transcribe audio using local Whisper."
  • update_id is a unique, sequentially increasing identifier that Telegram assigns to each update (including every incoming message), useful for tracking processing state.
  • Storing the last processed update_id allows your skill to avoid reprocessing old data and prevent duplicates.
  • Restart Claude Code after modifying SKILL.md for changes to take effect.