EgoTrigger: Toward Audio-Driven Image Capture for Human Memory Enhancement in All-Day Energy-Efficient Smart Glasses

ISMAR 2025

TVCG Journal Paper [Top 8% - 60 of 762 Submissions]

Akshay Paruchuri1, Sinan Hersek2, Lavisha Aggarwal2, Qiao Yang2, Xin Liu2, Achin Kulshrestha2, Andrea Colaco2, Henry Fuchs1, Ishan Chatterjee2
1UNC Chapel Hill, 2Google

EgoTrigger uses a lightweight audio classification model and a custom classification head to trigger image capture from hand-object interaction (HOI) audio cues, such as the sound of a medication bottle being opened. Our approach uses 54% fewer frames on average while maintaining or exceeding performance on an episodic memory task, demonstrating its potential for energy-efficient, all-day smart glasses that enhance human memory.

Abstract

All-day smart glasses are likely to emerge as platforms capable of continuous contextual sensing, uniquely positioning them for unprecedented assistance in our daily lives. Integrating the multi-modal AI agents required for human memory enhancement while performing continuous sensing, however, presents a major energy efficiency challenge for all-day usage. Achieving this balance requires intelligent, context-aware sensor management. Our approach, EgoTrigger, leverages audio cues from the microphone to selectively activate power-intensive cameras, enabling efficient sensing while preserving substantial utility for human memory enhancement. EgoTrigger uses a lightweight audio model (YAMNet) and a custom classification head to trigger image capture from hand-object interaction (HOI) audio cues, such as the sound of a drawer opening or a medication bottle being opened.

In addition to evaluating on the QA-Ego4D dataset, we introduce and evaluate on the Human Memory Enhancement Question-Answer (HME-QA) dataset. Our dataset contains 340 human-annotated first-person QA pairs from full-length Ego4D videos, curated to ensure they contain audio and to focus on HOI moments critical for contextual understanding and memory. Our results show EgoTrigger can use 54% fewer frames on average, significantly saving energy in both power-hungry sensing components (e.g., cameras) and downstream operations (e.g., wireless transmission), while achieving comparable performance on both datasets for an episodic memory task. We believe this context-aware triggering strategy represents a promising direction for enabling energy-efficient, functional smart glasses capable of all-day use -- supporting applications like helping users recall where they placed their keys or information about their routine activities (e.g., taking medications).

Architecture

Our approach processes audio from the smart glasses' microphone with a pre-trained YAMNet model to extract 1024-dimensional embeddings, which are then passed through a 4-layer dense network that serves as our custom binary classification head.
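As a rough illustration, the sketch below shows how such a pipeline could be wired up with the publicly available YAMNet model from TensorFlow Hub and a Keras dense head. The layer widths, variable names, and the hoi_probabilities helper are our own illustrative choices, not the trained configuration used in the paper.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Pre-trained YAMNet from TensorFlow Hub: takes a 16 kHz mono float32 waveform
# and returns (scores, embeddings, log-mel spectrogram), producing one
# 1024-dimensional embedding per ~0.96 s audio frame.
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

# Illustrative 4-layer dense binary classification head; the exact layer
# widths are assumptions, since only "4-layer dense network" is specified.
hoi_head = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(hand-object interaction)
])

def hoi_probabilities(waveform: tf.Tensor) -> tf.Tensor:
    """Per-frame probability that the audio contains a hand-object interaction."""
    _scores, embeddings, _spectrogram = yamnet(waveform)  # embeddings: [frames, 1024]
    return tf.squeeze(hoi_head(embeddings), axis=-1)      # [frames]
```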

Key Results

Sample question-answer outcomes on QA-Ego4D (left) and HME-QA (right) using different video sampling strategies. EgoTrigger variants preserve the key visual content necessary for correct answers while significantly reducing the number of frames compared to full-frame or naively decimated baselines.
EgoTrigger variants (ET-1s and ET-Hyst.) lie close to the Pareto frontier, balancing QA accuracy with reduced frame count. Results are shown for both HME-QA and QA-Ego4D datasets, with dataset indicated by marker shape (circles for HME-QA, squares for QA-Ego4D) and estimated bitrate reflected by marker size. Outlined markers highlight our proposed methods. Variant ET-1s in particular reduces frame usage and subsequently bitrate while preserving accuracy close to full-frame baselines.
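For intuition, the sketch below shows one way a hysteresis-style gate (in the spirit of the ET-Hyst. variant) could turn per-frame classifier confidence into camera on/off decisions. The thresholds, hold length, and the hysteresis_trigger function are hypothetical illustrations, not the tuned logic evaluated in the paper.

```python
def hysteresis_trigger(frame_probs, on_thresh=0.7, off_thresh=0.3, hold_frames=3):
    """Illustrative hysteresis gate over per-frame HOI probabilities.

    Capture switches on when confidence rises above on_thresh and switches
    off only after confidence stays below off_thresh for hold_frames
    consecutive frames, avoiding rapid on/off toggling of the camera.
    """
    capturing = False
    frames_below = 0
    decisions = []
    for p in frame_probs:
        if not capturing:
            if p >= on_thresh:
                capturing = True
                frames_below = 0
        else:
            frames_below = frames_below + 1 if p < off_thresh else 0
            if frames_below >= hold_frames:
                capturing = False
        decisions.append(capturing)
    return decisions
```

With these illustrative thresholds, a brief dip in confidence does not immediately stop capture; the camera stays on until confidence remains low for several consecutive frames.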

Please refer to our pre-print for the full results, including ablation studies.

Additional Materials

We will release our code and access to the HME-QA dataset annotations through our GitHub repository.

BibTeX

@article{paruchuri2025egotrigger,
  title={EgoTrigger: Toward Audio-Driven Image Capture for Human Memory Enhancement in All-Day Energy-Efficient Smart Glasses},
  author={Paruchuri, Akshay and Hersek, Sinan and Aggarwal, Lavisha and Yang, Qiao and Liu, Xin and Kulshrestha, Achin and Colaco, Andrea and Fuchs, Henry and Chatterjee, Ishan},
  journal={arXiv preprint arXiv:2508.01915},
  year={2025}
}