Automated Feline Recognition: Object Detection for Smart Feeder Integration using YOLO11

As part of the CAS Deep Learning curriculum, this project explores the practical application of modern computer vision in a domestic environment. The primary objective was to develop a robust object detection model based on the YOLO11 architecture that uniquely identifies three specific subjects: Tigi, R.E.D., and Milou.

The long-term goal is the integration of this recognition system into a custom-engineered automated cat feeder. While the hardware remains in the conceptual phase, this research establishes the critical software foundation required for accurate subject identification and selective feeding.

Research Objectives and Practical Challenges

Beyond model implementation, this study sought to address several fundamental questions regarding real-world AI deployment:

  • Dataset Acquisition: Quantifying the effort required to curate a custom dataset and evaluating the utility of tools like Label Studio.
  • Transfer Learning on Custom Data: Assessing the complexity of adapting state-of-the-art architectures to niche, small-scale datasets.
  • Data Requirements: Determining the minimum volume of data necessary to achieve high-precision recognition of specific individuals.
  • Dataset Design Principles: Identifying critical environmental factors—such as lighting variance and background noise—that influence model robustness.
  • Optimization Strategies: Evaluating algorithmic interventions, such as weighted sampling, to resolve dataset flaws like class distribution imbalance.

Industry Applications and System Utility

Object detection represents one of the most transformative applications of Deep Learning, providing the sensory perception necessary for automated systems to interact with their environment. Key applications include:

  • Autonomous Systems: Real-time navigation and obstacle avoidance in vehicles and robotics.
  • Healthcare: Automated identification of pathologies within radiological imagery.
  • Industrial Automation: High-precision quality control and inventory management in logistics.
  • Smart Infrastructure: Advanced surveillance and behavioral pattern recognition in urban environments.

Dataset Characteristics and Development

The foundation of this study is a proprietary dataset comprising 844 images, meticulously curated to reflect the intended operational environment.

  • Data Source: Imagery captured with various iPhone models over a 15-year period (2010–2025).
  • Environmental Diversity: The set includes multi-subject frames, negative samples (background only), and adversarial “fake subjects” (stuffed animals) to test the model’s discriminative power.
  • Distribution: The final training pipeline utilized 675 images for training and 169 for validation.
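
The roughly 80/20 split described above can be reproduced with a short, deterministic script. This is a minimal sketch: the placeholder file names and the seed are illustrative, not the project's actual splitting code.

```python
import random

def split_dataset(image_paths, val_fraction=0.2, seed=42):
    """Shuffle deterministically, then split into (train, val) lists."""
    paths = sorted(image_paths)          # sort first so the split is reproducible
    random.Random(seed).shuffle(paths)
    n_val = round(len(paths) * val_fraction)
    return paths[n_val:], paths[:n_val]  # (train, val)

# Illustrative: 844 placeholder names stand in for the real image files.
train, val = split_dataset([f"img_{i:04d}.jpg" for i in range(844)])
print(len(train), len(val))  # 675 169
```

Sorting before the seeded shuffle matters: it makes the split independent of filesystem ordering, so the same train/validation assignment is recovered on any machine.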

Mitigating the Impact of Class Distribution Imbalance

The dataset exhibits significant class imbalance, a direct reflection of real-world subject availability:

  • Tigi: 289 images (High availability)
  • Milou: 205 images (Moderate availability)
  • R.E.D.: 82 images (Low availability / Minority class)
  • Negative Samples: 48 frames containing no primary subjects to minimize false positive detections.

Left unmitigated, this distribution would likely lead to a model biased toward the majority class (Tigi). Addressing this required specialized sampling techniques during the training phase.
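
One standard remedy is inverse-frequency weighting: each class receives a weight proportional to the reciprocal of its image count, so under-represented subjects contribute more per sample. A minimal sketch using the per-class counts listed above:

```python
# Inverse-frequency class weights from the per-class image counts above.
counts = {"Tigi": 289, "Milou": 205, "R.E.D.": 82}

total = sum(counts.values())
# Normalized so a perfectly balanced dataset would give every class weight 1.0.
weights = {cls: total / (len(counts) * n) for cls, n in counts.items()}

for cls, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{cls}: {w:.2f}")  # R.E.D. ends up weighted ~3.5x heavier than Tigi
```

With this scheme the minority class (R.E.D., weight ≈ 2.34) influences the loss far more per image than the majority class (Tigi, weight ≈ 0.66).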

System Architecture and Implementation Framework

The technical stack was designed for efficient training, comprehensive tracking, and eventual edge deployment:

  • Core Architecture: YOLO11, utilizing both Nano and Small variants.
  • Experiment Tracking: Weights & Biases (wandb) for real-time metric logging and version control.
  • Data Engineering: Custom Python scripts for multi-format processing (RAW/HEIF) and dataset splitting.
  • Algorithmic Balancing: Implementation of a custom YOLOWeightedDataset and WeightedTrainer to equalize class influence via frequency-based loss weighting.

Iterative Model Optimization: A Technical Analysis

The optimization process was conducted across five distinct phases, moving from a baseline configuration to a highly tuned deployment model. Each iteration was driven by empirical performance data, targeting specific weaknesses identified in the previous run.

Phase I: Establishing the Baseline (YOLO11n)

The first phase utilized the YOLO11 Nano variant, selected for its high efficiency and suitability for eventual edge-device deployment. Training was conducted over 20 epochs using default parameters. The model achieved a baseline mAP50 of 0.976 and an mAP50-95 of 0.841. While overall recall was high at 0.938, detailed analysis revealed a precision bottleneck for Milou (0.853), suggesting that the model was prone to false positives for this subject. Additionally, the minority class, R.E.D., showed the lowest localization accuracy with an mAP50-95 of 0.819.
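
A Phase I baseline run maps onto the Ultralytics API roughly as follows. The dataset YAML name is an assumption, and the import is deferred so the sketch stays importable without the package installed:

```python
# Phase I: baseline configuration (the dataset YAML name "cats.yaml" is assumed).
BASELINE = {
    "model": "yolo11n.pt",  # pretrained YOLO11 Nano weights
    "data": "cats.yaml",    # hypothetical YAML listing Tigi, R.E.D., Milou
    "epochs": 20,
    "imgsz": 640,           # Ultralytics' default training resolution
}

def train_baseline(cfg=BASELINE):
    """Requires `pip install ultralytics`; returns the training results object."""
    from ultralytics import YOLO  # deferred import keeps the sketch self-contained
    model = YOLO(cfg["model"])
    return model.train(data=cfg["data"], epochs=cfg["epochs"], imgsz=cfg["imgsz"])
```

All other hyperparameters are left at Ultralytics' defaults, matching the "default parameters" baseline described above.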

Phase II: Implementing Weighted Class Balancing

To rectify the discrepancies in subject performance, the second phase introduced a custom weighted data loader. This intervention calculated sampling probabilities based on class frequency, forcing the model to “attend” more to the minority subjects. This approach yielded immediate improvements for R.E.D., whose mAP50-95 score rose from 0.819 to 0.853. Precision for Milou also increased from 0.853 to 0.886, indicating fewer false identifications. However, this focused accuracy came at a slight cost to sensitivity; Tigi’s recall dropped from 0.923 to 0.897, highlighting the inherent trade-off between class fairness and majority-class performance.
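
The effect of the weighted loader can be illustrated outside the training loop: give each image a draw probability inversely proportional to its class frequency and sample with replacement. This is a simplified, single-label sketch; the actual WeightedTrainer hooks the same idea into the YOLO dataloader.

```python
import random
from collections import Counter

# Per-image class labels mirroring the dataset counts (single-label simplification).
labels = ["Tigi"] * 289 + ["Milou"] * 205 + ["R.E.D."] * 82

# Inverse-frequency weight per image, so each class is drawn roughly equally often.
freq = Counter(labels)
img_weights = [1.0 / freq[c] for c in labels]

rng = random.Random(0)
drawn = rng.choices(labels, weights=img_weights, k=10_000)
print(Counter(drawn))  # each class appears ~3,300 times despite the skewed counts
```

Each epoch thus presents the three subjects in near-equal proportion, at the cost of showing majority-class images (Tigi) less often per pass, which is consistent with the recall dip observed above.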

Phase III: Convergence through Extended Training

Having balanced the subject distribution, the third phase focused on achieving deeper convergence by extending the training duration from 20 to 40 epochs. This proved to be the most successful configuration. The overall mAP50 reached a peak of 0.987, a significant improvement over the previous 0.975. R.E.D.’s precision saw a substantial gain, moving from 0.92 to 0.947, while Milou’s recall improved from 0.95 to 0.983. This phase demonstrated that with a high-quality, weighted dataset, the Nano architecture is capable of nearly industrial-grade precision across all subjects if given sufficient time to converge.

Phase IV: Environmental Adaptation and Photometric Augmentation

The fourth phase shifted focus from raw metrics to real-world robustness. Since the eventual deployment involves a static camera, geometric augmentations such as rotation and shearing were disabled. Simultaneously, photometric augmentations (hue, saturation, and value) were maximized to simulate the extreme lighting variances the feeder will face. Interestingly, this “realistic” tuning led to a slight regression on the held-out validation set, with the overall mAP50-95 dipping from 0.875 to 0.86. However, Tigi achieved her highest recall yet (0.989), and the model likely gained robustness to field conditions that are not fully represented in the static validation split.
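
The Phase IV profile corresponds to Ultralytics' augmentation arguments roughly as sketched below; the exact magnitudes here are illustrative, not the values used in the run, and the dataset YAML name is again an assumption:

```python
# Phase IV: photometric-heavy, geometry-free augmentation profile (illustrative values).
AUGMENT = {
    "degrees": 0.0,  # no rotation: the feeder camera is fixed-mount
    "shear": 0.0,    # no shearing, for the same reason
    "hsv_h": 0.05,   # aggressive hue jitter
    "hsv_s": 0.9,    # aggressive saturation jitter
    "hsv_v": 0.9,    # aggressive brightness jitter, simulating lighting swings
}

def train_phase4():
    """Requires `pip install ultralytics`."""
    from ultralytics import YOLO  # deferred import, as in the baseline sketch
    return YOLO("yolo11n.pt").train(data="cats.yaml", epochs=40, **AUGMENT)
```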

Phase V: Architectural Scaling (Nano vs. Small)

The final phase evaluated whether increasing model capacity could overcome the remaining performance gaps. The Nano model was replaced with YOLO11 Small, raising the parameter count from 2.6M to 9.4M. This scaling significantly benefited the majority class, with Tigi reaching an exceptional mAP50-95 of 0.934 (versus the Nano’s 0.910). However, the added capacity hurt the minority class: R.E.D.’s mAP50-95 regressed to 0.823. This suggests that on a small, 844-image dataset, larger models begin to overfit to dominant patterns, reinforcing the conclusion that the Nano model remains the most efficient and balanced choice for this specific application.

Selection of the Optimal Model: Performance Synthesis

While several iterations yielded impressive metrics, the configuration developed in Phase III (Training Run 3) was selected as the “Best Model” for deployment. This decision was based on a holistic evaluation of the model’s ability to balance the needs of all three subjects while maintaining the efficiency required for real-time edge processing.

The Phase III model represents the global performance peak where the YOLO11 Nano architecture reached its maximum utility on this specific dataset. Unlike Phase I, which suffered from precision gaps, or Phase II, which prioritized accuracy at the expense of sensitivity, Phase III achieved a harmonious convergence. With both Precision and Recall exceeding the 95% threshold, the model demonstrates near-human accuracy in distinguishing between the cats. Furthermore, the mAP50-95 score of 0.875 indicates that the bounding boxes are not only accurate in identity but also highly precise in localization—a critical requirement for a feeder that needs to trigger serving mechanisms based on a subject’s proximity to a bowl.

Crucially, this model successfully resolved the “R.E.D. Problem.” By combining weighted sampling with extended training, the minority class performance was brought into parity with the majority classes, ensuring that the shyest cat receives the same recognition reliability as the most frequent visitor.

Final Performance Breakdown (Phase III Configuration)

Subject     Images   Instances   Box Precision   Box Recall   Box mAP50   Box mAP50-95
Combined    169      186         0.951           0.957        0.987       0.875
Milou       60       60          0.933           0.983        0.988       0.843
R.E.D.      38       38          0.947           0.945        0.982       0.872
Tigi        88       88          0.972           0.943        0.990       0.910

Technical Discussion and Synthesis

This research confirms several critical principles of modern computer vision:

  • Incremental Optimization: A hypothesis-driven approach to hyperparameters (epochs, weights, augmentations) is essential for developing domain-specific models.
  • Duration vs. Scale: On small datasets, increasing training duration on a compact architecture (Nano) often yields better generalization than simply increasing model size (Small).
  • Algorithmic Fairness: Weighted sampling is a mandatory intervention for real-world datasets where subject frequency is naturally skewed.
  • Augmentation Alignment: Augmentation strategies must be aligned with the physical constraints of the deployment hardware (e.g., a fixed-mount camera).

Post-Project Roadmap and Future Work

While the recognition model has reached production-level accuracy, several avenues for future development remain:

  • Iterative Dataset Refinement: Acquisition of higher-variance imagery for R.E.D. and Milou, specifically focusing on low-light and occluded scenarios.
  • Hyperparameter Optimization: Conduct a comprehensive grid search on learning rates and decay schedules using the Phase III configuration as a baseline.
  • Threshold Calibration: Implementation of class-specific confidence thresholds to further minimize false positives in the serving logic.
  • Hardware Integration: Deployment of the YOLO11 Nano model onto a Raspberry Pi platform, utilizing hardware acceleration to maintain real-time performance.
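
The threshold-calibration item above can be sketched as a simple post-filter over the detector's output. The per-class values below are placeholders to be calibrated on validation data, not tuned thresholds:

```python
# Post-filter detections with per-class confidence thresholds (placeholder values).
THRESHOLDS = {"Tigi": 0.60, "Milou": 0.70, "R.E.D.": 0.55}

def filter_detections(detections, thresholds=THRESHOLDS, default=0.50):
    """Keep only (class_name, confidence, box) tuples at or above their class threshold."""
    return [d for d in detections if d[1] >= thresholds.get(d[0], default)]

# Illustrative frame: Milou at 0.65 falls below her stricter 0.70 threshold.
frame = [("Tigi", 0.91, (10, 20, 200, 240)),
         ("Milou", 0.65, (300, 40, 460, 250)),
         ("R.E.D.", 0.58, (120, 60, 280, 300))]
kept = filter_detections(frame)
print([d[0] for d in kept])  # ['Tigi', 'R.E.D.']
```

A stricter threshold for Milou directly targets the false-positive tendency observed in the early training phases, while a looser one for R.E.D. preserves sensitivity for the minority class.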

The successful development of this recognition system fulfills the initial software requirements for the smart feeder project. The next stage will focus on the mechanical and electronic engineering required to bring the physical feeder to life.