Real-time Object Detection and Streaming with Python, OpenCV, AWS Kinesis, and YOLOv8

November 6, 2023November 4, 2023 ~ Gonzalo Ayuso ~ 2 Comments

I’ve got the following idea in my mind. I want to detect objects with YoloV8 and OpenCV. Here a simple example:

import cv2 as cv
from ultralytics import YOLO
import logging

logging.getLogger("ultralytics").setLevel(logging.WARNING)

logging.basicConfig(
    format='%(asctime)s [%(levelname)s] %(message)s',
    level='INFO',
    datefmt='%d/%m/%Y %X')

logger = logging.getLogger(__name__)

X_RESIZE = 800
Y_RESIZE = 450

YOLO_MODEL="yolov8m.pt"
YOLO_CLASSES_TO_DETECT = 'bottle',
THRESHOLD = 0.80

cap = cv.VideoCapture(0)
model = YOLO(YOLO_MODEL)
while True:
    _, original_frame = cap.read()
    frame = cv.resize(original_frame, (X_RESIZE, Y_RESIZE), interpolation=cv.INTER_LINEAR)

    results = model.predict(frame)

    color = (0, 255, 0)
    for result in results:
        for box in result.boxes:
            class_id = result.names[box.cls[0].item()]
            status = id == class_id
            cords = box.xyxy[0].tolist()
            probability = round(box.conf[0].item(), 2)

            if probability > THRESHOLD and class_id in YOLO_CLASSES_TO_DETECT:
                logger.info(f"{class_id} {round(probability * 100)}%")
                x1, y1, x2, y2 = [round(x) for x in cords]
                cv.rectangle(frame, (x1, y1), (x2, y2), color, 2)

    cv.imshow(f"Cam", frame)

    if cv.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv.destroyAllWindows()

However, my idea is slightly different. I intend to capture the camera stream, transmit it to AWS Kinesis, and subsequently, within a separate process, ingest this Kinesis stream for object detection or any other purpose using the camera frames. To achieve this, our initial step is to publish our OpenCV frames into Kinesis. producer.py

import logging

from lib.video import kinesis_producer
from settings import KINESIS_STREAM_NAME, CAM_URI, WIDTH

logging.basicConfig(
    format='%(asctime)s [%(levelname)s] %(message)s',
    level='INFO',
    datefmt='%d/%m/%Y %X')

logger = logging.getLogger(__name__)

if __name__ == '__main__':
    kinesis_producer(
        kinesis_stream_name=KINESIS_STREAM_NAME,
        cam_uri=CAM_URI,
        width=WIDTH
    )

def kinesis_producer(kinesis_stream_name, cam_uri, width):
    logger.info(f"start emitting stream from cam={cam_uri} to kinesis")
    kinesis = aws_get_service('kinesis')
    cap = cv.VideoCapture(cam_uri)
    try:
        while True:
            ret, frame = cap.read()
            if ret:
                scale = width / frame.shape[1]
                height = int(frame.shape[0] * scale)

                scaled_frame = cv.resize(frame, (width, height))
                _, img_encoded = cv.imencode('.jpg', scaled_frame)
                kinesis.put_record(
                    StreamName=kinesis_stream_name,
                    Data=img_encoded.tobytes(),
                    PartitionKey='1'
                )

    except KeyboardInterrupt:
        cap.release()

Now our consumer.py

import logging

from lib.video import kinesis_consumer
from settings import KINESIS_STREAM_NAME

logging.getLogger("ultralytics").setLevel(logging.WARNING)

logging.basicConfig(
    format='%(asctime)s [%(levelname)s] %(message)s',
    level='INFO',
    datefmt='%d/%m/%Y %X')

logger = logging.getLogger(__name__)

if __name__ == '__main__':
    kinesis_consumer(
        kinesis_stream_name=KINESIS_STREAM_NAME
    )

To get the frames from kinesis stream we need to consider the shards. According to your needs you should change it a little bit.

def kinesis_consumer(kinesis_stream_name):
    logger.info(f"start reading kinesis stream from={kinesis_stream_name}")
    kinesis = aws_get_service('kinesis')
    model = YOLO(YOLO_MODEL)
    response = kinesis.describe_stream(StreamName=kinesis_stream_name)
    shard_id = response['StreamDescription']['Shards'][0]['ShardId']
    while True:
        shard_iterator_response = kinesis.get_shard_iterator(
            StreamName=kinesis_stream_name,
            ShardId=shard_id,
            ShardIteratorType='LATEST'
        )
        shard_iterator = shard_iterator_response['ShardIterator']
        response = kinesis.get_records(
            ShardIterator=shard_iterator,
            Limit=10
        )

        for record in response['Records']:
            image_data = record['Data']
            frame = cv.imdecode(np.frombuffer(image_data, np.uint8), cv.IMREAD_COLOR)
            results = model.predict(frame)
            if detect_id_in_results(results):
                logger.info('Detected')

def detect_id_in_results(results):
    status = False
    for result in results:
        for box in result.boxes:
            class_id = result.names[box.cls[0].item()]
            probability = round(box.conf[0].item(), 2)
            if probability > THRESHOLD and class_id in YOLO_CLASSES_TO_DETECT:
                status = True
    return status

And that’s all. Real-time Object Detection with Kinesis stream. However, computer vision examples are often simple and quite straightforward. On the other hand, real-world computer vision projects tend to be more intricate, involving considerations such as cameras, lighting conditions, performance optimization, and accuracy. But remember, this is just an example.

Full code in my github.

Happy logins. Only the happy user will pass

May 7, 2018 ~ Gonzalo Ayuso ~ 1 Comment

Login forms are bored. In this example we’re going to create an especial login form. Only for happy users. Happiness is something complicated, but at least, one smile is more easy to obtain, and all is better with one smile :). Our login form will only appear if the user smiles. Let’s start.

I must admit that this project is just an excuse to play with different technologies that I wanted to play. Weeks ago I discovered one library called face_classification. With this library I can perform emotion classification from a picture. The idea is simple. We create RabbitMQ RPC server script that answers with the emotion of the face within a picture. Then we obtain on frame from the video stream of the webcam (with HTML5) and we send this frame using websocket to a socket.io server. This websocket server (node) ask to the RabbitMQ RPC the emotion and it sends back to the browser the emotion and a the original picture with a rectangle over the face.

Frontend

As well as we’re going to use socket.io for websockets we will use the same script to serve the frontend (the login and the HTML5 video capture)

&lt;!doctype html&gt;
&lt;html&gt;
&lt;head&gt;
    &lt;title&gt;Happy login&lt;/title&gt;
    &lt;link rel=&quot;stylesheet&quot; href=&quot;css/app.css&quot;&gt;
&lt;/head&gt;
&lt;body&gt;

&lt;div id=&quot;login-page&quot; class=&quot;login-page&quot;&gt;
    &lt;div class=&quot;form&quot;&gt;
        &lt;h1 id=&quot;nonHappy&quot; style=&quot;display: block;&quot;&gt;Only the happy user will pass&lt;/h1&gt;
        &lt;form id=&quot;happyForm&quot; class=&quot;login-form&quot; style=&quot;display: none&quot; onsubmit=&quot;return false;&quot;&gt;
            &lt;input id=&quot;user&quot; type=&quot;text&quot; placeholder=&quot;username&quot;/&gt;
            &lt;input id=&quot;pass&quot; type=&quot;password&quot; placeholder=&quot;password&quot;/&gt;
            &lt;button id=&quot;login&quot;&gt;login&lt;/button&gt;
            &lt;p&gt;&lt;/p&gt;
            &lt;img id=&quot;smile&quot; width=&quot;426&quot; height=&quot;320&quot; src=&quot;&quot;/&gt;
        &lt;/form&gt;
        &lt;div id=&quot;video&quot;&gt;
            &lt;video style=&quot;display:none;&quot;&gt;&lt;/video&gt;
            &lt;canvas id=&quot;canvas&quot; style=&quot;display:none&quot;&gt;&lt;/canvas&gt;
            &lt;canvas id=&quot;canvas-face&quot; width=&quot;426&quot; height=&quot;320&quot;&gt;&lt;/canvas&gt;
        &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;

&lt;div id=&quot;private&quot; style=&quot;display: none;&quot;&gt;
    &lt;h1&gt;Private page&lt;/h1&gt;
&lt;/div&gt;

&lt;script src=&quot;https://code.jquery.com/jquery-3.2.1.min.js&quot; integrity=&quot;sha256-hwg4gsxgFZhOsEEamdOYGBf13FyQuiTwlAQgxVSNgt4=&quot; crossorigin=&quot;anonymous&quot;&gt;&lt;/script&gt;
&lt;script src=&quot;https://unpkg.com/sweetalert/dist/sweetalert.min.js&quot;&gt;&lt;/script&gt;
&lt;script type=&quot;text/javascript&quot; src=&quot;/socket.io/socket.io.js&quot;&gt;&lt;/script&gt;
&lt;script type=&quot;text/javascript&quot; src=&quot;/js/app.js&quot;&gt;&lt;/script&gt;
&lt;/body&gt;
&lt;/html&gt;

Here we’ll connect to the websocket and we’ll emit the webcam frame to the server. We´ll also be listening to one event called ‘response’ where server will notify us when one emotion has been detected.

let socket = io.connect(location.origin),
    img = new Image(),
    canvasFace = document.getElementById('canvas-face'),
    context = canvasFace.getContext('2d'),
    canvas = document.getElementById('canvas'),
    width = 640,
    height = 480,
    delay = 1000,
    jpgQuality = 0.6,
    isHappy = false;

socket.on('response', function (r) {
    let data = JSON.parse(r);
    if (data.length &gt; 0 &amp;&amp; data[0].hasOwnProperty('emotion')) {
        if (isHappy === false &amp;&amp; data[0]['emotion'] === 'happy') {
            isHappy = true;
            swal({
                title: &quot;Good!&quot;,
                text: &quot;All is better with one smile!&quot;,
                icon: &quot;success&quot;,
                buttons: false,
                timer: 2000,
            });

            $('#nonHappy').hide();
            $('#video').hide();
            $('#happyForm').show();
            $('#smile')[0].src = 'data:image/png;base64,' + data[0].image;
        }

        img.onload = function () {
            context.drawImage(this, 0, 0, canvasFace.width, canvasFace.height);
        };

        img.src = 'data:image/png;base64,' + data[0].image;
    }
});

navigator.getMedia = (navigator.getUserMedia || navigator.webkitGetUserMedia || navigator.mozGetUserMedia);

navigator.getMedia({video: true, audio: false}, (mediaStream) =&gt; {
    let video = document.getElementsByTagName('video')[0];
    video.src = window.URL.createObjectURL(mediaStream);
    video.play();
    setInterval(((video) =&gt; {
        return function () {
            let context = canvas.getContext('2d');
            canvas.width = width;
            canvas.height = height;
            context.drawImage(video, 0, 0, width, height);
            socket.emit('img', canvas.toDataURL('image/jpeg', jpgQuality));
        }
    })(video), delay)
}, error =&gt; console.log(error));

$(() =&gt; {
    $('#login').click(() =&gt; {
        $('#login-page').hide();
        $('#private').show();
    })
});

Backend
Finally we’ll work in the backend. Basically I’ve check the examples that we can see in face_classification project and tune it a bit according to my needs.

from rabbit import builder
import logging
import numpy as np
from keras.models import load_model
from utils.datasets import get_labels
from utils.inference import detect_faces
from utils.inference import draw_text
from utils.inference import draw_bounding_box
from utils.inference import apply_offsets
from utils.inference import load_detection_model
from utils.inference import load_image
from utils.preprocessor import preprocess_input
import cv2
import json
import base64

detection_model_path = 'trained_models/detection_models/haarcascade_frontalface_default.xml'
emotion_model_path = 'trained_models/emotion_models/fer2013_mini_XCEPTION.102-0.66.hdf5'
emotion_labels = get_labels('fer2013')
font = cv2.FONT_HERSHEY_SIMPLEX

# hyper-parameters for bounding boxes shape
emotion_offsets = (20, 40)

# loading models
face_detection = load_detection_model(detection_model_path)
emotion_classifier = load_model(emotion_model_path, compile=False)

# getting input model shapes for inference
emotion_target_size = emotion_classifier.input_shape[1:3]


def format_response(response):
    decoded_json = json.loads(response)
    return &quot;Hello {}&quot;.format(decoded_json['name'])


def on_data(data):
    f = open('current.jpg', 'wb')
    f.write(base64.decodebytes(data))
    f.close()
    image_path = &quot;current.jpg&quot;

    out = []
    # loading images
    rgb_image = load_image(image_path, grayscale=False)
    gray_image = load_image(image_path, grayscale=True)
    gray_image = np.squeeze(gray_image)
    gray_image = gray_image.astype('uint8')

    faces = detect_faces(face_detection, gray_image)
    for face_coordinates in faces:
        x1, x2, y1, y2 = apply_offsets(face_coordinates, emotion_offsets)
        gray_face = gray_image[y1:y2, x1:x2]

        try:
            gray_face = cv2.resize(gray_face, (emotion_target_size))
        except:
            continue

        gray_face = preprocess_input(gray_face, True)
        gray_face = np.expand_dims(gray_face, 0)
        gray_face = np.expand_dims(gray_face, -1)
        emotion_label_arg = np.argmax(emotion_classifier.predict(gray_face))
        emotion_text = emotion_labels[emotion_label_arg]
        color = (0, 0, 255)

        draw_bounding_box(face_coordinates, rgb_image, color)
        draw_text(face_coordinates, rgb_image, emotion_text, color, 0, -50, 1, 2)
        bgr_image = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2BGR)

        cv2.imwrite('predicted.png', bgr_image)
        data = open('predicted.png', 'rb').read()
        encoded = base64.encodebytes(data).decode('utf-8')
        out.append({
            'image': encoded,
            'emotion': emotion_text,
        })

    return out

logging.basicConfig(level=logging.WARN)
rpc = builder.rpc(&quot;image.check&quot;, {'host': 'localhost', 'port': 5672})
rpc.server(on_data)

Here you can see in action the working prototype

Maybe we can do the same with another tools and even more simple but as I said before this example is just an excuse to play with those technologies:

Send webcam frames via websockets
Connect one web application to a Pyhon application via RabbitMQ RPC
Play with face classification script

Please don’t use this script in production. It’s just a proof of concepts. With smiles but a proof of concepts 🙂

You can see the project in my github account

Opencv and esp32 experiment. Moving a servo with my face alignment

April 9, 2018 ~ Gonzalo Ayuso ~ Leave a comment

One saturday morning I was having a breakfast and I discovered face_recognition project. I started to play with the opencv example. I put my picture and, Wow! It works like a charm. It’s pretty straightforward to detect my face and also I can obtain the face landmarks. One of the landmark that I can get is the nose tip. Playing with this script I realized that with the nose tip I can determine the position of the face. I can see if my face is align to the center or if I move it to one side. As well as I have a new iot device (one ESP32) I wanted to do something with it. For example control a servo (SG90) and moving it from left to right depending on my face position.

First we have the main python script. With this script I detect my face, the nose tip and the position of my face. With this position I will emit an event to a mqtt broker (a mosquitto server running on my laptop).

import face_recognition
import cv2
import numpy as np
import math
import paho.mqtt.client as mqtt

video_capture = cv2.VideoCapture(0)

gonzalo_image = face_recognition.load_image_file(&quot;gonzalo.png&quot;)
gonzalo_face_encoding = face_recognition.face_encodings(gonzalo_image)[0]

known_face_encodings = [
    gonzalo_face_encoding
]
known_face_names = [
    &quot;Gonzalo&quot;
]

RED = (0, 0, 255)
GREEN = (0, 255, 0)
BLUE = (255, 0, 0)

face_locations = []
face_encodings = []
face_names = []
process_this_frame = True
status = ''
labelColor = GREEN

client = mqtt.Client()
client.connect(&quot;localhost&quot;, 1883, 60)

while True:
    ret, frame = video_capture.read()

    # Resize frame of video to 1/4 size for faster face recognition processing
    small_frame = cv2.resize(frame, (0, 0), fx=0.25, fy=0.25)

    # Convert the image from BGR color (which OpenCV uses) to RGB color (which face_recognition uses)
    rgb_small_frame = small_frame[:, :, ::-1]

    face_locations = face_recognition.face_locations(rgb_small_frame)
    face_encodings = face_recognition.face_encodings(rgb_small_frame, face_locations)
    face_landmarks_list = face_recognition.face_landmarks(rgb_small_frame, face_locations)

    face_names = []
    for face_encoding, face_landmarks in zip(face_encodings, face_landmarks_list):
        matches = face_recognition.compare_faces(known_face_encodings, face_encoding)
        name = &quot;Unknown&quot;

        if True in matches:
            first_match_index = matches.index(True)
            name = known_face_names[first_match_index]

            nose_tip = face_landmarks['nose_tip']
            maxLandmark = max(nose_tip)
            minLandmark = min(nose_tip)

            diff = math.fabs(maxLandmark[1] - minLandmark[1])
            if diff &lt; 2:
                status = &quot;center&quot;
                labelColor = BLUE
                client.publish(&quot;/face/{}/center&quot;.format(name), &quot;1&quot;)
            elif maxLandmark[1] &gt; minLandmark[1]:
                status = &quot;&gt;&gt;&gt;&gt;&quot;
                labelColor = RED
                client.publish(&quot;/face/{}/left&quot;.format(name), &quot;1&quot;)
            else:
                status = &quot;&lt;&lt;&lt;&lt;&quot;
                client.publish(&quot;/face/{}/right&quot;.format(name), &quot;1&quot;)
                labelColor = RED

            shape = np.array(face_landmarks['nose_bridge'], np.int32)
            cv2.polylines(frame, [shape.reshape((-1, 1, 2)) * 4], True, (0, 255, 255))
            cv2.fillPoly(frame, [shape.reshape((-1, 1, 2)) * 4], GREEN)

        face_names.append(&quot;{} {}&quot;.format(name, status))

    for (top, right, bottom, left), name in zip(face_locations, face_names):
        # Scale back up face locations since the frame we detected in was scaled to 1/4 size
        top *= 4
        right *= 4
        bottom *= 4
        left *= 4

        if 'Unknown' not in name.split(' '):
            cv2.rectangle(frame, (left, top), (right, bottom), labelColor, 2)
            cv2.rectangle(frame, (left, bottom - 35), (right, bottom), labelColor, cv2.FILLED)
            cv2.putText(frame, name, (left + 6, bottom - 6), cv2.FONT_HERSHEY_DUPLEX, 1.0, (255, 255, 255), 1)
        else:
            cv2.rectangle(frame, (left, top), (right, bottom), BLUE, 2)

    cv2.imshow('Video', frame)

    if cv2.waitKey(1) &amp; 0xFF == ord('q'):
        break

video_capture.release()
cv2.destroyAllWindows()

Now another Python script will be listening to mqtt events and it will trigger one event with the position of the servo. I know that this second Python script maybe is unnecessary. We can move its logic to esp32 and main opencv script, but I was playing with mqtt and I wanted to decouple it a little bit.

import paho.mqtt.client as mqtt

class Iot:
    _state = None
    _client = None
    _dict = {
        'left': 0,
        'center': 1,
        'right': 2
    }

    def __init__(self, client):
        self._client = client

    def emit(self, name, event):
        if event != self._state:
            self._state = event
            self._client.publish(&quot;/servo&quot;, self._dict[event])
            print(&quot;emit /servo envent with value {} - {}&quot;.format(self._dict[event], name))


def on_message(topic, iot):
    data = topic.split(&quot;/&quot;)
    name = data[2]
    action = data[3]
    iot.emit(name, action)


client = mqtt.Client()
iot = Iot(client)

client.on_connect = lambda self, mosq, obj, rc: self.subscribe(&quot;/face/#&quot;)
client.on_message = lambda client, userdata, msg: on_message(msg.topic, iot)

client.connect(&quot;localhost&quot;, 1883, 60)
client.loop_forever()

And finally the ESP32. Here will connect to my wifi and to my mqtt broker.

#include &lt;WiFi.h&gt;
#include &lt;PubSubClient.h&gt;

#define LED0 17
#define LED1 18
#define LED2 19
#define SERVO_PIN 5

// wifi configuration
const char* ssid = &quot;my_ssid&quot;;
const char* password = &quot;my_wifi_password&quot;;
// mqtt configuration
const char* server = &quot;192.168.1.111&quot;; // mqtt broker ip
const char* topic = &quot;/servo&quot;;
const char* clientName = &quot;com.gonzalo123.esp32&quot;;

int channel = 1;
int hz = 50;
int depth = 16;

WiFiClient wifiClient;
PubSubClient client(wifiClient);

void wifiConnect() {
  Serial.print(&quot;Connecting to &quot;);
  Serial.println(ssid);

  WiFi.begin(ssid, password);

  while (WiFi.status() != WL_CONNECTED) {
    delay(500);
    Serial.print(&quot;*&quot;);
  }

  Serial.print(&quot;WiFi connected: &quot;);
  Serial.println(WiFi.localIP());
}

void mqttReConnect() {
  while (!client.connected()) {
    Serial.print(&quot;Attempting MQTT connection...&quot;);
    if (client.connect(clientName)) {
      Serial.println(&quot;connected&quot;);
      client.subscribe(topic);
    } else {
      Serial.print(&quot;failed, rc=&quot;);
      Serial.print(client.state());
      Serial.println(&quot; try again in 5 seconds&quot;);
      delay(5000);
    }
  }
}

void callback(char* topic, byte* payload, unsigned int length) {
  Serial.print(&quot;Message arrived [&quot;);
  Serial.print(topic);

  String data;
  for (int i = 0; i &lt; length; i++) {
    data += (char)payload[i];
  }

  int value = data.toInt();
  cleanLeds();
  switch (value)  {
    case 0:
      ledcWrite(1, 3400);
      digitalWrite(LED0, HIGH);
      break;
    case 1:
      ledcWrite(1, 4900);
      digitalWrite(LED1, HIGH);
      break;
    case 2:
      ledcWrite(1, 6400);
      digitalWrite(LED2, HIGH);
      break;
  }
  Serial.print(&quot;] value:&quot;);
  Serial.println((int) value);
}

void cleanLeds() {
  digitalWrite(LED0, LOW);
  digitalWrite(LED1, LOW);
  digitalWrite(LED2, LOW);
}

void setup() {
  Serial.begin(115200);

  ledcSetup(channel, hz, depth);
  ledcAttachPin(SERVO_PIN, channel);

  pinMode(LED0, OUTPUT);
  pinMode(LED1, OUTPUT);
  pinMode(LED2, OUTPUT);
  cleanLeds();
  wifiConnect();
  client.setServer(server, 1883);
  client.setCallback(callback);

  delay(1500);
}

void loop()
{
  if (!client.connected()) {
    mqttReConnect();
  }

  client.loop();
  delay(100);
}

Here a video with the working prototype in action

The source code is available in my github account.

Tracking blue objects with Opencv and Python

November 20, 2017 ~ Gonzalo Ayuso ~ Leave a comment

Opencv is an amazing Open Source Computer Vision Library. Today We’re going to hack a little bit with it. The idea is track blue objects. Why blue objects? Maybe because I’ve got a couple of them in my desk. Let’s start.

The idea is simple. We’ll create a mask. Our mask is a black and white image where each blue pixel will turn into a white one and the rest of pixels will be black.

Original frame:

Masked one:

Now we only need put a bounding rectangle around the blue object.

import cv2
import numpy

cam = cv2.VideoCapture(0)
kernel = numpy.ones((5 ,5), numpy.uint8)

while (True):
    ret, frame = cam.read()
    rangomax = numpy.array([255, 50, 50]) # B, G, R
    rangomin = numpy.array([51, 0, 0])
    mask = cv2.inRange(frame, rangomin, rangomax)
    # reduce the noise
    opening = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

    x, y, w, h = cv2.boundingRect(opening)

    cv2.rectangle(frame, (x, y), (x+w, y + h), (0, 255, 0), 3)
    cv2.circle(frame, (x+w/2, y+h/2), 5, (0, 0, 255), -1)

    cv2.imshow('camera', frame)

    k = cv2.waitKey(1) &amp; 0xFF

    if k == 27:
        break

And that’s all. A nice hack for a Sunday morning

Source code in my github account