In Sweden, one in four people die alone, and in Spain, with the population pyramid completely inverted (now looking more like a spinning top), it’s only a matter of time before this happens here as well: neighbors calling the police because of a bad smell, only to find a body that needs to be removed. A striking case is that of Mayra Gómez Kemp (the host of “Un, dos, tres”), who spent 20 hours lying on the floor until someone finally noticed and called for help. The first 10 hours must have been hard, but the last 10, thinking you’re going to die lying there, must have been even worse. We’re going to explore the possibility of creating an agent that roams around the house and raises an alert if it sees someone lying on the floor, triggering an actuator that presses the Red Cross emergency button.
Beyond the Uncanny Valley
Would you leave a humanoid robot switched on in the kitchen surrounded by knives and go to sleep peacefully? Not everyone would.
But there’s one platform we’re already comfortable with, one that doesn’t scare us: the Roomba.
For elderly people, a platform like the Enabot, capable of withstanding kicks and cane hits, would be more suitable, but that’s for later.
Proof of Concept
We’ll use a recycled Roomba 620 with a video camera mounted on top.
The Roomba is connected to an ESP32 Wroom with a magnetometer (AKA compass) and an accelerometer. The technical hardware details and schematics will be explained in another post.
If we want this to be a household appliance suitable for any home, we cannot depend on subscriptions, which may be subject to arbitrary regulations created by arbitrary politicians. So the model has to run on-prem.
I’ve used Ollama as the LLM server, and after searching through the model catalog, I found none that can run in 16GB with both vision and tools. After spending quite some time with gemma3, which does have vision, I realized it can return responses in JSON format. With the right prompt, we can parse the chat output and use it to control the robot.
Model Server
Hardware
For this use case, a CPU won’t do, since we’re working in a real-time environment and inference on CPU with image processing is absurdly slow.
We need a GPU capable of handling the model. I bought an NVIDIA RTX 4060 Ti with 16GB of VRAM, which is enough to run gemma3:12b, along with a motherboard, processor, memory, and power supply chosen to support the GPU properly.
The case is from a contest I won back in university, which I’m fond of, and I’ve adapted it for modern motherboards. :-D
Software
I’ve already explained in another post how to set up the Ollama server on Ubuntu, so here’s a quick copy-paste recap.
Kubernetes:
```bash
sudo snap install kubectl --classic
sudo snap install helm --classic
sudo snap install microk8s --classic
mkdir ~/.kube
microk8s.config > ~/.kube/config
sudo microk8s.enable dns
sudo microk8s.enable ingress
sudo microk8s.enable hostpath-storage
sudo microk8s.enable nvidia
```
Ollama client:
```bash
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl stop ollama
sudo systemctl disable ollama
export OLLAMA_HOST=http://ollama.127-0-0-1.nip.io
```
Ollama server:
echo """ ollama: gpu: enabled: true type: nvidia number: 1 ingress: enabled: true hosts: - host: ollama.127-0-0-1.nip.io paths: - path: / pathType: Prefix persistentVolume: enabled: true """ > values.yaml helm upgrade --install \ ollama \ --repo https://otwld.github.io/ollama-helm/ \ --namespace ollama \ --create-namespace \ --wait \ --values values.yaml \ ollama rm values.yaml
Once installed, we download the model:
```bash
export OLLAMA_HOST=http://ollama.127-0-0-1.nip.io
ollama pull gemma3:12b
```
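Before wiring it into the robot, it is worth checking that the server answers through the ingress and that gemma3:12b actually loads. A minimal sanity check with the `ollama` Python client (assuming `pip install ollama` and the OLLAMA_HOST exported above):

```python
# Quick sanity check: one short chat round-trip through the ingress.
# Relies on the OLLAMA_HOST environment variable exported above.
import ollama

response = ollama.chat(
    model="gemma3:12b",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
)
print(response.message.content)  # should print something like "ready"
```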
Agent Software
All code is in this repository.
Input Data
We have a directory with camera images taken once per second, and an events file with the readings the ESP32 Wroom sends; a simple Flask program receives them and stores them in a shared volume.
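The receiver side can be very small. Below is a minimal sketch, not the repository’s actual code: the `/event` route, the `EVENTS_FILE` path, and the JSON-lines format are assumptions made for illustration.

```python
# Hypothetical minimal receiver: the ESP32 POSTs a JSON event, and we append it
# to a file on the shared volume so the agent loop can pick up the latest reading.
import json
from flask import Flask, request

app = Flask(__name__)
EVENTS_FILE = "/shared/events.jsonl"  # assumed path on the shared volume

@app.route("/event", methods=["POST"])
def receive_event():
    event = request.get_json(force=True)
    with open(EVENTS_FILE, "a") as f:
        f.write(json.dumps(event) + "\n")
    return {"status": "ok"}

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```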
Interaction with the LLM
The prompt instructs the LLM with input data, possible actions, and the expected response format:
PROMPT = """ You are the brain of an autonomous robot. Your task is to decide the next action based on: 1. **Camera image** (provided directly as part of the context). 2. **Sensors data**: distances, obstacles, battery, temperature, tilt, etc. 3. **Current goal**: what you are trying to achieve right now. ### Possible actions - `forward` (distance in cm) - `backward` (distance in cm) - `turn_left` (angle in degrees) - `turn_right` (angle in degrees) - `dock` (return to charging base) - `stop` (stop movement) - `start` (start movement) ### Response format Respond **only** in valid JSON (no extra text) with the following structure: {{ "steps": [ {{ "": }} ], "goal": " ", "thoughts": " " "description": " " }} """
The main loop is very simple: grab sensor data, find the closest image, query the LLM, and execute the instructions it provides:
```python
while True:
    sensors = get_latest_event(events_file=EVENTS_FILE)
    image_path = find_closest_image_path(
        image_dir=IMAGES_DIRECTORY, target_time=sensors["datetime"]
    )
    response = query_llm(sensors, image_path, current_goal)

    instructions = {}
    instructions["steps"] = response["steps"]
    execute_instructions(instructions)
```
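`get_latest_event` and `find_closest_image_path` are simple file lookups. A sketch of how they could work, assuming events are stored as JSON lines with a "datetime" field and images are named after their capture timestamp (the repository may do this differently):

```python
# Hypothetical helpers: assume JSON-lines events and image files named like
# "20250101-120000.jpg", one per second, with "datetime" using the same format.
import json
import os
from datetime import datetime

TIME_FORMAT = "%Y%m%d-%H%M%S"  # assumed timestamp format

def get_latest_event(events_file):
    # The last line of the events file is the most recent sensor reading
    with open(events_file) as f:
        return json.loads(f.readlines()[-1])

def find_closest_image_path(image_dir, target_time):
    # Pick the image whose filename timestamp is closest to the event time
    target = datetime.strptime(target_time, TIME_FORMAT)

    def distance(name):
        stamp = datetime.strptime(os.path.splitext(name)[0], TIME_FORMAT)
        return abs((stamp - target).total_seconds())

    closest = min(os.listdir(image_dir), key=distance)
    return os.path.join(image_dir, closest)
```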
The trickiest part is the `query_llm` function: since the model doesn’t support tools, we parse the chat output to extract the JSON:
```python
def query_llm(sensors_data, image_path, current_goal, PROMPT=PROMPT):
    composed_prompt = PROMPT.format(sensors_data=sensors_data, current_goal=current_goal)
    response = ollama.chat(
        model="gemma3:12b",
        messages=[
            {
                "role": "user",
                "content": composed_prompt,
                "images": [image_path]
            }
        ]
    )

    # Extract only the JSON block from the answer
    raw = response.message.content.splitlines()
    inside_block = False
    filtered_lines = []
    for line in raw:
        if line.strip().startswith("```"):
            inside_block = not inside_block
            continue
        if inside_block:
            filtered_lines.append(line)

    json_str = "\n".join(filtered_lines).strip()
    parsed = json.loads(json_str)
    return parsed
```
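`execute_instructions` is not shown here, since how the steps reach the Roomba depends on the ESP32 firmware. As a rough sketch, assuming a hypothetical HTTP endpoint on the ESP32 that accepts one command per request:

```python
# Rough sketch only: assumes the ESP32 exposes a hypothetical /command endpoint
# taking an action name and a value; the real interface may well differ.
import requests

ESP32_URL = "http://esp32.local/command"  # assumed address of the robot controller

def execute_instructions(instructions):
    for step in instructions["steps"]:
        # Each step is a single-key dict such as {"forward": 50} or {"turn_left": 30}
        for action, value in step.items():
            requests.post(ESP32_URL, json={"action": action, "value": value}, timeout=5)
```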
And the result is:
Surprisingly, the model has NEVER failed to return a properly formatted JSON.
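Even so, nothing in the loop needs to depend on that luck; a small retry wrapper around `query_llm` is cheap insurance. A sketch:

```python
# Optional safety net (not in the original loop): retry if the model ever
# returns something json.loads cannot parse, then fall back to stopping.
import json

def query_llm_with_retry(sensors_data, image_path, current_goal, retries=3):
    for _ in range(retries):
        try:
            return query_llm(sensors_data, image_path, current_goal)
        except json.JSONDecodeError:
            continue
    return {"steps": [{"stop": 0}], "goal": current_goal, "thoughts": "", "description": ""}
```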
Lessons Learned for the Next Iteration
Spatial vision is crucial at short distances; otherwise more sensors are needed. When the robot gets close to a wall, the LLM interprets the photo as blurry and uninformative.
To rely on images alone, two cameras are necessary, and the model must be trained with a composite image so it learns depth perception.
The camera needs about 180-degree coverage to fully understand its environment, even if the images get distorted.
Sensor data is sent too slowly. In this prototype, it could be improved by fetching all sensor data simultaneously.
Who Wants to Join Me on This Journey?
This project could help address a very real problem affecting millions of people — one that nobody seems to care about.
And I don’t mean so much for myself, since I have three kids and hope at least one of them will look after me. ;-)