Log by log: See how we build reCamera V2.0! Platform benchmarks, CAD iterations, deep debug dives. Open build with an engineer’s eye view!
🤔 How did this AI camera series, which we call reCamera, come to be? The following explanation should make it clear.
⛳Our vision is clear: to build a comprehensive matrix of AI cameras that meets diverse real-world needs. Last year, we took a significant step forward by launching reCamera 2002—an open-source AI camera reference design hailed as "the shortest pathway to a market-ready AI camera."
From its earliest development stages, reCamera 2002 (please find a comprehensive intro below👇) was shaped by invaluable feedback and suggestions from our vibrant community. It was your collective voice that helped us craft this beloved first-generation device. 💚
Therefore, as we began planning the next evolution of reCamera and the specific construction of our broader AI camera matrix, we knew we wanted to share everything with the community, embracing a radical open-source spirit! 🫶
This is why we're launching "Peek Under the Hood: How to Build an AI Camera?" – an ongoing series where we open-source and continuously update every detail and critical step involved in building our AI cameras, right from the very start.
Through this open-source journey, we aim to:
Join us as we openly pioneer the future of accessible, powerful AI Vision hardware—together! 👐
💡 Why did we make this first-gen reCamera?
Today, as processors (both SoCs and MCUs) become smaller and more powerful, it is now practical to pair the processor directly with camera sensors. In fact, many IPCs (IP cameras) already use this design to accelerate AI detection on edge devices.
So today, we introduce reCamera, an open-source camera platform for everyone to play with. We have divided this project into 3 interchangeable parts:
This design allows users to easily change the interfaces and camera sensors to suit their needs. We hope that these components can be freely combined in any way.
By building this hardware platform and ecosystem, we enable other applications to grow on this platform without the need to worry about changing from one platform to another in the future.
The engineering focus on modularity, high performance, and ease of use ensures that reCamera serves as a powerful platform for developers and makers. This design streamlines complex hardware processes, allowing users to integrate vision AI capabilities into their systems efficiently and creatively.
We've taken care of the intricate hardware work, freeing up time for user innovation. The modular design lets users rapidly swap cameras and customize interfaces, cutting development time from months to just weeks!
After several days of waiting, both the PCB and the 3D-printed parts have arrived. All the components are shown in the figure below. We have pre-soldered the PCB and connected the components to the adapter board. Now let's assemble them.
First, fix the main body of the head to the main board with screws.
Then place them in the groove of the handle front cover.
Plug the adapter board into the mainboard; place the 5-key button first, then the knob, and finally the battery.
Then fit the head cover: install the magnets in the magnet slots first, then insert the cover's protrusion into the opening in the mainboard adapter plate and press it down gently.
Finally, cover the handle back cover.
Completed.
Assembly video:
Overall, after all the components are installed into the shell, they are relatively stable, which indicates that our shell design is acceptable. Next, we will try to make the first demo: Camera + ranging module.
After selecting the hardware for the SensorCam P4 last week, we found that the connector broken out from the main board sits at the bottom-right corner of the board. What we want, however, is a plug-and-play structure: if a sensor module is plugged in at that original position, it will not be centered relative to the screen, and the data it captures will be skewed. We therefore decided to make an adapter board that brings the pins to the center of the screen so that more accurate data can be obtained. The schematic is as follows:
Here, U1 is a 40-pin header that connects to the interface broken out from the main board. U2, U3, and U4 are headers placed at the center of the screen; a three-sided layout is used to improve stability when a module is plugged in. U5 is the interface for the buttons and U6 the interface for the knob. Pins 4, 5, and 32 are nominally fixed pins, but their levels are not actually fixed: each module ties them to a different combination of levels, so they double as module-identification pins. The main controller reads these three pins to recognize which module is inserted and shows the corresponding UI and data. With three pins, SensorCam P4 can automatically identify up to 8 modules, and more pins can be added for further expansion. Below is the 3D model of the adapter board:
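As a rough illustration of how the main controller could turn the three ID-pin levels into a module type, here is a minimal Python sketch. Only the pin numbers (4, 5, 32) and the two codes documented in this log (ranging = 000, thermal imaging = 100) come from the design; the bit ordering and the decode function are our assumptions.

```python
# Minimal sketch of the 3-bit module-ID decode described above.
# Bit order (pin 4 = MSB) is an assumption; only the pin numbers and the
# two documented codes (ranging = 000, thermal imaging = 100) are from this log.

MODULE_TABLE = {
    0b000: "ranging module",
    0b100: "thermal imaging module",
    # up to 8 codes (000..111) are available for future modules
}

def decode_module(pin4: int, pin5: int, pin32: int) -> str:
    """Combine the three ID-pin levels into a 3-bit code and look it up."""
    code = (pin4 << 2) | (pin5 << 1) | pin32
    return MODULE_TABLE.get(code, f"unknown module (code {code:03b})")

# Example: the thermal module ties the ID pins to VCC, GND, GND -> code 100
print(decode_module(1, 0, 0))   # -> "thermal imaging module"
```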
In addition to this main adapter board, we have also made adapter boards for the ranging module and the thermal imaging module so that they mate with it, giving a foolproof, direct plug-in design. Below are the schematic and 3D render of the ranging module's adapter board:
On it, H5, H6, and H7 set the ranging module's identification code to 000, and all of its pins are at fixed positions, so there is no risk of plugging it in incorrectly. The thermal imaging module's adapter board is a little more complicated because it uses a board-to-board connector, as shown in the figure below:
We have broken out the necessary I2C and SPI interfaces for it. The remaining unused pins are broken out for now and can simply be left unconnected. A filter capacitor has also been added on the power supply. Below is its 3D render:
H14, H15, and H16 are tied to VCC, GND, and GND respectively, so the thermal imaging module's identification code is 100. Later, we will model the overall shell based on the actual dimensions of these adapter boards when connected to the main board.
The SensorCam P4 is a handheld device similar in shape to a magnifying glass. To ensure convenience during 3D printing, we divided the device into the handle and the head. The overall design is shown in the figure below:
The combined state is shown in the figure below:
1. Diagram of the connection between the handle and the head
2. Connection Diagram of Handle Front and Rear Covers (Cross Section)
3. Magnetic connection diagram of the head and the back cover
❶❷❸❹: Magnet placement slot.
❺❻❼❽: Screw fixing hole.
❾❿: Type-C port.
⓫: USB port.
⓬⓭: Button port.
⓮: SD card port.
⓯: Wire channel.
⓰: Reserved space for camera interface.
❶❷❸❹: Magnet placement slot.
❺❻❼: Reserved slot for the female header of the main adapter board.
❽: Adapter board support, preventing the adapter board from deforming inward when the module is inserted.
❶❷❸❹: Coupling female socket.
❺: Knob reserved hole.
❻❼: Battery DC charging port support.
❽: Wire channel.
Front Side
Back Side
After completing the screen selection, we further planned the overall hardware design scheme for SensorCam P4, which roughly includes the following components:
The handle contains batteries. Next, I will elaborate on our selection considerations and final decisions regarding the mainboard, control components, power supply system, and other aspects.
After the screen model was selected, we found compatibility challenges between the ESP32-P4 chip and the originally planned display. After evaluating the form factor, interfaces, and performance of candidate main boards, we finally chose an ESP32-P4 development board with an integrated 3.4-inch touch display as the core main board, as shown in the figure below:
This development board has the following advantages:
● Compact design: The core board is small enough to save valuable space.
● Rich interfaces: Provides a variety of peripheral interfaces with strong expandability.
● Communication capability: Onboard ESP32-C6-MINI module, supporting Wi-Fi connectivity.
● Excellent display performance: 800×800 high resolution, 70% NTSC wide color gamut, 300cd/m² high brightness.
● Form fit: The shape design meets the ergonomic requirements of a handheld magnifying glass.
This motherboard not only solves compatibility issues, but its highly integrated design also greatly simplifies our system architecture and reduces the overall complexity.
Why are physical knobs still needed in the touchscreen era? Because for fine operations (such as image zooming, parameter adjustment, and menu navigation), the tactile feedback and precise control of a physical knob are hard for a touchscreen to replace. The knob we chose is the EC11 rotary encoder, as shown in the figure below:
The EC11 knob has the following characteristics:
● Suitable size: The diameter and height conform to ergonomics, ensuring comfortable operation.
● Exquisite appearance: The metallic texture is consistent with the overall style of the device.
● High performance: High sensitivity and accurate scale feedback.
● Economy: Extremely high cost performance and stable market supply.
The EC11 knob will provide users with an intuitive parameter adjustment experience, and it performs exceptionally well especially in scenarios that require quick and precise adjustments.
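For readers curious how the EC11's quadrature output is typically turned into rotation steps, here is a hardware-agnostic Python sketch of the standard Gray-code decode. Which direction counts as clockwise depends on the actual wiring, and the simulated sample sequence is purely illustrative; this is not the project's firmware.

```python
# Standard quadrature (Gray-code) decode for an incremental encoder such as the EC11.
# prev/new are 2-bit states (A << 1 | B); each valid transition yields +1 or -1.

TRANSITION = {
    (0b00, 0b01): +1, (0b01, 0b11): +1, (0b11, 0b10): +1, (0b10, 0b00): +1,
    (0b00, 0b10): -1, (0b10, 0b11): -1, (0b11, 0b01): -1, (0b01, 0b00): -1,
}

def decode(prev_state: int, new_state: int) -> int:
    """Return +1, -1, or 0 (no change / invalid transition) for one sample step."""
    return TRANSITION.get((prev_state, new_state), 0)

# Simulated A/B samples: one full step forward, then one full step back
samples = [0b00, 0b01, 0b11, 0b10, 0b00, 0b10, 0b11, 0b01, 0b00]
position = 0
for prev, new in zip(samples, samples[1:]):
    position += decode(prev, new)
print(position)   # -> 0 (net rotation is zero)
```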
Although the device is equipped with a touchscreen, we firmly believe that physical buttons still have irreplaceable value in specific scenarios: quick operations, blind operation, and clear physical feedback. The buttons we chose are five-wire circular keypad modules, as shown in the figure below:
The five-wire circular keypad module has the following advantages:
● Reasonable layout: The circular arrangement conforms to the natural movement trajectory of fingers.
● Excellent touch: The keys have a crisp sound and clear feedback.
● High sensitivity: Fast response with no sense of delay.
● Ergonomic design: The key travel and actuation force have been tuned so that extended operation is not tiring.
These buttons will serve as a supplement to touch operations, providing users with more diversified options for interaction methods.
To meet the requirement of long-term use of portable devices, we have chosen a 5600mAh 5V lithium battery as the power solution for the following reasons:
● High capacity: 5600 mAh ensures that the device can work continuously for a long time.
● Stable output: 5V output voltage perfectly matches the device requirements.
● Reusable: Supports repeated charging and...
During screen selection for the SensorCam P4, after comprehensive evaluation and multiple rounds of testing, we narrowed the field to two representative displays:
Display A: 2.83-inch, 480×640, non-fully-laminated TFT-LCD supporting 16.7M colors and touch, with a 40-pin 18-bit RGB + SPI interface.
Display B: 2.4-inch, 240×320, fully laminated TFT-LCD supporting 263K colors, without touch, using a 14-pin 4-wire SPI interface.
These two screens represent different product positioning: Display A is stronger in pixel density, color performance, and feature completeness; Display B offers low cost, easy procurement, and a narrow-bezel, fully laminated construction. Below we systematically compare parameters such as static and dynamic resolution, pixel density, brightness, color gamut, and refresh rate. Since the SensorCam P4 is mainly used indoors, we ran the tests indoors.
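Since pixel density is one of the compared parameters, a quick back-of-the-envelope calculation from the quoted specs (resolutions and diagonal sizes) shows the gap; this is just arithmetic, not part of the original test setup.

```python
import math

# Pixel density (PPI) of the two candidate displays, from the specs quoted above.
displays = {
    "Display A": (480, 640, 2.83),   # width px, height px, diagonal in inches
    "Display B": (240, 320, 2.4),
}
for name, (w, h, diag) in displays.items():
    ppi = math.hypot(w, h) / diag
    print(f"{name}: {ppi:.0f} PPI")
# -> Display A: ~283 PPI, Display B: ~167 PPI
```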
The real-shot comparison below shows the differences between the two screens in pixel fineness and color reproduction (Display A on the left, Display B on the right):
1. Displaying complex images to compare detail rendition between the two
It can be seen that Display A is significantly superior to Display B in terms of pixel density and edge sharpness, with stronger detail expression.
2. Displaying color-rich images to compare color rendering
It is obvious that the overall display effect of Display A is significantly more delicate and has richer color gradations.
3. Displaying thermal-imaging UI preview screens for comparison
The performance of the two screens on such images is similar, with no obvious differences.
Display A: Supports a MIPI interface, with stable and smooth frame-rate performance, fully meeting dynamic display requirements.
Display B:
● When using the screen refresh function for color gradient testing, the frame rate can reach 63fps, which is basically smooth.
● There is obvious stuttering when running LVGL animations, and performance is limited when handling complex graphical interfaces.
The comprehensive comparison results show that there is a significant gap between the two screens:
Advantages of Display A:
● Higher refresh rate and frame rate stability.
● Wider color gamut range and color expressiveness.
● Higher pixel density results in a delicate display effect.
● Supports touch interaction.
● Larger display area.
Advantages of Display B:
● Lower procurement costs.
● Simple interface, few occupied pins.
● Narrow bezel full lamination design.
Although Display A features a non-full lamination design and relatively wide bezels, these factors have limited impact on the actual user experience. Considering the high requirements of SensorCam P4 for display quality and user experience, we ultimately chose Display A as the screen solution for this project. Its excellent display performance and touch functionality will provide users with an interactive experience far superior to that of Display B, which is more in line with the project's positioning for high-quality visual presentation.
During the development of AI cameras, we realized that the form of AI cameras is far more than just RGB or depth cameras. "Specialized cameras" such as single-point ranging and thermal imaging do not rely on complex image processing, yet they can directly capture key physical information like temperature and distance, providing us with a new perspective to observe the world. This inspired us: can we create a lightweight hardware device that makes these efficient sensing capabilities more user-friendly and directly integrates them with camera images or even AI detection results, thereby achieving a wide range of functions?
Based on this concept, we propose SensorCam P4 — a modular sensing device with a camera at its core. This device is based on the high-performance ESP32-P4 main control, and its core capabilities are realized through pluggable expansion backplanes. There is no need to reflash the firmware; you only need to insert the corresponding sensing module according to your needs to quickly expand the functions, such as:
SensorCam P4 adopts a highly modular design that avoids the clutter of traditional multi-sensor integration and focuses on deep fusion of camera images and sensing data. The camera no longer just "sees colors"; it can also let you "see temperature", "see distance" and more, making multi-dimensional data easy to obtain. You can also add custom modules for your own needs. The device automatically identifies the type of inserted sensor and loads a dedicated UI: in thermal imaging mode, for example, you can choose overlay or split-screen display; in ranging mode, it shows the measured value and an aiming reference line. It looks roughly like this:
Why choose ESP32 P4 as the main controller?
Because its characteristics closely match the device's core requirements - efficiently processing camera data, handling AI tasks, and fusing sensor data - specifically:
1. Native camera and display support
2. Powerful processing capability and built-in AI capability
3. Rich connectivity capabilities
4. Mature development environment
reCamera Monitoring Interface Product Research
This is both a research sharing post and a discussion topic. As Makers/consumers, what features do you want in a network monitoring interface? Feel free to leave a comment below, or provide suggestions in our community (https://github.com/Seeed-Studio/OSHW-reCamera-Series/discussions). If your suggestion is adopted, we will give you a product as a gift when reCamera is launched.
Motivation: As the first page users interact with reCamera, it should present a sufficiently clear, powerful, and interactive interface.
When users use smart AI cameras (or remote network cameras), what do they want to see from the interface?
Product Goals: Clarity, interactivity, replaceable and expandable functions, and result output information.
User Needs:
Why would users buy an AI camera instead of a traditional IPC?
What are the advantages of AI cameras?
More expandable? More tailored to their own needs? Able to intelligently identify and alarm?
Expandability is reflected in:
1. Detection models can be replaced and self-trained.
2. Detection logic can be customized, including defining event-triggering logic and selecting detection areas (see the sketch after this list).
3. Output results can be exported and easily integrated into developers' own programs.
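To make the customizable detection logic concrete, here is a minimal Python sketch with a rectangular detection area and a consecutive-frame debounce before an event fires. The detection format (label, confidence, box), the region, and the thresholds are illustrative assumptions, not the product's actual implementation.

```python
# Illustrative event-trigger logic: raise an event only when a person is detected
# inside a user-defined region for several consecutive frames (debounce).

REGION = (100, 50, 400, 350)          # x1, y1, x2, y2 of the user-selected detection area
CONF_THRESHOLD = 0.5
FRAMES_TO_TRIGGER = 5                  # consecutive frames required before an event fires

def center_in_region(box, region):
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    return region[0] <= cx <= region[2] and region[1] <= cy <= region[3]

def run(frame_detections):
    hits = 0
    for frame_idx, detections in enumerate(frame_detections):
        person_in_zone = any(
            label == "person" and conf >= CONF_THRESHOLD and center_in_region(box, REGION)
            for label, conf, box in detections
        )
        hits = hits + 1 if person_in_zone else 0
        if hits == FRAMES_TO_TRIGGER:
            print(f"event: person in zone since frame {frame_idx - hits + 1}")

# Example with fake per-frame detections: 6 frames with a person, then an empty frame
run([[("person", 0.8, (150, 100, 250, 300))]] * 6 + [[]])
```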
Event triggering and alarming are important functions of surveillance cameras.
User needs in different scenarios:
Home users: Hope to detect abnormal situations at home in a timely manner through intelligent recognition, such as strangers breaking in, fire hazards, etc., and expect the interface to be simple and easy to operate.
Enterprise users: Need comprehensive monitoring of production sites, office areas, etc., requiring intelligent recognition of production violations, personnel attendance, etc., and hope to couple with the enterprise's own management system.
Developer users: Focus on the product's expandability and secondary development capabilities, hoping to replace detection models, self-train models, and integrate output results into their own programs.
Market Cases:
Most offerings from large companies are B2B products, making it difficult for users to do secondary development.
Hikvision Algorithm Platform:
Self-developed platform with drag-and-drop processing steps (operators written by Hikvision, mostly for industrial processing, presumably traditional CV). However, Hikvision's AI Open Platform provides one-stop self-training and deployment.
Hikvision AI Open Platform Case: Detection of masks and chef hats in the kitchen.
If we focus on expandability, providing a secondary development platform is crucial.
DJI Osmo Series:
The architecture is mainly host-downloaded software + camera pure streaming. It connects via Bluetooth and transmits images through network protocols, while the setting terminal and interface run entirely on the host, reducing the burden on the end side.
TP-Link:
Network configuration, storage information, event triggering, camera resolution.
Login directly by entering the IP.
It also provides official software to monitor multiple images.
VCN 19 - Computer client remote monitoring method - TP-LINK Visual Security
Edge Computing Box - AI Algorithm Box - AI Edge Box - Kunyun Technology:
Supports SDK interfaces and mainstream frameworks such as PyTorch, customization, 4 TOPS of computing power, 4K@60fps, but more like re.
Content and Functional Requirements:
Functions marked in yellow are relatively rare or even non-existent in the market.
Basic Part:
- Video stream display (different code streams can be selected, low code stream has higher fluency)
- Display IP address and current time on the video screen
- Basic operation controls: pause/play, record, screenshot, audio switch, PTZ (this part is an extended function, linked with Gimbal) (pan-tilt control, direction keys + zoom slider (if available)).
Why Deploy Video Detection Models on Embedded Devices?
When we talk about visual AI, many people first think of high-precision models on the server side. However, in real-world scenarios, a large number of video analysis requirements actually occur at the edge: abnormal behavior warning of smart cameras, road condition prediction of in-vehicle systems... These scenarios have rigid requirements for **low latency** (to avoid decision lag), **low power consumption** (relying on battery power), and **small size** (to be embedded in hardware devices). If video frames are transmitted to the cloud for processing, it will not only cause network delay but also may lead to data loss due to bandwidth limitations. Local processing on embedded devices can perfectly avoid these problems. Therefore, **slimming down** video detection models and deploying them to the edge has become a core requirement for industrial implementation.
Isn't YOLO Sufficient for Visual Detection?
The YOLO series (You Only Look Once), as a benchmark model for 2D object detection, is famous for its efficient real-time performance, but it is essentially a **single-frame image detector**. When processing videos, YOLO can only analyze frame by frame and cannot capture **spatiotemporal correlation information** between frames: for example, a "waving" action may be misjudged as a "static hand raising" in a single frame, while the continuous motion trajectory of multiple frames can clarify the action intention.
In addition, video tasks (such as action recognition and behavior prediction) often need to understand the "dynamic process" rather than isolated static targets. For example, in the smart home scenario, recognizing the "pouring water" action requires analyzing the continuous interaction between the hand and the cup, which is difficult for 2D models like YOLO because they lack the ability to model the time dimension.
Basic Knowledge of Video Detection Models: From 2D to 3D
A video is essentially four-dimensional data of "time + space" (width × height × time × channel). Early video analysis often adopted a hybrid scheme of "2D CNN + temporal model": first using 2D convolution to extract single-frame spatial features, then using models like LSTM to capture temporal relationships. However, this scheme does not model spatiotemporal correlations tightly enough.
**3D Convolutional Neural Networks (3D CNNs)** perform convolution operations directly in three-dimensional space (width × height × time), and extract both spatial features (such as object shape) and temporal features (such as motion trajectory) through sliding 3D convolution kernels. For example, a 3×3×3 convolution kernel will cover a 3×3 spatial area in a single frame and also span the time dimension of 3 consecutive frames, thus naturally adapting to the dynamic characteristics of videos.
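To make the tensor shapes concrete, here is a minimal PyTorch sketch (PyTorch is our assumption here, not something used by the project) of a 3×3×3 convolution over a 16-frame clip, along with the depthwise-separable factorization used by the efficient models discussed below:

```python
import torch
import torch.nn as nn

# A video clip is a 5-D tensor: (batch, channels, time, height, width).
clip = torch.randn(1, 3, 16, 112, 112)    # 16 RGB frames of 112x112

# Standard 3D convolution: a 3x3x3 kernel covers a 3x3 spatial patch across
# 3 consecutive frames, learning spatial and temporal features jointly.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
print(conv3d(clip).shape)                  # -> torch.Size([1, 64, 16, 112, 112])

# Depthwise-separable 3D convolution (the kind of factorization used by the
# Efficient 3DCNNs discussed below): one 3x3x3 filter per channel, then a 1x1x1 mix.
depthwise = nn.Conv3d(3, 3, kernel_size=3, padding=1, groups=3)
pointwise = nn.Conv3d(3, 64, kernel_size=1)
print(pointwise(depthwise(clip)).shape)    # same output shape, far fewer multiply-adds

def params(*mods):
    return sum(p.numel() for m in mods for p in m.parameters())

print(params(conv3d), params(depthwise, pointwise))  # 5248 vs. 340 parameters
```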
Why Introduce Efficient 3DCNNs Today?
Although 3D CNNs can effectively model video spatiotemporal features, traditional models (such as C3D and I3D) have huge parameters and high computational costs (often billions of FLOPs), making them difficult to deploy on embedded devices with limited computing power (such as ARM architecture chips).
The **Efficient 3DCNNs** proposed by Köpüklü et al. are designed to solve this pain point:
1. **Extremely lightweight design**: Through technologies such as 3D depthwise separable convolution and channel shuffle, the model parameters and computational load are reduced by 1-2 orders of magnitude (for example, the FLOPs of 3D ShuffleNetV2 are only 1/10 of ResNet-18) while maintaining high precision;
2. **Hardware friendliness**: It supports dynamically adjusting model complexity through "Width Multiplier" (such as 0.5x, 1.0x) to adapt to embedded devices with different computing power;
3. **Plug-and-play engineering capability**: The open-source project provides complete pre-trained models (supporting datasets such...
In retail stores, exhibition halls, and similar scenarios, knowing in real time how many people enter and how long they stay is the core basis for optimizing operations. On the RV1126B edge computing platform we built a lightweight people-flow detection demo: the camera captures frames and everything is processed locally, yielding accurate visitor counting and dwell-time analysis. More importantly, the technical design of this solution is tied closely to the application scenario, balancing detection accuracy with real operational needs.
Display of Detection Results:
Generate Statistical Result Text
The core idea of this Demo is to let the system "observe - judge - record" like the human eye, which is specifically divided into three steps:
After the camera captures each frame in real time, the system first filters for "person" targets: a pre-trained YOLO11 model discards irrelevant objects such as goods and shelves and focuses only on human silhouettes. Even when lighting fluctuates (backlight at a store entrance, exhibition-hall lights switching), the system can still track pedestrians stably, so lighting does not cause missed or false detections.
After identifying a person, the system assigns a temporary ID to each individual and tracks their path in real time. For example, when someone walks in through the door, the system marks "enter" and starts timing. Notably, the tracker caches IDs: if a target is briefly lost and then reappears in the frame, it is not counted again - its timer simply continues - avoiding the double counting that traditional infrared sensors suffer from (one count every time the beam is blocked).
When a person enters the monitoring area, the system automatically starts timing until they leave. It also aggregates statistics such as average stay time and maximum stay time, which have real practical value: in retail scenarios it can show how long customers linger in front of which shelves, and in exhibition halls it can reveal which zones are most attractive. The timing logic does not rely on network time; it is calculated from the video frame rate, so records stay accurate even if the network goes down.
All computation (recognition, tracking, timing) is completed on the RV1126B chip; nothing needs to be sent to the cloud. This means lower latency and no data lag caused by network congestion, which is particularly important for stores that need to adjust staffing in real time.
The system has built-in adjustable confidence parameters: in crowded supermarkets the recognition threshold can be raised to avoid false positives in dense crowds, while in boutiques with few customers it can be lowered so that every visitor is recorded. We can also define "effective areas" (for example, counting only people who enter the store and ignoring pedestrians passing the door), adapting to different venue layouts with simple configuration.
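Below is a minimal sketch of the counting and dwell-time logic described above, written with the Ultralytics YOLO11 Python API to match the quick-test Python path mentioned next. It is not the project's RV1126B code (which is cross-compiled C++); the model file, video source, ROI, and thresholds are placeholders.

```python
import cv2
from collections import defaultdict
from ultralytics import YOLO

# Illustrative visitor counting + dwell-time sketch (not the project's actual pipeline).
model = YOLO("yolo11n.pt")                    # pre-trained model; person class = 0
cap = cv2.VideoCapture("store_entrance.mp4")  # or an RTSP/camera source
fps = cap.get(cv2.CAP_PROP_FPS) or 25         # dwell time is derived from the frame rate
roi = (200, 0, 1100, 720)                     # "effective area": count only inside this box

frames_seen = defaultdict(int)                # track ID -> number of frames observed
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # persist=True keeps tracker state between frames so IDs survive short occlusions
    result = model.track(frame, persist=True, classes=[0], conf=0.5, verbose=False)[0]
    if result.boxes.id is None:
        continue
    for track_id, box in zip(result.boxes.id.int().tolist(), result.boxes.xyxy.tolist()):
        cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
        if roi[0] <= cx <= roi[2] and roi[1] <= cy <= roi[3]:
            frames_seen[track_id] += 1        # a reappearing ID keeps accumulating, not re-counted

dwell = {tid: n / fps for tid, n in frames_seen.items()}
print(f"visitors: {len(dwell)}, average stay: {sum(dwell.values()) / max(len(dwell), 1):.1f}s")
```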
According to different needs, two implementation paths are provided:
- If you need to test quickly, you can run it as a Python script and finish configuration in a few minutes, which suits makers who want to verify ideas fast;
- If you pursue long-term stable operation, you can cross-compile based on C++ language to generate efficient execution files, reduce power...
When exploring how to build an AI camera, clarifying its core functional positioning is crucial. AI cameras have a wide range of application scenarios, from real-time monitoring in intelligent security and detail capture in industrial quality inspection to dynamic recording in home care. The three major functions of RTSP streaming, photo capture, and video recording are the core pillars, directly determining an AI camera's practical value in different scenarios.
Currently, many related products and projects face significant pain points in practical applications: RTSP streaming is prone to instability and excessive latency, which greatly affects scenarios requiring real-time feedback (such as security monitoring and industrial assembly line monitoring); when taking photos and recording videos, problems like frame freezes and blurriness occur frequently, seriously impacting the experience for both home users recording life moments and enterprises using them for document shooting and scene preservation. Therefore, a stable RTSP streaming solution combined with smooth photo-taking and video-recording capabilities is a core element in creating an excellent AI camera.
After comparing multiple products, we chose the RV1126B chip as the core processor for testing. The reasons for selecting it are mainly twofold:
First, its RTSP streaming function is stable, supporting both H.264 and H.265 encoding formats, and it can adjust the bitrate according to network conditions and device performance, outputting multiple streams to adapt to different network environments, whether it is a home network with limited bandwidth or a demanding industrial local area network;
Second, the RV1126B has excellent ISP image processing capabilities, ensuring the clarity of photos and the smoothness of video recording, providing high-quality output for both long-distance capture in security scenarios and dynamic video recording in home scenarios.
In conclusion, the powerful encoding/decoding and network streaming capabilities of the RV1126B make it a strong candidate for AI Cameras.
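As a quick way to sanity-check such a stream from the host side, here is a small OpenCV client sketch that measures the effective frame rate and flags dropped reads. The RTSP URL and measurement window are placeholders; the actual endpoint path depends on the firmware configuration.

```python
import time
import cv2

# Simple client-side check of an RTSP stream's continuity and effective frame rate.
# The URL below is a placeholder; the real path/credentials depend on the camera firmware.
URL = "rtsp://192.168.1.100:554/live/main"   # e.g. a main (high-bitrate) stream

cap = cv2.VideoCapture(URL)
if not cap.isOpened():
    raise SystemExit("could not open RTSP stream")

frames, start = 0, time.time()
while time.time() - start < 10:              # measure for ~10 seconds
    ok, frame = cap.read()
    if not ok:                               # a dropped read hints at network/encoder issues
        print("frame drop at", round(time.time() - start, 2), "s")
        continue
    frames += 1

print(f"received {frames} frames, ~{frames / (time.time() - start):.1f} fps")
cap.release()
```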
Below, I will test the three major functions of the camera: taking photos, recording videos, and RTSP streaming. I will examine the clarity, color contrast, and level of detail presentation of the photos taken, and at the same time, I will introduce photos taken by the iPhone 15 at 4K resolution as a benchmark for comparison. For videos, I will focus more on picture smoothness, frame rate stability, and storage costs. For the RTSP part, streaming stability, anti-interference ability, and real-time performance are the points I value.
As an AI camera, its photo quality is crucial. The stock RV1126B is equipped with a camera unit that supports 4K resolution, fully meeting the needs of security and visual inspection. Below is a comparison of images taken by the RV1126B and the iPhone at the same 4K resolution. The RV1126B image shows slight edge distortion, and its color contrast is slightly inferior to the iPhone's; part of this is because the iPhone applies automated post-processing that enhances contrast and corrects distortion. In terms of sharpness, the RV1126B's output is no worse than the iPhone's. (Because Hackaday limits upload size, the images are compressed.)
It is worth mentioning that on the IP Camera's website, users can adjust parameters such as brightness, contrast, exposure, and backlight compensation to obtain higher-quality images.
The quality of video recording is an even more crucial indicator for a camera, since it directly affects the accuracy of visual-algorithm recognition. A good camera must deliver excellent clarity and detail in the recorded frames, smooth images, stable frame rates, and low storage costs.
The video...