Offloading Vision Processing
Overview
One of the weaknesses of ARCore is that it restricts developers to the features that ARCore itself supports. Granted, ARCore is a very powerful tool when used for its intended purpose: it makes it extremely easy to place virtual objects on flat surfaces, use 2D markers to guide the generation of holograms, and even mimic real-world lighting through easy-to-access API calls. However, ARCore's supported features do not cover the entire domain of augmented reality. One example is 3D object recognition. There are times when an AR app developer may want to place holograms over real-world objects without having to print out specific markers or patterns. This is something the Vuforia SDK supports; ARCore does not. In this situation, a developer would be forced to switch from ARCore to Vuforia to use this feature in their app.

What if there were a way to get ARCore to perform object recognition, even though it does not support that feature? More broadly, what if you wanted to implement other types of computer vision algorithms in an ARCore app? AR is, after all, a highly visual technology. This tutorial will teach you how to do exactly that. You will learn how to implement a very basic form of 3D object recognition within ARCore by offloading computer vision processing from ARCore to a remote computer. In other words, ARCore will make another computer do the heavy work of recognizing objects within images, and simply wait to be told by that computer where to place the holograms. Here is a video of what the final product will look like:
Why offload to a remote computer?
In this tutorial, you will be separating the computer vision logic from the rest of the AR logic. You may be wondering why it is necessary to offload image processing to a remote computer instead of simply running computer vision algorithms in parallel on the phone itself. For simple algorithms, that may indeed be possible. However, a typical phone does not have a powerful enough processor to run an AR app while simultaneously performing computationally heavy computer vision processing. Offloading the heavy work to a more powerful computer lets you integrate arbitrarily complex computer vision algorithms. With a sufficiently powerful remote computer, you could even integrate deep learning into ARCore!
Hardware requirements
ARCore-compatible Android phone (A list of supported devices can be found here)
Windows computer (must have Python3 installed)
Some object with a single solid, vivid color (the example from this tutorial uses the blue lid of a pill bottle).
Pre-requisites
Complete a getting-started tutorial for ARCore (my intro tutorial will work fine).
Be familiar with writing and running Python scripts from your computer.
Be familiar with Unity's basic functionality (the Hierarchy and Inspector menus, importing 3D models, creating scripts).
The process
The ARCore 3D object recognition pipeline you will implement works as follows:
Run a basic ARCore application that detects flat surfaces.
For each image frame, extract the image and send it to a remote computer.
On the remote computer, perform color-based object detection to find the image coordinates of the object of interest.
Send these image coordinates to ARCore.
Within ARCore, place a hologram at those image coordinates.
Repeat for each subsequent image frame.
This process can be split into three components:
The ARCore app
The object-recognizing computer vision code
The communication network
Step 1 - Creating the ARCore App
We will be modifying the "HelloAR" quickstart ARCore tutorial for Unity. To begin, follow the tutorial here to get the HelloAR sample project up and running in Unity. Play around with the sample app, and familiarize yourself with placing holograms on flat surfaces with finger taps. Eventually, these finger taps will be emulated with image coordinates received from remote image processing.
The first change we will make to this sample project is to disable the visualization of the detected planes. This way, the blue and red colors of the plane visualizations will not interfere with our color-based object detection code. Go to the Hierarchy on the left menu, click "Plane Generator", and disable it by unchecking the topmost checkbox in the Inspector in the right menu. "Plane Generator" should now be greyed out. Do the same for the "Point Cloud" object. With these two GameObjects disabled, ARCore will no longer create visualizations of planes when it detects flat surfaces.
Now, we will create the hologram we will be placing over the detected object. For this tutorial, we will import a 3D model of a simple green arrow, which can be acquired here. Once it's been imported, drag it into the scene. A new object should now appear in the hierarchy. Rename it to "arrow" if it isn't already named "arrow". Now, set the scale of the arrow to 0.1 across all three axes. Your scene should now look like this:
Next, we will begin modifying the code. Open HelloAR/Scripts/HelloARController.cs. Look in the Update() function. Comment out everything in the Update() function after "_UpdateApplicationLifecycle()". This will disable touch input and prevent the user from creating holograms by tapping the screen.
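Your copy of HelloARController.cs may differ slightly between SDK versions, but after this change the top of Update() should look roughly like the sketch below, with only the lifecycle call left active:

```csharp
public void Update()
{
    _UpdateApplicationLifecycle();

    // Everything below is commented out so taps no longer spawn holograms.
    // Touch touch;
    // if (Input.touchCount < 1 || (touch = Input.GetTouch(0)).phase != TouchPhase.Began)
    // {
    //     return;
    // }
    //
    // ... raycast against detected planes and instantiate a prefab at the hit pose ...
}
```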
Now, download the TCPClient.cs file linked here, and place it in the same directory as HelloARController.cs. Create a new empty object called NetworkManager, and attach TCPClient.cs to it. Now, if you look in the Inspector, NetworkManager has a field under "TCP Client" called "ARMarker". Drag the "arrow" GameObject into this field. Your Hierarchy menu should now look like this:
Make the following changes to TCPClient.cs:
On line 34 (the line that initializes the "client" variable), change the IP address to your development computer's IP address. On Windows, you can find this by opening the Command Prompt, entering "ipconfig", and looking for your IPv4 Address. It should look something like this:
On lines 37 and 38, change the values of screenWidth and screenHeight to correspond with your phone's screen resolution.
Now, deploy the app to your phone to be sure that nothing's broken! Right now, the app doesn't do anything, because we have yet to implement the image processing script and the rest of the communication pipeline.
Now, let's step through the code in TCPClient.cs to see what's happening. On a high level, this script performs the following tasks:
Extracts an image at each frame.
Attempts to send each acquired image to the remote PC (specified by the IP address you entered).
Listens for image coordinates from the remote PC.
Places a hologram on a detected flat surface whenever an image coordinate is received from the remote PC.
Let's examine each of these four tasks in detail:
Extracting images
In the Start() function, a RenderTexture and a Texture2D are initialized as renderTexture and virtualPhoto, respectively. In order to extract images from ARCore, one can simply write the camera's render texture to a Texture2D object, and then convert that Texture2D object to a PNG. This is done in the GetImage() function, which contains comments explaining the process in detail.
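If you want a feel for what this looks like in code, here is a minimal, self-contained sketch of the same render-texture approach. The field and method names mirror those in TCPClient.cs, but the actual script may differ in resolution, texture format, and other details.

```csharp
using UnityEngine;

// A minimal sketch of the render-texture image extraction described above.
public class ImageGrabberSketch : MonoBehaviour
{
    public Camera arCamera;              // the ARCore "First Person Camera"
    private RenderTexture renderTexture;
    private Texture2D virtualPhoto;

    void Start()
    {
        renderTexture = new RenderTexture(640, 480, 24);
        virtualPhoto = new Texture2D(640, 480, TextureFormat.RGB24, false);
    }

    public byte[] GetImage()
    {
        // Render one frame of the AR camera into the off-screen RenderTexture.
        arCamera.targetTexture = renderTexture;
        arCamera.Render();

        // Copy the rendered pixels from the GPU into the readable Texture2D.
        RenderTexture.active = renderTexture;
        virtualPhoto.ReadPixels(new Rect(0, 0, renderTexture.width, renderTexture.height), 0, 0);
        virtualPhoto.Apply();

        // Restore normal rendering to the screen.
        arCamera.targetTexture = null;
        RenderTexture.active = null;

        // Encode as PNG bytes, ready to be sent over the network.
        return virtualPhoto.EncodeToPNG();
    }
}
```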
If you look back up in the Update() function, you can see that whenever an image is not already being sent to the remote computer, the ARCore app extracts the current image by calling GetImage() on line 57. If the data obtained from this function is large enough to be an image (greater than 4000 bytes), then ARCore attempts to send this image to the remote computer. This brings us to the next task:
Sending images to remote PC
In order to send image bytes to the remote computer, we set up a TCP network. As you can probably tell from this script's filename, the script implements a TCP client, which connects to a TCP server on the remote computer (we will implement the server later). The client sends large image byte arrays to the server in "chunks" of 4096 bytes (this size is assigned to the "buffer" variable on line 17). The server receives these chunks of data and reconstructs the original image byte array. Because each image byte array has a different size, we need some way of telling the server how long each image byte array is so that it can reconstruct the image correctly. This can be done by sending a small header packet containing a 4-byte integer that tells the server how many of the following bytes belong to a single image. Let's examine how this is done in our code...
Picking back up from where we left off in the Update() function, you can see on line 60 a byte array of length 4 called headerData. This is the header packet described above. Just above it, imgSize is defined as a byte array encoding the size of the image; this value is assigned to headerData, which is written to the network stream on line 62. On each subsequent call to Update(), ARCore continues to send chunks of the image byte array until none are left. This is done in lines 66-85.
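Condensed into a single blocking method, the protocol looks roughly like the sketch below. This is for illustration only, assuming an already-connected TcpClient; the real TCPClient.cs spreads the same work across multiple Update() calls so the app never stalls while a large image is being sent.

```csharp
using System;
using System.Net.Sockets;

// A sketch of the length-prefix protocol described above.
public static class ImageSenderSketch
{
    public static void SendImage(NetworkStream stream, byte[] imageBytes)
    {
        // 4-byte header telling the server how many image bytes will follow.
        // Note: BitConverter is little-endian on Android, so the Python server
        // must unpack the length the same way.
        byte[] headerData = BitConverter.GetBytes(imageBytes.Length);
        stream.Write(headerData, 0, headerData.Length);

        // Send the image itself in 4096-byte chunks.
        int bytesSent = 0;
        while (bytesSent < imageBytes.Length)
        {
            int chunkSize = Math.Min(4096, imageBytes.Length - bytesSent);
            stream.Write(imageBytes, bytesSent, chunkSize);
            bytesSent += chunkSize;
        }
    }
}
```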
Listening for image coordinates from remote computer
Because listening for responses from the server in a synchronous fashion could block and cause ARCore to lag, we run the response-listening code in a separate thread, initialized as _thread on line 43. This thread runs the ThreadReceiveMessage() function (implemented on line 108), which listens for messages from the server in an infinite loop. Whenever a message is received, the coordsStr variable is updated with the received coordinates (for example, "500,450"). coordsStr is then used to place the hologram.
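The core idea, stripped of the details specific to TCPClient.cs, is sketched below; treat it as an illustration of the pattern rather than the exact contents of the script.

```csharp
using System.Net.Sockets;
using System.Text;
using System.Threading;

// A sketch of the background-listener pattern described above.
public class CoordinateListenerSketch
{
    private volatile string coordsStr;   // e.g. "500,450"; read by Update()
    private NetworkStream stream;        // stream of the connected TcpClient
    private Thread _thread;

    public void StartListening(NetworkStream connectedStream)
    {
        stream = connectedStream;
        _thread = new Thread(ThreadReceiveMessage);
        _thread.IsBackground = true;      // don't keep the app alive after it quits
        _thread.Start();
    }

    private void ThreadReceiveMessage()
    {
        byte[] buffer = new byte[1024];
        while (true)
        {
            // Read() blocks until the server sends something, which is exactly
            // why this runs on its own thread instead of inside Update().
            int bytesRead = stream.Read(buffer, 0, buffer.Length);
            if (bytesRead <= 0) break;
            coordsStr = Encoding.ASCII.GetString(buffer, 0, bytesRead);
        }
    }
}
```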
Placing a hologram using image coordinates from PC
Whenever coordsStr is defined, Unity will try to place an object at those coordinates by calling the PlaceObject function in the Update function (line 52). Let's take a look at the PlaceObject function. It first parses coordsStr into two integers "x" and "y". The code from line 123 onwards is adapted from the code we commented out earlier in HelloARController.cs! Instead of using tap locations to place holograms, it now uses the "x" and "y" coordinates we parsed to place the virtual object on the detected plane; a rough sketch of that hit-test logic is shown below.
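Here is a sketch of what such a PlaceObject method can look like, assuming the GoogleARCore Unity SDK used by the HelloAR sample and written as a method inside TCPClient.cs (which already has the needed using directives). It illustrates the idea rather than reproducing the script verbatim.

```csharp
// A sketch of PlaceObject. "ARMarker" is the arrow GameObject we dragged into
// the Inspector earlier; the real method in TCPClient.cs may differ in detail.
private void PlaceObject(string coordsStr)
{
    // Parse "x,y" (e.g. "500,450") into screen-space pixel coordinates.
    string[] parts = coordsStr.Split(',');
    float x = float.Parse(parts[0]);
    float y = float.Parse(parts[1]);

    // Raycast from that screen point against the planes ARCore has detected,
    // just like HelloARController.cs does with a touch position.
    TrackableHit hit;
    TrackableHitFlags filter = TrackableHitFlags.PlaneWithinPolygon |
                               TrackableHitFlags.FeaturePointWithSurfaceNormal;
    if (Frame.Raycast(x, y, filter, out hit))
    {
        // Move the arrow hologram to the hit pose and anchor it to the trackable.
        Anchor anchor = hit.Trackable.CreateAnchor(hit.Pose);
        ARMarker.transform.position = hit.Pose.position;
        ARMarker.transform.rotation = hit.Pose.rotation;
        ARMarker.transform.parent = anchor.transform;
    }
}
```

With that, the ARCore app is complete.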
Now there are two more components to implement: the TCP server and the object detection code. Both of these will be run on scripts on the remote PC (in our case, our development machine).
Step 2 - Implementing the TCP Server and Object Detection
Download the tcp_serve.py script linked here. Make the following changes:
On line 66, change TCP_IP to your IP address. It should be the same IP you used before, for the ARCore app.
In the process_img function, comment out "segment_img(conn, img)" (line 18) and uncomment "select_color(conn, img)". The function should now look like this:
Now, run tcp_serve.py using Python3 on your development computer. At the same time, run the ARCore app we completed earlier. The Python console should acknowledge that a client (the phone) has connected to the server (the computer). You should also see the ARCore app's camera view mirrored on your computer monitor.
Aim your phone camera at the object you want to track. Remember, the object should be a single solid, vivid color. The object I chose was the bright blue cap of a pill bottle. With the camera still pointed at that object, use your mouse to click the object in the mirrored window multiple times. With each click, the terminal should output a message of the format "hsv: " followed by three numbers. These are the HSV values of the pixels you clicked on. After clicking 10 or so times, Ctrl-C out of the tcp_serve script and exit the ARCore app. Examine the HSV values you just generated. Then, modify lines 39 and 40 of tcp_serve.py so that the lower and upper bounds of the HSV colors we want to track accurately reflect the data you just generated.
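For example, if your clicks reported hues in the low 100s for a blue cap, the bounds might end up looking something like the following (the variable names here are illustrative; use whatever names actually appear on lines 39 and 40):

```python
import numpy as np

# Illustrative bounds for a vivid blue object (OpenCV HSV order: H, S, V).
# Adjust these so they bracket the values printed by your own clicks.
lower_bound = np.array([100, 120, 100])
upper_bound = np.array([125, 255, 255])
```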
Once this is done, restore process_img to its original state:
Now, if you run the script again and boot up the ARCore app once more, you will see that an arrow is now hovering over the target object! Even if you move the camera or the object, the arrow will follow. It should look something like this:
Now that you're convinced that the script works, let's dissect it. The script should be well-commented, so I'll keep the explanations broad.
The run_server() function creates the TCP server with your computer's IP address and connects to the ARCore client. It also handles the image reconstruction process. In an infinite loop, it does the following (a condensed sketch of this loop follows the list):
Receive a data packet from the client.
If the data packet has size 4, it is a header packet. Extract the size of the incoming image from this header packet. We now know how many bytes to gather from future packets in order to construct our image.
Until we have received the target number of bytes, continue appending incoming data packets to the curr_img_bytes byte array.
When the target number of bytes has been added to curr_img_bytes, process this image byte array with the process_img function.
Wait for the next header packet.
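Here is a condensed sketch of that loop, assuming the header is a 4-byte little-endian integer (matching what BitConverter.GetBytes produces on the phone). The port number and exact structure are illustrative; refer to tcp_serve.py for the real code.

```python
import socket
import struct

# A condensed sketch of the receive loop, not the exact contents of tcp_serve.py.
def run_server_sketch(tcp_ip, tcp_port=9999):
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((tcp_ip, tcp_port))
    server.listen(1)
    conn, addr = server.accept()          # blocks until the ARCore app connects
    print("client connected:", addr)

    while True:
        # 1. Read the 4-byte header to learn how large the next image is.
        header = conn.recv(4)
        if len(header) != 4:
            break
        img_size = struct.unpack('<i', header)[0]

        # 2. Keep appending packets until the whole image has arrived.
        curr_img_bytes = b''
        while len(curr_img_bytes) < img_size:
            packet = conn.recv(min(4096, img_size - len(curr_img_bytes)))
            if not packet:
                break
            curr_img_bytes += packet

        # 3. Hand the completed byte array to the image-processing step
        #    (process_img is defined in tcp_serve.py).
        process_img(conn, curr_img_bytes)

    conn.close()
```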
So what does the process_img function do? You'll recall that this is where we toggled between the "select_color" and "segment_img" functions. process_img first converts the image byte array into an OpenCV-compatible image format, before passing it into either "select_color" or "segment_img" (whichever one is currently uncommented). Here's what each of these two functions does:
select_color simply shows the image on the screen and sets up a mouse callback to output the HSV values of any pixel you click. This is used to acquire the lower and upper HSV bounds that are required for color-based object detection.
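A minimal sketch of that click-to-sample idea is shown below; the function and window names are illustrative, not necessarily those used in tcp_serve.py.

```python
import cv2

# A sketch of the click-to-sample behavior behind select_color.
def select_color_sketch(conn, img):
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

    def on_mouse(event, x, y, flags, param):
        if event == cv2.EVENT_LBUTTONDOWN:
            # Print the HSV value of the clicked pixel (rows are y, columns are x).
            print("hsv:", hsv[y, x])

    cv2.imshow("frame", img)
    cv2.setMouseCallback("frame", on_mouse)
    cv2.waitKey(1)  # let the window refresh before the next frame arrives
```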
segment_img uses these hard-coded lower and upper bounds to threshold the image, creating a mask:
Once the mask is obtained, the center of mass of the white blob is calculated to obtain the image coordinates of the object. These image coordinates are scaled up to the smartphone's screen resolution, encoded as a string, and then sent to the ARCore client. You know the rest of the story.
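A sketch of the thresholding and center-of-mass calculation is shown below, with the HSV bounds and screen scale factors passed in as illustrative parameters; the real segment_img in tcp_serve.py may differ in the details.

```python
import cv2
import numpy as np

# A sketch of the color thresholding + center-of-mass step described above.
def segment_img_sketch(conn, img, lower_bound, upper_bound, scale_x=1.0, scale_y=1.0):
    # Threshold in HSV space: pixels inside the bounds become white, the rest black.
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, lower_bound, upper_bound)

    # Center of mass of the white blob, via image moments.
    moments = cv2.moments(mask)
    if moments['m00'] > 0:
        cx = int(moments['m10'] / moments['m00'])
        cy = int(moments['m01'] / moments['m00'])

        # Scale up to the phone's screen resolution and send back as "x,y".
        screen_x = int(cx * scale_x)
        screen_y = int(cy * scale_y)
        conn.send("{},{}".format(screen_x, screen_y).encode())
```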
And there we have it!
We have successfully implemented a rudimentary form of 3D object detection for an ARCore app. Once you've spent enough time absorbing the contents of this tutorial, it would be worth playing around with more complex methods of object detection. The beauty of offloading vision processing from ARCore to a remote computer is that the phone never has to do the heavy image processing itself. In other words, any computer vision algorithm that can run on a PC can now be used within an ARCore app. The possibilities are endless!