I have been accepted into the AWS Community Builders program. The program provides resources and networking opportunities across a range of AWS services. My focus will be in the machine learning category, but I hope to investigate other topics such as IoT and web applications. I will be sharing what I learn here, on BearID Project and on my new DEV Community blog.
For my initial thoughts on what I’m hoping to get out of the program, check out my first post on the DEV Community, AWS Community Builders: My First Step.
As part of Microsoft’s AI for Earth program, BearID was invited to apply for an Azure Percept Pilot Grant. The grant provides an Azure Percept DK, a development kit for edge AI. The goal will be to see if we can run aspects of bearid on this IoT device.
Once the project is complete, I will have the opportunity to write a post on an Azure IoT blog. I’ll link to it from this blog when it’s ready, probably in early 2022.
In my previous post, No-Code Image Classification with Azure Custom Vision, I introduced Microsoft Azure’s Cognitive Services for vision. More specifically, I described how the BearID Project could use Azure Computer Vision as a brown bear detector through an API. I also went into some detail on how to use Azure Custom Vision to build an image classifier without writing a line of code.
This time I’ll talk about using the same service to build an object detection model, similar to our current bearface program. As with classifiers, you can utilize this service through the web portal or programmatically, using the Custom Vision SDK in your language of choice. I will mainly focus on the Custom Vision SDK for Python, although we’ll have a look at a few web portal features as well.
In the bearid application, we use an object detector, bearface, to find bear faces in images, as well as identify the eyes and nose (see Bear (C)hipsterizer). Azure Custom Vision does not support finding landmarks like the eyes and nose, so we will only worry about finding the faces. Following the Quickstart: Create an object detection project with the Custom Vision client library, we will use the Python SDK to do the following:

- create a new Custom Vision project
- define a tag (label)
- upload and label the dataset
- train the model
- publish the trained iteration as a prediction endpoint
I ran everything on my local Linux machine, otis, but you can also run this in a cloud instance. To get started, see the Quickstart guide and follow the prerequisites and Setting up sections.
Setting up walks you through the creation of variables containing various resources and subscription keys.
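The values below are placeholders for reference; your actual endpoint and keys come from the Azure portal, as the Quickstart describes:

# Placeholder values; copy your own from the Azure portal as described
# in the Quickstart's "Setting up" section.
ENDPOINT = "https://<your-resource-name>.cognitiveservices.azure.com/"
training_key = "<your-training-key>"
prediction_key = "<your-prediction-key>"
prediction_resource_id = "<your-prediction-resource-id>"

Use those variables to set up trainer and predictor credentials: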
from msrest.authentication import ApiKeyCredentials
from azure.cognitiveservices.vision.customvision.training import CustomVisionTrainingClient
from azure.cognitiveservices.vision.customvision.prediction import CustomVisionPredictionClient

credentials = ApiKeyCredentials(in_headers={"Training-key": training_key})
trainer = CustomVisionTrainingClient(ENDPOINT, credentials)
prediction_credentials = ApiKeyCredentials(in_headers={"Prediction-key": prediction_key})
predictor = CustomVisionPredictionClient(ENDPOINT, prediction_credentials)
Next we need to set up the project type (ObjectDetection) and domain (General) for the model we will create. Then we can create the project:
obj_detection_domain = next(domain for domain in trainer.get_domains() if domain.type == "ObjectDetection" and domain.name == "General")
print ("Creating project...")
project = trainer.create_project("face-resize", domain_id=obj_detection_domain.id)
Custom Vision supports the following domains for object detection: General, Logo, Products and Compact. Pick the domain that most closely matches your use case. For example, if you are looking for your company logo in images, use the “Logo” domain. The “Compact” domain is optimized for edge devices. I will explore compact models in a future post. If none of the other domains are appropriate, select “General”. For more information on domains, see Select a domain for a Custom Vision project.
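As a quick sanity check, you can list every domain your resource offers; a small sketch using the trainer client from above:

# print all available domains and their types
for domain in trainer.get_domains():
    print(domain.type, domain.name, domain.id)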
Tags are the same thing as labels. In object detection, there is a tag for each type of object to be detected (e.g., car, person, bicycle). You need to create a tag for each type of object in your dataset. For this project, we are only aiming to detect bear faces, so we only need to define one tag, which we’ll call bear:
bear_tag = trainer.create_tag(project.id, "bear")
Now we need to upload the dataset to Azure Custom Vision. Our dataset has 3730 images of bears with each bear face identified with a bounding box. You can upload and label the data using the web portal, much like I did in the previous classification example. In fact, the Custom Vision interface can provide suggestions for bounding boxes for many common objects (including bears!). Still, labeling 3730 images takes time, and I already have the data labeled. I will upload the images and labels using code.
There are many different formats for data labels. For bearid, we use a format defined by dlib’s imglab tool. The XML file for one image looks something like this:
<dataset>
<images>
<image file="image.jpg">
<box height="200" left="1172" top="1059" width="200">
<label>bear</label>
<part name="leye" x="1324" y="1132" />
<part name="nose" x="1279" y="1197" />
<part name="reye" x="1246" y="1133" />
</box>
</image>
<image file="/home/data/bears/imageSourceSmall/britishColumbia/melanie_20170828/bc_bellanore/IMG_5878.JPG">
<box height="211" left="593" top="462" width="211">
<label>bear</label>
[...]
</box>
</image>
[...]
</images>
</dataset>
Mainly, we care about the image file and box information. In our case, all the label entries are bear, which we have already defined as our tag. As I mentioned before, we will not be utilizing the landmarks (eyes and nose), so we can ignore all the part items. To access the XML data, we use the ElementTree XML API. We have a utility library, xml_utils, which helps us with parsing the XML. I have bearid cloned at ~/dev/bearid. Let’s import xml_utils and a few other common libraries:
import os
import sys
sys.path.append(os.path.expanduser('~/dev/bearid/tools'))  # expand '~' so Python can find the tools
import xml_utils as x
from collections import defaultdict
from PIL import Image
Now we can read in the XML file and load the objects from it using the load_objs_from_files function in xml_utils.
objs_d = defaultdict(list)
x.load_objs_from_files(['faceGold_train_resize.xml'], objs_d, 'faces')
The Custom Vision API allows you to upload images in batches of 64. So let’s set up a constant for the batch size and keep track of the current image_list and image_count:
MAX_IMAGE_BATCH = 64
image_list = []
image_count = 0
The next block of code is the largest (and messiest) part of this example. It could certainly be cleaned up by defining some functions. Here’s what it does:

- loop through each key (key is the label; in this case there is only bear)
- loop through the objs for one key (an obj in this case is an image)

Here’s the code:
# model classes used for regions and batch uploads
from azure.cognitiveservices.vision.customvision.training.models import (
    ImageFileCreateBatch, ImageFileCreateEntry, Region)

# loop through all the labels and get their corresponding objects
for key, objs in list(objs_d.items()):
    obj_count = 0
    obj_size = len(objs)
    # loop through objects (images) for each label
    for obj in objs:
        image_count += 1
        obj_count += 1
        file_name = obj.attrib.get('file')
        img = Image.open(file_name)
        width, height = img.size
        regions = []
        # find all the regions (bounding boxes)
        for box in obj.findall('box'):
            # find the bounding box coordinates
            bleft = int(box.attrib.get('left'))
            btop = int(box.attrib.get('top'))
            bheight = int(box.attrib.get('height'))
            bwidth = int(box.attrib.get('width'))
            # add bounding box to regions, translating coordinates
            # from absolute (pixel) to relative (percentage)
            regions.append(Region(tag_id=bear_tag.id, left=bleft/width,
                                  top=btop/height, width=bwidth/width,
                                  height=bheight/height))
        # add the image to the image list
        with open(file_name, "rb") as image_contents:
            image_list.append(ImageFileCreateEntry(
                name=file_name, contents=image_contents.read(), regions=regions))
        # if this is the last image or if we hit the batch size,
        # then upload the images
        if (obj_count == obj_size) or ((obj_count % MAX_IMAGE_BATCH) == 0):
            upload_result = trainer.create_images_from_files(
                project.id, ImageFileCreateBatch(images=image_list))
            if not upload_result.is_batch_successful:
                print("Image batch upload failed.")
                # if the error was a duplicate, keep going; otherwise exit
                for image in upload_result.images:
                    if (image.status != "OKDuplicate") and (image.status != "OK"):
                        print("Image status: ", image.status)
                        exit(-1)
            image_list.clear()
            obj_count = 0
            obj_size -= MAX_IMAGE_BATCH
You can view your labeled dataset in the web portal:
You can also use the web portal to edit your labels as needed.
Once your dataset is ready, you can set up your trainer and loop through the training iterations. A sleep command is added to wait for some time during each loop.
import time
iteration = trainer.train_project(project.id)
while (iteration.status != "Completed"):
    iteration = trainer.get_iteration(project.id, iteration.id)
    print ("Training status: " + iteration.status)
    time.sleep(1)
Alternatively, you can run the training using the web portal, much like in the classification example.
Once training is complete, you can see the cross-validation performance on the web portal:
In this case, for a probability threshold of 50% and an overlap threshold of 50%, we are getting 99.4% mean Average Precision. Since we have only one label, bear, the Average Precision for it is also 99.4%.
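You can also pull these numbers programmatically. A small sketch, assuming the SDK’s get_iteration_performance call with the same thresholds as the portal:

# fetch cross-validation performance for the trained iteration
performance = trainer.get_iteration_performance(
    project.id, iteration.id, threshold=0.5, overlap_threshold=0.5)
print("mAP: {:.1%}".format(performance.average_precision))
for tag_perf in performance.per_tag_performance:
    print(tag_perf.name, "{:.1%}".format(tag_perf.average_precision))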
Once you are happy with your trained model, you can publish it. Publishing makes it available as a prediction endpoint which can be called from your SDK-based code. Publishing is a matter of naming your iteration and calling the publish_iteration function.
publish_iteration_name = "bearfaceModel"
trainer.publish_iteration(project.id, iteration.id, publish_iteration_name, prediction_resource_id)
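With the iteration published, the predictor client we created earlier can call the endpoint. A minimal sketch (the test image name is hypothetical):

# run the published detector on a new image
with open("test_bear.jpg", "rb") as image_contents:
    results = predictor.detect_image(
        project.id, publish_iteration_name, image_contents.read())

# each prediction has a tag name, a probability and a relative bounding box
for prediction in results.predictions:
    print("{}: {:.2%} at left={:.2f}, top={:.2f}".format(
        prediction.tag_name, prediction.probability,
        prediction.bounding_box.left, prediction.bounding_box.top))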
Azure Custom Vision is a quick and easy way to build and deploy classification and object detection models. The web portal provides a no-code way to experiment with your dataset, but if you want to implement something more significant, the Custom Vision SDK is the way to go. With it, you can use your favorite language to upload and label data, train a model and publish it as a prediction endpoint.
Until next time, SBD.
In 2019, the BearID Project received a grant from Microsoft’s AI for Earth program. This grant provides access to AI tools and Azure compute resources to advance our research in noninvasive techniques for monitoring brown bears. For the past year, we have been focused on developing our application using our local deep learning machine, otis, and writing our first paper on the project (more on the paper in a future post). Fortunately, the AI for Earth program has extended our grant for another year, and this time we are making use of it!
The AI for Earth grant provides credits for Microsoft Azure. Azure includes a wide range of cloud services for building, testing, deploying and managing applications and services. Azure’s AI products range from low level infrastructure services, such as storage and compute, up to fully-managed Cognitive Services, such as speech translation and computer vision.
The Computer Vision service provides powerful, pre-trained machine learning models for computer vision applications. With a simple API call, you can extract a wealth of context from any image without any knowledge of machine learning. The Computer Vision service already knows about “brown bears”, so we could use this to find bears in photos or camera trap video frames before sending them to the bearid application. You can test the API on the Computer Vision webpage. Testing with one of our images, F011 - GC and cub from Glendale Cove, we received the following results.
The API returns the extracted context in a JSON structure. Examples are provided in a wide array of languages, including Python.
Pricing for the API varies depending on a number of factors (compute region, feature context and calling frequency), ranging anywhere from free (20 transactions per minute up to 5,000 per month) to $2.50 per 1000 images (10 transactions per second for full text description or text recognition). While the Computer Vision APIs are easy, powerful and cost effective (depending on your application), they are fixed function.
If you have a classification or object detection computer vision problem that is not covered by the Computer Vision APIs, and you have data to train a model but you don’t want to mess around with virtual machines, then the Custom Vision service might be right for you. This service lets you build and deploy your own image classifiers and object detectors in a few easy steps:

- create a project
- upload and tag your images
- train the model
- evaluate, then publish the trained iteration
You can utilize this service through the web portal or programmatically, using the Custom Vision SDK in your language of choice. For a quick test, I used the web portal to create a classifier to identify a subset of bears using the bear face chip images.
In the bearid application, we find, extract and normalize the bear faces in images (see Bear (C)hipsterizer). We have a set of these face chips as 150x150 pixel JPEGs. As a start, I followed the instructions in Quickstart: How to build a classifier with Custom Vision to create a classifier for 10 bears in our dataset. I created a new project with the following parameters:
The next step is to upload and label the dataset. I used “Add Images” and selected all the face images for a single bear. In the “My Tags” (aka labels) field I entered the identification of the bear (e.g., amber, bella, etc.). I did this for 10 bears from our dataset, resulting in the screen capture shown above. In this “Training Images” tab, you will see all the images with the tags and counts on the left.
Now it’s time to train the model. Click the green “Train” button. In this case I used “Quick Training” just to see some results. In less than a minute it was done. On completion, the page automatically switched to the “Performance” tab showing the model performance using cross-validation data.
You can see that even with less than 1 minute of training we are getting >80% on precision, recall and AP. By default, these numbers are shown using a “Probability Threshold” of 50%. There is a slider at the top of the left-hand pane which you can use to vary the threshold.
Beneath the general results you can find the “Performance Per Tag” (see image to right). Notice a warning icon by the “Image count” heading. Mousing over the icon brings up a message stating:
Unbalanced data detected. The distribution of images per tag should be uniform to ensure model performance.
You can also see a number of the image count bars are red. Mousing over these shows the following message:
We recommend having at least 50 images per tag to ensure model performance.
Clicking on a “Tag” link will show you which images were used for cross-validation for that tag and if they were classified correctly or not. Clicking on an image within that cross-validation set will show you the classification results for that image. For example, here are the results of two images of the bear “Also”: one correctly identified by the model, and one incorrectly identified as Bella:
Correct | Incorrect
--- | ---
Now would be a good time to add images to the data set to correct the imbalanced data and insufficient images per tag. We can then train new iterations using the “Quick Training” again, or we can specify a time budget (currently pricing is $20 per compute hour). I’ll skip this for now.
The final step is to deploy the model. Deploying is as simple as clicking the “Publish” button on the “Performance” tab. You can further test your model on the “Prediction” tab using “Quick Test” then uploading an image or entering a URL. You can choose to add these additional images to the dataset for use in subsequent training iterations. Once you have published a trained iteration, the Custom Vision service will generate an API. You can get the Prediction URL and Prediction Key (needed when you call the API) by clicking the “Prediction URL” button. You can now call this API from your application in your language of choice. The API will return a JSON string containing the results, similar to the Computer Vision API described earlier.
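If you’d rather skip the SDK, you can call the endpoint directly over REST. A hedged sketch in Python (the URL and key are placeholders from the “Prediction URL” dialog; the exact URL format may vary by service version):

import requests

# placeholder URL and key; copy your own from the "Prediction URL" dialog
prediction_url = "https://<region>.api.cognitive.microsoft.com/customvision/v3.0/Prediction/<project-id>/classify/iterations/<iteration-name>/image"
headers = {"Prediction-Key": "<your-prediction-key>",
           "Content-Type": "application/octet-stream"}

with open("bear_chip.jpg", "rb") as f:
    response = requests.post(prediction_url, headers=headers, data=f.read())

# the JSON response contains one prediction per tag
for prediction in response.json()["predictions"]:
    print(prediction["tagName"], prediction["probability"])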
In less than 30 minutes and for a few dollars (depending on your dataset; or for free on a Free Tier), you can train and deploy a Custom Vision model without writing a single line of code. For reference, it took me longer to write this blog than it did to perform all the steps on the Custom Vision portal! If you have a lot of data, you may find using the web portal a bit tedious. At this point you probably want to start using the Custom Vision SDK. In the next installment, I’ll take a look at implementing a bear ID classifier and bear face object detector using the SDK.
Until next time, SBD.
If you have been following the BearID Project, then you know we have developed an application to identify individual bears from photographs. The application is published on GitHub as bearid and the supporting deep learning networks are published at bearid-models (currently trained with bears from Katmai National Park in Alaska and Glendale Cove in British Columbia). Running the application is fairly simple; you call it like this:
bearid.py <image_file/directories>
where <image_file/directories> is the path to your images or a directory. It seems easy enough!
However, to get to that point, you need to download and build all the bearid binaries. This requires installation of various libraries like the Open Basic Linear Algebra Subprograms library (OpenBLAS), the Boost library (boost) and Dlib. We developed this using the Linux machine, nicknamed Otis, which we put together a few years back (see Building a Deep Learning Computer), so you need to deal with tools like the GNU C++ Compiler and cmake and worry about compatibility with the X Window System. Now it’s starting to get complicated.
The aim for the BearID Project is for it to be used by non-computer scientists, like our conservation biologist, Dr. Melanie Clapham. Melanie uses a Windows laptop in the field, and Mary and I use Otis or our MacBooks. One option could be to migrate everything to the cloud. We could then support a single OS there. Unfortunately, when Melanie is in the field, her Internet access is very limited, so having something running on her laptop would be extremely beneficial. So now what?
Docker is a set of services that utilizes virtualization to deliver software in containers. A Docker container is like a lightweight virtual machine that can run within the Docker Engine across a number of supported host operating systems. Everything needed to run inside the container is delivered as a Docker image. The Docker image includes the operating system, libraries and application. Once you have an image, you can run it on top of any host operating system that supports the Docker Engine, including Linux, Windows and macOS. It can run on a local machine or in the cloud. Once we have a Docker container, we can run it just about anywhere we need to. So let’s get started…
I’ll assume you are already familiar with Docker. If not, I recommend Getting Started with Docker. For more details, check out the Docker overview.
For our journey, the first step was to build up an image. This started with a Dockerfile, which tells Docker where to get all the components you want to use in your image. There are a lot of images you can start with; have a look on Docker Hub. Our main program, bearid, is Python 3 code and we prefer Linux, so we started with a Debian image with Python installed, called python:3.7-slim. You pull this in to your image using the FROM command in your Dockerfile like this:
FROM python:3.7-slim
Next, you add in all the packages you need for your application. Use Docker’s RUN command to call the relevant OS commands. For Debian, this means using apt-get for the packages. Our core application is C++ and uses dlib. To build it, we need tools like cmake and wget and libraries like Boost and BLAS. While we won’t have a GUI, part of our application does draw in image buffers using some X11 tools, so we need that too. After our FROM call, we have something like:
RUN apt-get -y update \
&& apt-get install -y build-essential cmake \
&& apt-get install -y wget \
&& rm -rf /var/lib/apt/lists/*
RUN apt-get -y update && apt-get install -y libopenblas-dev liblapack-dev
RUN wget -q https://sourceforge.net/projects/boost/files/boost/1.58.0/boost_1_58_0.tar.bz2 \
&& mkdir -p /usr/share/boost && tar jxf boost_1_58_0.tar.bz2 -C /usr/share/boost --strip-components=1 \
&& ln -s /usr/share/boost/boost /usr/include/boost
RUN apt-get -y update && apt-get install -y libboost-all-dev
RUN apt-get -y update && apt-get install -y libx11-dev
We are using dlib 19.7, so we’ll get that next:
RUN wget -q http://dlib.net/files/dlib-19.7.tar.bz2 \
&& tar -xjf dlib-19.7.tar.bz2
We use a tool called imglab from dlib to create XML files of all the images we will process. So we need to build and install that tool:
RUN cd dlib-19.7/tools/imglab \
&& mkdir build \
&& cd build \
&& cmake .. \
&& cmake --build . --config Release \
&& make install
Now let’s get the bearid code from our GitHub repo and build it:
# git is not included in python:3.7-slim, so install it before cloning
RUN apt-get -y update && apt-get install -y git
RUN git clone https://github.com/hypraptive/bearid.git \
&& cd bearid \
&& mkdir build \
&& cd build \
&& cmake -DDLIB_PATH=/dlib-19.7 .. \
&& cmake --build . --config Release
Finally, we need to get the pretrained models from our bearid-models repo:
RUN cd / && git clone https://github.com/hypraptive/bearid-models.git
Once you have all the components, you need to tell Docker what to run when the container is instantiated. You do this with the CMD command, which may look something like this:
CMD ["python","bearid.py","/home/data/bears/imageSourceSmall/images"]
Now the Dockerfile is complete. The next step is to build the image using the docker build command. You may want to tag the image so it is easy to reference; we used bearid as our tag. The build command is as simple as:
docker build -t bearid .
Now we have all the pieces we need in our Docker image. In fact, we have a little too much in there! Our initial bearid image was around 2GB! It turns out the development tools take up a lot of space and aren’t really needed to run the application. So to reduce the size of the final image, we used a staged build. The idea is to build 2 images: the first will have everything we need to build the application, and the second will only have what we need to run the application.
The first stage is most of what we had above, but we name the first stage by adding an AS clause to the FROM command:
FROM python:3.7-slim AS bearid-build
The first stage includes everything up to and including getting the bearid-models. At that point we start a new image by calling the FROM command again. You can use a completely different image in your FROM command if there exists a smaller but compatible image; we decided to stick with python:3.7-slim. We don’t need cmake or wget, but we do still need the X11, BLAS and Boost libraries, so we apt-get those:
FROM python:3.7-slim
RUN apt-get -y update && apt-get install -y libx11-dev
RUN apt-get -y update && apt-get install -y libopenblas-dev liblapack-dev
RUN apt-get -y update && apt-get install -y libboost-filesystem1.67.0
Next we want to copy the executables we built over to our new image. We do this with the COPY command. Remember that AS bearid-build clause we added above? Now we use that to tell COPY where to copy from. We will copy the bearid binaries, bearid.py, imglab and the models to the root of our new image:
COPY --from=bearid-build /bearid/build/bear* /
COPY --from=bearid-build /bearid/bearid.py /
COPY --from=bearid-build /usr/local/bin/imglab /usr/local/bin/imglab
COPY --from=bearid-build /bearid-models/*.dat /
Again we need to tell Docker how to run your code with CMD. You can also tell Docker where your working directory should be with the WORKDIR command:
WORKDIR /
CMD ["python","bearid.py","/home/data/bears/imageSourceSmall/images"]
With this staged build approach, our bearid image ended up being 388MB, about 1/6th the original size! You can find our latest Dockerfile here.
Running an image involves the use of docker run. There are a couple of useful flags to include. For example, -i keeps STDIN open and -t allocates a pseudo-TTY terminal. These are especially useful if you need basic interaction (for example, we print some status to the terminal). The --rm flag is useful to automatically remove the container instance after exiting; otherwise you end up with a bunch of stopped containers lying around. We also use the -v flag to bind mount a volume to the container. This is how we pass in the photos we want to identify from the host OS to the container. The argument for -v looks like <HOST_DIR>:<CONTAINER_DIR>. Our run command looks like this:
docker run -it --rm -v ~/dev/example/im_small:/home/data/bears/imageSourceSmall/images bearid
Running the bearid Docker image looks like this:
You can see that the bearid container checks for all the relevant files, runs through the underlying programs, then prints out the bear ID predictions for the images. In this test case, there were 7 images. The predictions are in the format <PREDICTION> : <IMAGE_NAME>. Note the images in the list all have _chip_0 in the name. This is actually showing the name of the chip file, containing only the bear’s face, produced by bearchip. If more than one face is found in an image, you would see _chip_1, _chip_2, etc.
The text printout is interesting, but you really want to see the image with the boxes and labels. For that, bearid outputs an XML file with all the relevant data and writes it back to the directory containing the source files (remember <HOST_DIR> from before?). Along with this XML file is an XSL file, a stylesheet which transforms the XML into something viewable. If you have a browser that supports viewing and loading from local files, you can view this file directly. This works well with Firefox and Internet Explorer (for Chrome and Safari, you may have to jump through a few hoops and disable some security features). Here’s the XML file viewed in Firefox:
Once we have a working Docker image, we can publish it for others to use. That way they don’t even need to run through the build process! For this we created a repository on Docker Hub using our hypraptive Docker ID. A published image needs to have a unique name, for which we need to use the proper namespace. We tag the repository with a string of the form
<DOCKER_ID>/<REPOSITORY_NAME>:<VERSION_TAG>
For our application we used hypraptive/bearid:1.1. After that, we push the image to the repository using docker push. You may need to log in to your Docker account using docker login. The command lines to tag and push our image are:
docker tag bearid hypraptive/bearid:1.1
docker push hypraptive/bearid:1.1
The docker push command will upload the image to Docker Hub in chunks. The time for this to complete will vary depending on the image size and your upload speed. It’s a good thing we reduced the size of the image using a staged build!
Now that hypraptive/bearid is published on Docker Hub, anyone can run it anywhere Docker is running. You will need an Internet connection to download the image the first time, but after that it will run from your local file system. Running the container looks like before when running locally, except now we use the full namespace (and currently the version tag 1.1):
docker run -it --rm -v ~/dev/example/im_small:/home/data/bears/imageSourceSmall/images hypraptive/bearid:1.1
So far, we have been testing this process on our local Linux machine (Otis) and MacBook laptops. The goal was to enable Melanie to run this on her Windows laptop. To accomplish this, Melanie installed Docker Desktop for Windows on her laptop. She had to fiddle a bit with the amount of memory allocated to Docker images in the Docker Desktop (see the Resource settings for your host platform: Mac or Windows). Our application needs ~3GB to run, but she had less than 2GB available. By scaling the input images down to around 640x480, the bearid image can run in 1.2GB, and it still works pretty well. Once that was set up, she ran the command line above, and voila!
Now that she has this running on her laptop, she can take it into the field and use it remotely! She will only need internet access if we push any new updates to the Docker Hub repository.
If you happen to have photos of bears from Katmai or Glendale Cove and want to see if bearid can identify them, give it a try yourself! Leave a comment if you do.
BearID Project team at Knight Inlet Lodge
The Knight Inlet Lodge is a great place to see brown bears! It is a completely floating facility, anchored near one shore of Glendale Cove. The lodge is only accessible by float plane or by sea. Apparently, most of the floating lodge was purchased from a fishing lodge near Vancouver Island and towed to its current location in 2012. There was a lodge in the Glendale Cove location prior to 2012, but it burned down due to a careless guest smoking a cigarette.
In 2017, the lodge was purchased by the Nanwakolas Council, a First Nations controlled entity mandated with securing economic development opportunities for the benefit of its five limited partner First Nations. One of the First Nations groups also maintains a set of trail cameras on their lands, and provides data to Melanie for her research and for the BearID Project.
Lenore with yearling cubs
For our first day at the lodge we had a fairly fixed schedule. We arrived by float plane from Campbell River in the late morning. We immediately went on a tour of the facilities while the staff handled luggage and departing guests. In June, the bears are mainly using the estuary, so the next step was an introductory estuary tour on a small boat. The boat holds up to 6 guests and a guide. The guide took us across the cove where we found Lenore and her 2 yearling cubs. After watching them forage along the shore for a while, we moved further toward the mouth of the cove where we found Lillian. After about an hour on the estuary, we headed back to the lodge just in time for lunch. After lunch we went out on a larger boat for a Knight Inlet cruise. We sped up the inlet, stopping at waterfalls and various sites along the way.
Knight Inlet cruise
After cruising the inlet, we had a short break back at the lodge. Then we were back on the estuary looking for bears. Melanie was able to join our tour to provide some additional commentary. She knows the bears quite well and was able to tell us some of their histories. In a little under 2 hours we saw Lenore and cubs, Lillian, Flora, Amber and Bella. We also saw a black bear that was hanging out near the lodge. Back at the lodge, we had time to freshen up before happy hour. During happy hour, you have a chance to select your activities for the next day. Besides bear viewing, activities include boat tours of Knight Inlet, whale watching, kayaking and various walking tours. Happy hour was followed by a very nice dinner, including wine and dessert. After dinner every night, there is an evening presentation from one of the guides.
On this particular evening, we were the presenters. Melanie, Mary and I gave an overview of the BearID Project to the staff and guests. Melanie described the need for monitoring bear populations and the challenges with monitoring them non-invasively. She described how she utilizes camera traps to help with this task, but that consistently identifying bears in the videos is difficult. This is when we introduced the BearID application. At this point, I gave a high level overview of machine learning and explained the process we use in the BearID application. I’m not sure if everyone was able to follow along, but at least they enjoyed the photos and the clips from camera traps.
On our second morning, we went out on the estuary tour with our guide, Anna. The tide was low and we saw a number of bears foraging for mollusks and crustaceans among the rocks and on pilings. Once again we saw Lenore and her cubs. We saw Toffee, the only male we saw on the trip, who was seemingly being followed by both Lillian and Flora. Lenore was very mindful of Toffee. Every time he would come her way, she and the cubs would scamper away. Further along the estuary we also ran across Cleo foraging along the shoreline.
After the morning tour, Mary and I met with Melanie to discuss our project and the paper we are planning to write. After lunch, Melanie took us to see some of the camera trap setups. We took a boat across the cove to where they have some trucks and a mini bus. Melanie took us in a research truck.
Riding in the research truck
Many of the camera traps are placed near the road or along heavily traveled bear trails. A couple are on bridges which cross the river and creeks. We stopped at some of the cameras along the road to swap out the memory cards and batteries. We continued along the road until we reached the viewing platform which is near where the bears fish for salmon on the Glendale River. The platform is not in use this early in the season as the salmon hadn’t started running there yet. It is similar to how I imagine the one at Brooks Falls must be. There are also a couple of viewing areas along the road up to the main platform.
On the way back down from the viewing platform to the inlet, we stopped at a few more camera sites. Some of the cameras were a couple hundred yards from the road. We had to walk along the bear trails to get to them. Fortunately we didn’t encounter any of the bears while we were out. Melanie also showed us some hair snares, which are basically a few strands of barbed wire wrapped around or strung between trees. The purpose of these snares is to catch hair samples from passing bears which can then be used for DNA testing.
We were back at the lodge in time for happy hour and dinner. On the way from dinner to the nightly presentation, we saw another black bear behind the lodge. The presentation was about tree communication via a fungal network under the ground (see this related story from the BBC). Even though it was still light after the presentation, we were ready to hit the hay. After all, we would have another early start on our third and final day at Knight Inlet.
Rainbow over Glendale Cove
Our last day at Knight Inlet Lodge started with a rainbow over Glendale Cove. We joined another morning estuary tour, this time guided by Bryn. We did not find a pot of gold at the end of the rainbow, but we did find Cleo. She was rambling along the shoreline across from the lodge, flipping rocks in search of breakfast. It’s amazing to see these bears flip over huge rocks as if they weigh nothing. It gives you some perspective on their strength and agility.
In addition to Cleo, we saw another bear eating barnacles off of the pilings. We also saw a number of birds, including a bald eagle fishing. It was quite successful in snatching a good sized fish from the water. It flew to the top of a piling and ate its well-earned breakfast.
That final tour went by way too quickly. Before we knew it, we were on a float plane back to Campbell River. Of course we wish the trip to the lodge was longer, but we had a great time. Overall, we saw 9 brown bears: 6 adult females, 1 adult male and 2 yearlings (Lenore’s cubs). We were familiar with the bears’ names, and had seen lots of photos, but we were not able to identify them ourselves. This gave us a new appreciation for what we are trying to accomplish with the BearID Project.
We hope to visit Knight Inlet Lodge again. In the meantime, we will be hard at work writing a paper on how far we have come as well as working to improve the application. Hopefully the next time we visit, our application will be able to identify all the bears we see.
This week Dr. Melanie Clapham is presenting the BearID Project at the International Conference on Bear Research & Management in Slovenia. In the abstract for the presentation, titled “Developing Automated Face Recognition Technology for Noninvasive Monitoring of Brown Bears (Ursus arctos)” (abstract ID 237 in the conference book of abstracts), we list a preliminary accuracy of 93.28 ± 4.9% for our face embedding network, bearembed. Considering that state of the art human face recognition is well above 99.5% on the Labeled Faces in the Wild (LFW) test set, our number seemed plausible. However, as we were updating our findings for the presentation, we found our story to be quite different.
As stated in previous posts, the BearID Project follows an approach based on FaceNet and utilizes networks and examples from dlib. For bearembed, we use a network for human face detection implemented in dlib (see this blog post). The example program, dnn_face_recognition_ex.cpp, implements a clustering algorithm using a pre-trained embedding network for human faces. To train the network on bears, we followed another example program, dnn_metric_learning_on_images_ex.cpp.
The premise is to take 2 face images and generate embeddings (run each face image through the bearembed network), calculate the distance between the embedding results, and compare it to a specified threshold. If the distance is less than the threshold, then the bears are the same. If the distance is greater than the threshold, then the bears are different.
The dlib example has a very rudimentary test set up. It builds a batch of data by randomly selecting M IDs and randomly selecting N face images for each ID. The values of M and N were both 5, so for one test batch, we would get 5 faces from 5 different bears. Generating all the combinations of 2 from the 25 faces results in 50 matching pairs (both faces are of the same bear) and 250 non-matching pairs (the faces are of different bears). We decided to run through 100 batches (or folds), then average the results. We were getting an average accuracy of 93.28%.
To compare an application’s ability to discriminate between pairs of faces, many researchers use the Labeled Faces in the Wild (LFW) data set. The LFW data set contains more than 13,000 images of faces collected from the web. For testing, the data set is split into 10 folds. Each fold contains 300 matching pairs and 300 non-matching pairs. Once you have a candidate network (which may be trained with outside data, depending on your entry category), you generate test results using cross-validation. Essentially, use 9 folds for training and 1 fold for testing. Do this 10 times, using a different fold for testing each time, then average the results. If you are training with outside data, simply test against each fold of the LFW data set and average the results.
In our test set up, we were using 50 matching pairs and 250 non-matching pairs. This ratio of 1 matching to 5 non-matching pairs skews the results, giving more weight to the network’s ability to determine if bears are different and less weight to its ability to determine if bears are the same. Our metric, accuracy, is the number of correct predictions over the total number of tests. You can split this into two parts by looking at the network’s accuracy on positive cases (True Positive Rate, or TPR) and negative cases (True Negative Rate, or TNR). Multiplying the TPR and TNR by the ratios of positive and negative examples and adding them together yields the accuracy:
Accuracy = TPR * Positive Ratio + TNR * Negative Ratio
From our previous results, our network had a TNR of ~98% and a TPR of ~65%. So the network is predicting negative examples much better than positive examples. Since our test set has more negative examples (250/300) than positive examples (50/300), our accuracy was skewed higher:
Accuracy = (65 * 50/300) + (98 * 250/300) ~= 93%
If we use an even ratio of positive examples and negative examples we should get a lower accuracy:
Accuracy = 65 * 0.5 + 98 * 0.5 = 81.5%
The second problem with the test set up we used is related to the number of face images (chips) we have per bear. In the histogram above you can see that we have many bears (>50) with few face chips (<10). In our test set, which is 20% of the data, most of the bears have fewer than 5 chips. Our test set up will still randomly pick those labels, and will pick 5 faces, even when there are not five, by picking the same face more than once. If there’s only one face, it will use that face 5 times. The embedding of a given face is deterministic, so if we compare a face to itself, it will always match. This skewed our TPR to be higher than it should be. If we guarantee a face chip is not compared to itself, the TPR goes down to ~45% while the TNR remains about the same. This leads to an even lower accuracy:
Accuracy = 45 * 0.5 + 98 * 0.5 = 71.5%
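The arithmetic is easy to check; a few lines of Python reproduce all three numbers:

def accuracy(tpr, tnr, pos_ratio):
    # weighted combination of positive and negative accuracy
    return tpr * pos_ratio + tnr * (1 - pos_ratio)

print(accuracy(0.65, 0.98, 50/300))  # ~0.925: skewed 1:5 test set
print(accuracy(0.65, 0.98, 0.5))     # 0.815: balanced pairs
print(accuracy(0.45, 0.98, 0.5))     # 0.715: balanced, no self-comparisons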
Our new test procedure more closely resembles LFW. We split the test set into folds and make sure of the following:

- each fold contains an equal number of matching and non-matching pairs
- a face chip is never compared with itself
As expected our reported accuracy went down to the low 70’s. Fortunately we have already made some improvements in the training process and have some other ideas in mind. We’ll explore those in later posts, so stay tuned!
The workshop was led by Priyanka Bagade, IoT Developer Evangelist at Intel. The content for the workshop, including presentations and labs, is available on Intel’s Smart Video Workshop GitHub. A Lenovo laptop was provided for the workshop, along with a Movidius Neural Compute Stick. The laptop was running Ubuntu 16.04 and came preinstalled with all the workshop content (setup instructions are included in the README).
The agenda for the workshop was as follows:

- introduction to OpenVINO and inference acceleration
- object detection with the OpenVINO toolkit
- accelerators: the Movidius Neural Compute Stick and FPGAs
- optimization tools and techniques, including VTune Amplifier
- advanced topics: multiple models and other frameworks
The focus of the workshop was accelerating deep learning inferencing using Intel technologies. Mostly this pertained to utilizing OpenVINO to optimize pre-trained networks and run them on different Intel hardware, including CPU, GPU, FPGA and VPU. We were mainly looking at convolution neural networks for images and videos.
Inference is the action of applying a trained neural network to an input to generate an output. For example, let us say you have a network trained to identify dogs vs. cats in an image. You now want to use this network on a new image to see if the network predicts the image contains a dog or a cat. In fact, you want to deploy this network on a cloud server to detect dogs or cats in millions of user images. This is inferencing.
Inferencing is much less computationally intensive than training. However, if you are running millions of inferences (as in the dog vs. cat service example), or you are running inferences on a constrained device (such as a Raspberry Pi), you may need acceleration to achieve the performance you desire. The performance criteria may include speed, power efficiency or both.
The OpenVINO Toolkit is a (mostly) open source toolkit from Intel. It works with pre-trained models in Caffe, TensorFlow or MXNet formats. The Model Optimizer converts the model into an intermediate format and performs some basic optimizations. The Inference Engine can then run the network on Intel CPUs, GPUs, FPGAs or VPUs (Movidius NCS). OpenVINO also contains tools for pre-processing and post-processing data which can be accelerated on CPUs or GPUs.
It is worth noting that OpenVINO uses a highly optimized library for CPU execution, so utilizing OpenVINO on your model will still provide performance improvements over running it with Caffe, TensorFlow or MXNet on the same CPU.
The object detection section of the workshop walks through the development flow using OpenVINO for an object detection application. It also dives a little deeper into the main components of OpenVINO: Model Optimizer and Inference Engine.
The Model Optimizer reads in a model from one of the supported frameworks. It converts the model to a unified model intermediate representation (IR). The network is captured in an XML file and the weights are stored in a BIN file. The Model Optimizer also optimizes the model by merging nodes, decomposing batch normalization and performing horizontal fusion (essentially eliminating function call overhead). It will also perform quantization to convert from the input format (usually FP32) to the target format (FP16, INT8) as needed. Intel has already validated more than 100 commonly used models.
Note: The Model Optimizer does not support every known layer type. For a list of supported layers and information on supporting custom layers, check the Model Optimizer Guide.
The Inference Engine is a unified API for inference across all Intel architectures. It provides optimized inference on Intel architecture hardware targets (CPU/GPU/FPGA/VPU). It provides heterogeneous support allowing execution of model layers across hardware types. It enables asynchronous execution to improve end to end performance. It provides a common framework for running inference on current and future Intel architectures.
Inference Engine APIs are supported in C++ and Python. The basic workflow is:

- read the IR (network and weights) with the network reader
- configure the input and output formats
- load the network onto the target device plugin
- run inference on your input data and process the results
Note: The Inference Engine supports different layers for different hardware targets. For a list of supported devices and layers, check the Inference Engine Guide
The lab instructions are provided in the repository as Object detection with OpenVINO™ toolkit. The lab uses a Caffe implementation of the MobileNet SSD model. Run the Model Optimizer (mo_caffe.py) to generate the IR.
A sample application for the Inference Engine is provided as main.cpp. It feeds a video to the inference engine and outputs the results. This sample follows the basic workflow described earlier. For example, the code to load the model and weights is provided as:
CNNNetReader network_reader;
network_reader.ReadNetwork(FLAGS_m);
network_reader.ReadWeights(FLAGS_m.substr(0, FLAGS_m.size() - 4) + ".bin");
network_reader.getNetwork().setBatchSize(1);
CNNNetwork network = network_reader.getNetwork();
Compile the application to tutorial1 and download a test video. Run the application on the CPU (the default):
./tutorial1 -i $SV/object-detection/Cars\ -\ 1900.mp4 -m $SV/object-detection/mobilenet-ssd/FP32/mobilenet-ssd.xml
Review the results with ROIviewer:
The sample application includes a flag to set the target device(s) and will load the appropriate inference plugin. We can run the application using the CPU:
./tutorial1 -i $SV/object-detection/Cars\ -\ 1900.mp4 -m $SV/object-detection/mobilenet-ssd/FP32/mobilenet-ssd.xml -d CPU
and the GPU:
./tutorial1 -i $SV/object-detection/Cars\ -\ 1900.mp4 -m $SV/object-detection/mobilenet-ssd/FP32/mobilenet-ssd.xml -d GPU
and compare the results:
At first I was surprised to see that the GPU was quite a bit slower than the CPU. Then I realized that the on-chip GPU (HD Graphics 530) is fairly wimpy compared to the quad-core Core i7-6700HQ.
OpenVINO enables running networks on heterogeneous hardware. One of the main reasons for heterogeneous support in OpenVINO is fallback. If the high performance hardware (e.g., an FPGA) doesn’t support all the layers in your model, then you can run the layers it does support on the fast hardware, and run unsupported layers on other targets (e.g., CPU or GPU).
The sample application supports a HETERO device, which allows you to prioritize the main device and a fallback device. For the lab, we used the CPU and GPU, and tried with each one as the main device. For example, here’s how we ran with the GPU as the prioritized device:
./tutorial1 -i $SV/object-detection/Cars\ -\ 1900.mp4 -m $SV/object-detection/mobilenet-ssd/FP32/mobilenet-ssd.xml -d HETERO:GPU,CPU
Since all the layers in the model are supported on both devices, the performance of each was about the same as if we ran with only that device:
OpenVINO provides additional examples of the Inference Engine APIs. These examples include a -pc flag, which shows performance on a per layer basis. The lab runs through such an example with a car image classifier using SqueezeNet.
The next section of the workshop utilizes the Intel Movidius Neural Compute Stick (NCS). The NCS is a neural network accelerator in a USB stick form factor. It features the Movidius Vision Processing Unit, Myriad 2. The VPU solutions are intended for edge applications where compute power is limited and low power is required. Example applications include drones, surveillance cameras and virtual reality (VR) headsets.
For the lab, we can run the same sample program, tutorial1, with the device flag set to MYRIAD:
./tutorial1 -i $SV/object-detection/Cars\ -\ 1900.mp4 -m $SV/object-detection/mobilenet-ssd/FP32/mobilenet-ssd.xml -d MYRIAD
The NCS only supports FP16, and so far we have only used FP32. We need to run the Model Optimizer to quantize the model using --data_type FP16:
python3 mo_caffe.py --input_model /opt/intel/computer_vision_sdk/deployment_tools/model_downloader/object_detection/common/mobilenet-ssd/caffe/mobilenet-ssd.caffemodel -o $SV/object-detection/mobilenet-ssd/FP16 --scale 256 --mean_values [127,127,127] --data_type FP16
The inference time is slower than the GPU and CPU:
Field-Programmable Gate Arrays (FPGAs) are a type of integrated circuit which can be configured after manufacturing. They consist of an array of programmable logic blocks, memory and reconfigurable interconnects. They can be programmed in the field to support any number of applications. Because they are programmable, they can be configured to support the exact acceleration necessary to support a specific neural network. Normally this requires extensive hardware development expertise, but OpenVINO aims to address some key challenges:
OpenVINO includes bitstream libraries, which are precompiled FPGA images for specific models. Custom bitstreams can be created using FPGA development tools. Bitstreams are loaded into the FPGA as part of the inference model for the device. Now the FPGA can be utilized as a target just like the others. Currently, OpenVINO has limited support for FPGA devices and network layers; see the Model Optimizer Guide for details. The HETERO plugin is pretty much mandatory to provide fallback support to the CPU and/or GPU.
There are a number of optimization tools and techniques available. For example, the Model Optimizer can perform some basic optimization, as previously mentioned. You can utilize the Model Optimizer and Inference Engine to run multiple models on multiple targets to find the best combination for your needs. You can utilize the performance counter APIs to get layer by layer performance numbers. You can utilize batching and asynchronous APIs to improve throughput. Increasing the batch size on tutorial1 provides some performance benefit:
By the time we get to a batch size of 16, the performance starts to decrease, likely due to reaching compute and/or memory constraints.
You may be able to trade off a little precision for better performance by quantizing the model; for example, the GPU runs faster at FP16 than at FP32 with little impact on accuracy.
VTune Amplifier is a performance profiler from Intel. It has heterogeneous capabilities (CPU and GPU) and can be used with OpenVINO Inference Engine applications. The lab provided some hands on experience with VTune Amplifier using the same tutorial1 application.
We used the tool to analyze the timing data and top hotspots. The timing data includes things like CPU time, Clocks Per Instruction (CPI) rate and context switch time. The tool also listed the top 5 functions and top 5 tasks by CPU time, as well as the effective CPU utilization histogram (cores and threads). A bottom-up view shows a sortable list of all functions. It includes an execution timeline which provides some insight into task switching and thread stalls.
To complete the lab, we reran the application using different parameters to see how they affect the timeline. We also used the tool to compare two different runs.
In the final section of the workshop we looked into some advanced topics such as chaining models and running multiple models on different hardware. We also ran through a TensorFlow example.
For this exercise we used the security barrier example included with the OpenVINO toolkit. This example uses 3 models to detect cars, their number plates, color and number plate attributes from the input video or image of the cars. The Intel models included in the application are:

- a vehicle and number plate detector
- a vehicle attributes classifier (vehicle type and color)
- a number plate text recognizer
There are a number of pre-trained models provided with the OpenVINO toolkit. Running the example application with a car image produced this result:
For this exercise we used the face detection example included with the OpenVINO toolkit. This example application can utilize models for face detection, age and gender detection, head pose estimation and emotion detection. We can also assign each model to run on different hardware. We ran the following models on the specified hardware:

- face-detection-retail-0004 (FP16) on the Movidius NCS (MYRIAD)
- age-gender-recognition-retail-0013 (FP32) on the CPU
- head-pose-estimation-adas-0001 (FP16) on the GPU
- emotions-recognition-retail-0003 (FP16) on the GPU

with the built-in camera, /dev/video0, as the input, using this command:
./interactive_face_detection_sample -i /dev/video0 \
-m $models/face-detection-retail-0004/FP16/face-detection-retail-0004.xml -d MYRIAD \
-m_ag $models/age-gender-recognition-retail-0013/FP32/age-gender-recognition-retail-0013.xml -d_ag CPU \
-m_hp $models/head-pose-estimation-adas-0001/FP16/head-pose-estimation-adas-0001.xml -d_hp GPU \
-m_em $models/emotions-recognition-retail-0003/FP16/emotions-recognition-retail-0003.xml -d_em GPU
The application output the video stream with all the model annotations like this:
The final exercise of the day was to run through an example using a TensorFlow model. The laptops already had the TensorFlow framework installed. We cloned the TensorFlow model repository and used an inception_v1 model. There are a few steps to setting up the TensorFlow model before running the Model Optimizer. Once we have completed these steps and run mo_tf.py, we have the OpenVINO IR model. A sample application, classification_sample, was provided. Once we had all the pieces in place, we ran:
./classification_sample -i car_1.bmp -m inception_v1_frozen.xml
The sample application utilizes the Inference Engine to run the model. The output is the top 10 results of the classifier for an input image.
I enjoyed the workshop. Priyanka did a great job presenting the OpenVINO material, and handed off to other Intel colleagues for presentations on FPGA and VTune Amplifier. There were a few other Intel employees on hand to assist with the labs. The labs were pretty easy since everything was set up and it’s pretty much cut and paste, but it still provided a pretty good overview of how and why you might want to use OpenVINO.
Conceptually, OpenVINO is a great toolkit for deploying deep learning inference. Having a common framework that can combine models from different frameworks and run them on different hardware is compelling. The obvious limitation is lack of support for non-Intel architectures. Since the toolkit is meant to be open source, perhaps other vendors will add in support for their architectures. It would be great to see support for NVIDIA, AMD, Arm and others. In the meantime, I may try it out on an Intel Atom board with the Movidius NCS.
VTune Amplifier is also a very powerful tool. It looks especially useful for deploying on Intel CPUs. Maybe I will try it with the Atom board as well.
Keep an eye out for this workshop. It’s worth a go if it comes near you. You can always work through the workshop content on your own, but that won’t get you any of the free giveaways!
For the BearID Project, we’re currently following an approach based on FaceNet. Their dataset included hundreds of millions of images of millions of different individuals. Our dataset isn’t nearly as large as that, but it’s still many thousands of images. This article examines our approach to managing the data.
Deep learning is generally synonymous with large datasets. Our current dataset is about 80GB, though we expect it to grow by as much as an order of magnitude (and that’s still not large in comparison). During the training process, we may create a lot of new data, such as intermediate images, metadata and network weights files. All this data can add up.
Speed is another consideration. You most likely won’t be able to load all of your data into memory at one time (especially when using GPU acceleration), so you’ll have to load batches of data at a time. You want to be able to load the data as quickly as possible. You can hide some of the loading time by having your CPU start processing the next batch while your GPU is training the current batch, but fast storage can still be beneficial.
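One simple way to get that overlap is a producer thread that keeps a small queue of preprocessed batches ahead of the training loop. A minimal sketch (load_and_preprocess, train_step and batch_list are hypothetical stand-ins for your own pipeline):

import queue
import threading

batch_queue = queue.Queue(maxsize=2)  # keep a couple of batches ready

def producer(batches):
    for b in batches:
        batch_queue.put(load_and_preprocess(b))  # CPU-side work (hypothetical helper)
    batch_queue.put(None)  # sentinel: no more batches

threading.Thread(target=producer, args=(batch_list,), daemon=True).start()

while True:
    batch = batch_queue.get()
    if batch is None:
        break
    train_step(batch)  # GPU-side work overlaps with the producer (hypothetical helper)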
We built our deep learning computer with a fairly fast (up to 540 MB/s for reads) Solid State Drive (SSD). They are more expensive than traditional Hard Disk Drives (HDD), so we only got a 500GB SSD. As our dataset grew (and as we started working with other datasets, like for Kaggle competitions), we decided to add a 4TB HDD which runs at about 1/3 the speed of the SSD (up to 180 MB/s). We can store all our datasets on the HDD and copy over working sets to the SSD as needed.
Just to be safe, we also back up our key datasets on a portable drive and a Network Attached Storage (NAS) system.
How you organize your data is somewhat dependent on your application. For a basic image classification problem, you probably have a bunch of images and a set of metadata containing the labels for the images. You split your data into training and test data, pick some architectures and have at it.
The bearid application reads in a photo and outputs a bear ID, much like a typical image classifier. However, as described in previous posts (FaceNet for Bears), the BearID application follows a four stage pipeline:
1. Face detection (bearface)
2. Face chipping (bearchip)
3. Face embedding (bearembed)
4. Face classification (bearsvm)
Each stage, aside from bearchip, incorporates a neural network that requires training. Each stage has a set of inputs and outputs that need to be organized. The basic flow is as follows:
- bearface - Takes a source image and outputs a bounding box and 5 face points (eyes, ears and nose) for each bear face. It writes out this metadata in an XML file. It uses a neural network from dlib which was trained on dog faces. As we grow our bear dataset, we will retrain the network on bears.
- bearchip - Takes a source image and metadata and writes out a face chip (a pose-normalized, cropped image). We will need to experiment with the parameters for normalization and cropping.
- bearembed - Takes a face chip and outputs an embedding file. It uses a neural network which was trained using our initial data. We will need to retrain using different datasets as well as different parameters for bearface and bearchip.
- bearsvm - Takes an embedding file and determines the ID of the bear using a Support Vector Machine (SVM). The SVM will be retrained as the embeddings change. We may also choose to replace it with a different classifier method.
During training, each stage utilizes large sets of input data and generates large sets of output data. We will likely run many experiments with each stage. We need to manage the results of all the experiments.
Our source data consists of photos of bears. The photos come from a number of different contributors (park staff, researchers, photographers, etc.) taken at different locations (Brooks Falls, Glendale Cove, etc.). We decided to keep our contributors separated so we can more easily trace back to the source and discern new data from older data. Since bears also change in appearance over the years (and even over the course of one year), we wanted to keep some date information.
Image courtesy of Katmai National Park
For simplicity, we decided to use a directory structure rather than implement a database. We have a top level directory with sub-directories for each location. Under each location, we have directories for each contributor and date of contribution. Under each contributor, we have a directory for each bear in the dataset and separate directories for photos containing unknown bears and/or multiple bears (including cubs). Our directory structure looks something like this:
image_source/
├── location_01
│ └── contributor01_date
│ ├── bear_id_01
│ ├── ...
└── location_02
├── contributor02_date
│ ├── bear_id_01
│ ├── ...
│ ├── bear_unknown
│ └── bear_multiple
└── contributor03_date
├── bear_id_cub
└── ...
When we receive photos from contributors, they are not necessarily structured the same way. We move them around manually or with scripts. The bear_id directory names are unique identifiers per location which serve as the labels (more on labeling later). We generate XML files to collect the various images into the training, validation and test sets we want to utilize.
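To give a flavor of how the directory names become labels, here is a hedged sketch (not our actual script) that walks the layout above, skips the unknown/multiple directories, and carves out training, validation and test splits; the extension, split ratios and seed are illustrative.

import random
from pathlib import Path

# Assumed layout: image_source/<location>/<contributor_date>/<bear_id>/*.jpg
records = []
for img in Path("image_source").glob("*/*/*/*.jpg"):
    location, bear_id = img.parts[-4], img.parts[-2]
    if bear_id in ("bear_unknown", "bear_multiple"):
        continue  # skip photos without a single, known bear
    # IDs are only unique per location, so qualify the label with it.
    records.append((str(img), f"{location}/{bear_id}"))

# Shuffle once with a fixed seed, then carve out 80/10/10 splits.
random.seed(42)
random.shuffle(records)
n = len(records)
train = records[: int(0.8 * n)]
val = records[int(0.8 * n): int(0.9 * n)]
test = records[int(0.9 * n):]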
As we experiment with each stage, we keep the results to be used for subsequent stages. Since each stage of the pipeline depends on the previous stage, we decided to structure our results accordingly:
face_config_01
├── face_meta
├── chip_config_01
│ ├── chip_meta
│ ├── chip_images
│ │ ├── location_01
│ │ │ ├── bear_id_01
│ │ │ └── ...
│ │ └── location_02
│ │ └── ...
│ ├── embed_config_01
│ │ ├── embed_meta
│ │ ├── embeddings
│ │ │ ├── bear_id_01
│ │ │ └── ...
│ │ ├── class_config_01
│ │ │ ├── class_meta
│ │ │ └── classifier_01
│ │ ├── class_config_02
│ │ │ └── ...
│ │ └── ...
│ ├── embed_config_02
│ │ └── ...
│ └── ...
├── chip_config_02
│ └── ...
└── ...
face_config_02
└── ...
- face_config_XX: The top levels correlate to the bearface stage. Each bear face configuration has a separate directory for its results (face_config_01, face_config_02, …). Results include the face metadata XML files (bounding box, face points and ID labels) and the bearface neural network configuration and weights. Currently we have only 1 face configuration based on manually adjusted labels (more on labels below). This will change when we retrain the bearface network.
- chip_config_XX: The second levels correlate to the bearchip stage. Since the inputs are different for each bearface configuration, they appear under the appropriate face_config_XX. Each bearchip configuration will produce different results, so there are multiple chip_config_XX under each face_config_XX. The results include the face chip images and relevant metadata.
- embed_config_XX: The third levels correlate to the bearembed stage. Again they appear under each chip_config_XX. Training the embedding network with different chip configurations and different parameters will result in new network weights. For a given set of parameters and weights, a full set of embeddings is generated.
- class_config_XX: The final levels correlate to the bearsvm stage. They appear under each embed_config_XX. While we are currently using an SVM for classification, this is something we will experiment with. Training the classifier with new embeddings and architectures will result in different networks and weights. These are the final outputs of the training process.
As we try different experiments, some will be successful, and some not. We can abandon the unsuccessful paths and focus on those which are more promising. Our path from source image to bear ID with the best results (based on our metrics) will be used for the end-to-end bearid application.
Labeled training data is the cornerstone of supervised learning. In our case, the main labels we deal with are the bear identity and the face metadata.
Each source image is labeled with the ID of the bear(s) in the image. The labels are given by experts in the field who (hopefully) really know the bears. Generally contributors provide us photos which have been sorted into directories with the ID of the bear. Occasionally the bear ID is part of the file name. In either case, we copy them into the directory structure described previously. We carry the labels through all the metadata that we generate along the way so it can be utilized by the appropriate training and testing code. The bear ID is utilized by both bearembed and bearsvm.
| ID | Example Face Images |
|---|---|
| Also | (image) |
| Beatrice | (image) |
| Chestnut | (image) |
Images courtesy of the Brown Bear Research Network
The face metadata to train bearface is more cumbersome. To create a face and key point detector, you need to label each source image with a bounding box for the face and a point for each of the nose, left eye, right eye, left ear and right ear.
Image Courtesy of Katmai National Park
Fortunately, we were able to use an example program from the Dlib Toolkit called Dog Hipsterizer (see the post Hipster Bears). It works pretty well with bears, but not perfectly. We run all our source images through a modified version of the dog hipsterizer to get the initial face metadata XML files. Then we use another dlib tool called imglab to manually fix the boxes and key points. As our dataset grows, maybe we can farm this task out to volunteers or to a platform like Amazon Mechanical Turk.
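For reference, this face metadata lives in dlib imglab-style XML files. Below is a sketch of generating one with Python’s standard library; the file path, box coordinates and part names are invented for illustration, so check them against your own imglab output before relying on them.

import xml.etree.ElementTree as ET

# One image with one labeled box and five face parts (all values invented).
dataset = ET.Element("dataset")
images = ET.SubElement(dataset, "images")
image = ET.SubElement(
    images, "image",
    file="image_source/location_01/contributor01_date/bear_id_01/photo.jpg")
box = ET.SubElement(image, "box", top="90", left="110", width="200", height="180")
ET.SubElement(box, "label").text = "bear_id_01"
for name, x, y in [("nose", "210", "205"), ("leye", "170", "150"),
                   ("reye", "250", "150"), ("lear", "130", "95"),
                   ("rear", "290", "95")]:
    ET.SubElement(box, "part", name=name, x=x, y=y)

ET.ElementTree(dataset).write("face_meta.xml")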
The final point I’ll cover here is data cleaning. Essentially, you want the dataset to represent the problems you are trying to solve as accurately as possible. After all: garbage in; garbage out.
Here are some examples which illustrate how the pose of a bear can impact the face chip, even when normalized:
Images courtesy of the Brown Bear Research Network
Managing data for a deep learning project can be a chore. At the outset you should consider issues like storage, organization, labeling and cleaning. If you address each of these issues well, you will have higher productivity, improved performance and better results. It’s always a good time for spring cleaning!
]]>Before we start to think about machine learning in 3D, let us first understand how we can sense the world in 3D. We know that one of the most common sensors we use for 2D sensing is the digital camera. What are the options for 3D image sensing? It turns out there are quite a few.
The idea with 3D imaging is to figure out where everything is in 3D space. Figuring out where every point of every object is in 3D presents a lot of challenges. Instead, these 3D technologies sample a subset of points in the scene, determining their X/Y/Z coordinates to create a point cloud. The image below shows a point cloud from 2 different perspectives.
The following sections describe various 3D imaging technologies.
Photogrammetry is the use of photography to make measurements. Photogrammetry has many applications, one of which is generating 3D structure from multiple 2D images. The method involves photographing a scene from multiple, overlapping angles. By applying concepts from optics and geometry, points from overlapping images can be correlated and their location in 3D space can be calculated (see image from the paper Reconstructing Rome below).
These calculations can be used to create a point cloud of the scene, but only if there are a sufficient number of images with sufficient overlap. Since this method produces a 3D structure using photos from various locations, it is also known as Structure from Motion (SfM). For an interesting example of SfM, check out Building Rome in a Day, where a research team from the University of Washington reconstructed 3D models of Rome (and a few other cities) using photos harvested from Flickr.
Pros: Can use any digital cameras; no light projection
Cons: Requires a lot of photos with lots of overlap; needs a lot of compute
A stereo camera utilizes two image sensors fixed at a known distance and orientation to emulate binocular vision. This is a special case of photogrammetry called stereophotogrammetry. The stereo setup can be used to calculate a depth map using triangulation.
A depth map is a 2D representation of the 3D information. Each pixel in the depth map represents the distance of the corresponding object from the camera. Here’s an example depth map image, where the redder areas are nearer and the bluer areas are farther.
The depth information can be combined with the RGB data to calculate a point cloud of the scene. The calculations for a stereo camera are similar to photogrammetry, but the relationship between the two cameras is known, requiring less computation.
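To make the triangulation concrete, here is a short sketch of the standard pinhole-camera math: depth from disparity (Z = f·B/d), then back-projection of the depth map into a point cloud. This is generic geometry, not code from any particular camera SDK, and the focal length, baseline and principal point values are invented.

import numpy as np

# Illustrative camera parameters (not a real calibration).
f = 700.0              # focal length in pixels
baseline = 0.12        # distance between the two cameras, in meters
cx, cy = 320.0, 240.0  # principal point (image center)

# Stand-in for a real disparity map from stereo matching.
disparity = np.random.uniform(1.0, 64.0, size=(480, 640))

# Triangulation: depth Z = f * B / d, computed per pixel.
depth = f * baseline / disparity

# Back-project each pixel (u, v) with depth Z into a 3D point (X, Y, Z).
v, u = np.indices(depth.shape)
X = (u - cx) * depth / f
Y = (v - cy) * depth / f
points = np.stack([X, Y, depth], axis=-1).reshape(-1, 3)  # the point cloud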
An example of stereo camera is the ZED from Stereolabs.
Pros: Fairly small; no light projection
Cons: Still needs a lot of compute; depth perception limitations when cameras are close to each other
Structured light utilizes a projected pattern of light to augment the scene. Patterns can range from lines and grids to random dot patterns. The camera senses the scene and calculates the depth information based on the deformation of the known pattern. To avoid disrupting the scene, the projected light pattern is often something outside the visible spectrum, such as infrared (IR).
Some good examples of structured light are the latest Intel RealSense cameras, the D415 and D435 (pictured above). They actually combine structured light and stereo (Structure+Stereo) by utilizing an IR projection and two IR image sensors to provide more accurate depth information. The depth map is combined with a color camera to produce point clouds.
Pros: Fairly small; easier computation; Structure+Stereo has better resolution and accuracy compared to Structured or Stereo alone
Cons: Requires projected light
LiDAR determines the distance to a target by projecting a laser and measuring the time it takes to receive the reflection. By scanning the laser across a scene, a point cloud can be generated; however, the point cloud does not provide color information. These laser-scanning LiDAR systems are used quite extensively in autonomous cars. The image below shows a Google Car and LiDAR image from an article in Popular Science. The rings indicate the scanning pattern of the laser.
Pros: High accuracy and good scene coverage with scanning LiDAR; little computation required
Cons: No color; fairly large and heavy; quite expensive; requires projected light
Similar to LiDAR, a Time of Flight (ToF) Camera also determines distance by measuring the “time of flight” of light. Rather than scanning a single laser across a scene, the ToF camera projects a broad beam and captures the entire scene at once. The sensors are arranged in a grid, much like image sensors, and distance is measured on a pixel by pixel basis. There are a number of projection/sensing technologies including phase detection, range gated imagers and direct ToF imagers. ToF camera technology is sometimes referred to as scannerless LiDAR or Flash LiDAR.
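The distance arithmetic behind all of these time-of-flight approaches is simple: the light covers the camera-to-object path twice, so the range is half the round-trip time multiplied by the speed of light. A quick sanity check (the timing value is invented):

C = 299_792_458.0  # speed of light in m/s

def tof_distance(round_trip_seconds: float) -> float:
    # The light travels out to the object and back, so halve the round trip.
    return C * round_trip_seconds / 2.0

# A 20 nanosecond round trip puts the object about 3 meters away.
print(tof_distance(20e-9))  # ~2.998 m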
A popular example of a ToF camera is the Kinect for Xbox One (the original Kinect for Xbox 360 was a structured light camera):
Pros: Fairly small; little computation required
Cons: Limited resolution; requires projected light; fairly new (so still a bit pricey)
We wanted to start experimenting with 3D sensing, so we compared the various technologies. We wanted a solution that was reasonably accurate, real time, low cost and fairly small and light weight (for mobile applications). We considered each of these criteria (accuracy, speed, cost and size/weight) on a scale of 0-3 (0=worst, 3=best):
Based on our selection criteria and the results of that comparison, we decided to go with a Structure+Stereo solution. Specifically, we chose Intel RealSense. We wanted to get the Intel RealSense D435, but at the time we tried to buy one, they were already on back order. Instead, we settled for the R200 as part of the Intel RealSense Robotic Development Kit.
We will talk about that in more detail next time.
]]>