Real-Time Sign Language Detection (TensorFlow 2 + SSD MobileNet v2)

Introduction

Rancy Chepchirchir
5 min read · May 14, 2023

This project was inspired by Nicholas Renotte and his wonderful YouTube tutorial 'Real Time Sign Language Detection with Tensorflow Object Detection and Python | Deep Learning SSD'. The project builds a sign language detection model using the OpenCV library, with a web camera capturing images of hand gestures. Once the images have been captured and labeled, the pre-trained SSD MobileNet v2 model is used for sign identification. The result is a workable communication channel between the hearing audience and the deaf. To address the problem, three stages must be carried out in real time:

i.) Capturing footage of the user signing (input)

ii.) Classifying each frame of the video as a sign

iii.) Reconstructing and displaying the most likely sign from the classification scores (output)

This model employs a pipeline that receives input from a user signing a gesture in front of a web camera, extracts individual video frames, and then predicts a sign language option for each gesture.
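As a rough, hypothetical illustration of such a loop, the sketch below grabs webcam frames with OpenCV and feeds them to a detector exported with the TensorFlow Object Detection API. The model path, label ids, and 0.7 score threshold are assumptions for illustration, not the project's exact code:

```python
import cv2
import numpy as np
import tensorflow as tf

# Hypothetical paths and labels -- substitute your own exported model.
detect_fn = tf.saved_model.load("exported_model/saved_model")
labels = {1: "Hello", 2: "Yes", 3: "No", 4: "Thank you", 5: "I love you"}

cap = cv2.VideoCapture(0)  # default webcam
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # The Object Detection API expects a batched uint8 tensor.
    input_tensor = tf.convert_to_tensor(frame[np.newaxis, ...], dtype=tf.uint8)
    detections = detect_fn(input_tensor)
    # Detections come sorted by score; take the top one.
    score = detections["detection_scores"][0][0].numpy()
    cls = int(detections["detection_classes"][0][0].numpy())  # 1-based ids assumed
    if score > 0.7:  # arbitrary confidence threshold
        cv2.putText(frame, f"{labels.get(cls, '?')} ({score:.2f})",
                    (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    cv2.imshow("Sign detection", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```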

Methodology

The study involved several steps: installing the TensorFlow Object Detection API; installing LabelImg for labeling the dataset; downloading the pretrained weights for the selected model (in our case, SSD MobileNet v2); copying the base config for the model; creating the label map (label_map.pbtxt) file; and updating the model configuration file.
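For reference, the label map is a small protobuf text file that maps each class name to an integer id, starting at 1. A minimal sketch of generating it (the output path is an assumption; the class names are this project's five signs):

```python
# Write the label map consumed by the TF Object Detection API.
labels = ["Hello", "Yes", "No", "Thank you", "I love you"]

with open("annotations/label_map.pbtxt", "w") as f:
    for idx, name in enumerate(labels, start=1):  # ids must start at 1
        f.write("item {\n")
        f.write(f"    name: '{name}'\n")
        f.write(f"    id: {idx}\n")
        f.write("}\n")
```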

Dataset

For this project, a user-defined dataset is used: a collection of over 2,000 images, around 400 for each class. The dataset covers five hand gestures/signs: i.) Hello, ii.) Yes, iii.) No, iv.) Thank you, and v.) I love you; a small vocabulary that is quite useful for a real-time application.

Sample images for each class: a.) Hello b.) Yes c.) Thank you d.) No e.) I love you
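A minimal collection script in this spirit, assuming an images/<label> directory layout and OpenCV webcam capture (the image count roughly matches the dataset described above; the delays are illustrative):

```python
import os
import time
import uuid
import cv2

labels = ["Hello", "Yes", "No", "Thank you", "I love you"]
images_per_label = 400   # roughly matches the dataset size described above
capture_delay = 2        # seconds between shots, to vary the pose

cap = cv2.VideoCapture(0)
for label in labels:
    os.makedirs(os.path.join("images", label), exist_ok=True)
    print(f"Collecting images for '{label}'...")
    time.sleep(5)  # time to get the hand into position
    for _ in range(images_per_label):
        ok, frame = cap.read()
        if not ok:
            continue
        # Unique file names avoid overwriting earlier captures.
        path = os.path.join("images", label, f"{uuid.uuid1()}.jpg")
        cv2.imwrite(path, frame)
        time.sleep(capture_delay)
cap.release()
```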

Algorithm

i.) Convolutional Neural Network (CNN)

A Convolutional Neural Network (ConvNet/CNN) is a deep learning architecture that can take an input image, assign importance (learnable weights and biases) to various elements and objects in the image, and distinguish between them. Compared to other classification algorithms, a ConvNet requires far less pre-processing.
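To make this concrete, here is a minimal, hypothetical Keras ConvNet of the kind described: stacked convolution and pooling layers that learn image features, followed by a dense classifier for the five signs (the 128x128 input size and layer widths are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# A toy ConvNet: convolutions learn local filters (the learnable weights
# and biases), pooling shrinks the feature maps, dense layers classify.
model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),      # illustrative input size
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(5, activation="softmax"),  # five gesture classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```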

Tools Used:

a.) TensorFlow: An open-source artificial intelligence library that builds models using data flow graphs. It enables developers to build large-scale neural networks with many layers. TensorFlow is mostly used for classification, perception, comprehension, discovery, prediction, and creation. [11]

b.) Object Detection API: An open-source TensorFlow API for locating objects in an image and identifying them.

c.) OpenCV: An open-source, highly optimised library for tackling computer vision problems, primarily focused on real-time applications where computational efficiency matters. [12]

d.) LabelImg: A graphical image annotation tool for labeling the bounding boxes of objects in images. [13]
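As a small illustration of how these tools fit together: LabelImg saves each image's boxes as a Pascal VOC XML file, which is later converted to TFRecords for the Object Detection API. A sketch of reading one such annotation (the file path is hypothetical):

```python
import xml.etree.ElementTree as ET

# Parse one LabelImg (Pascal VOC format) annotation; the path is illustrative.
tree = ET.parse("images/Hello/some_image.xml")
root = tree.getroot()

filename = root.find("filename").text
for obj in root.findall("object"):
    name = obj.find("name").text        # class label, e.g. "Hello"
    box = obj.find("bndbox")
    xmin, ymin = int(box.find("xmin").text), int(box.find("ymin").text)
    xmax, ymax = int(box.find("xmax").text), int(box.find("ymax").text)
    print(f"{filename}: {name} at ({xmin}, {ymin})-({xmax}, {ymax})")
```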

Findings

The model was trained using transfer learning, starting from the pre-trained SSD MobileNet v2 model. Transfer learning is the process of reusing a model trained on one problem to help solve a second, related problem: a neural network is first trained on a task similar to the one being addressed, and one or more of its learned layers are then reused in a new model trained on the problem of interest [14]. The MobileNet SSD model is a single-shot multibox detection (SSD) network that performs object detection in a single forward pass, predicting bounding box coordinates and class probabilities directly from the image pixels. In contrast to standard residual models, its architecture is built on the notion of an inverted residual structure, in which the residual block's input and output are narrow bottleneck layers; nonlinearities in the intermediate layers are reduced, and lightweight depthwise convolutions do the filtering [15]. A sketch of this block type follows, and the resulting accuracies are summarized in the table below.
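As a rough sketch (not the library's exact implementation), a MobileNetV2-style inverted residual block can be written in Keras as follows; the expansion factor of 6 and the channel sizes in the usage example are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers

def inverted_residual(x, out_channels, expansion=6, stride=1):
    """Sketch of a MobileNetV2-style inverted residual block."""
    in_channels = x.shape[-1]
    # 1x1 expansion: widen the narrow bottleneck input.
    h = layers.Conv2D(in_channels * expansion, 1, padding="same",
                      use_bias=False)(x)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(6.0)(h)
    # 3x3 depthwise convolution: one lightweight filter per channel.
    h = layers.DepthwiseConv2D(3, strides=stride, padding="same",
                               use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(6.0)(h)
    # 1x1 linear projection back to a narrow bottleneck (no nonlinearity).
    h = layers.Conv2D(out_channels, 1, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    # Residual connection only when input and output shapes match.
    if stride == 1 and in_channels == out_channels:
        h = layers.Add()([x, h])
    return h

# Illustrative usage: one block on a 32x32 feature map with 24 channels.
inp = tf.keras.Input(shape=(32, 32, 24))
out = inverted_residual(inp, out_channels=24)
block = tf.keras.Model(inp, out)
block.summary()
```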

Table: Accuracy across the five hand signs

The same results are also presented in the detection screenshots below:

Detection results for: a.) Hello b.) Yes c.) No d.) Thank you e.) I love you

Conclusion

The main purpose of a sign language detection system is to provide a feasible means of communication between hearing and non-hearing people through hand gestures. The proposed system can be accessed using a webcam or any built-in camera that detects the signs and processes them for recognition. From the model's output we can infer that, under conditions of controlled light and intensity, the suggested system produces reliable results. Furthermore, custom gestures can easily be added, and more images taken at different angles and in different frames can improve model performance, at the cost of more computational power and memory (GPU). The model can therefore be extended on a large scale by increasing the variety of the dataset. It still has limitations: environmental factors such as low light intensity and uncontrolled backgrounds lower detection accuracy. We will work to overcome these flaws and grow the dataset for more accurate results.

References:

[1] Martin D S 2003 Cognition, Education, and Deafness: Directions for Research and Instruction (Washington: Gallaudet University Press)

[2] McInnes J M and Treffry J A 1993 Deaf-blind Infants and Children: A Developmental Guide (Toronto: University of Toronto Press)

[3] http://www.who.int/mediacentre/factsheets/fs300/en/

[4] Harshith C., Karthik R. Shastry, Manoj Ravindran, M. V. V. N. S. Srikanth, Naveen Lakshmikhanth, "Survey on various gesture recognition techniques for interfacing machines based on ambient intelligence", International Journal of Computer Science & Engineering Survey (IJCSES), Vol. 1, No. 2 (November 2010)

[5] Sakshi Goyal, Ishita Sharma, S. S., "Sign language recognition system for deaf and dumb people", International Journal of Engineering Research & Technology, 2(4) (April 2013)

[6] Chen L, Lin H, Li S (2012) Depth image enhancement for Kinect using region growing and bilateral filter. In: Proceedings of the 21st international conference on pattern recognition (ICPR2012). IEEE, pp 3070–3073

[7] Vaishali S. Kulkarni et al., "Appearance Based Recognition of American Sign Language Using Gesture Segmentation", International Journal on Computer Science and Engineering (IJCSE), 2010

[8] Cheok, M. J., Omar, Z., & Jaward, M. H. (2019). A review of hand gesture and sign language recognition techniques. International Journal of Machine Learning and Cybernetics, 10(1), 131–153

[9] Al-Saffar, A. A. M., Tao, H., & Talab, M. A. (2017, October). Review of deep convolution neural network in image classification. In 2017 International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET) (pp. 26–31). IEEE.

[10] Keiron O'Shea and Ryan Nash, An Introduction to Convolutional Neural Networks (Nov 2015). ResearchGate

[11] https://www.exastax.com/deep-learning/top-five-use-cases-of-tensorflow/

[12] https://en.m.wikipedia.org/wiki/OpenCV

[13] https://github.com/tzutalin/labelImg

[14] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning (Adaptive Computation and Machine Learning series), MIT Press, 2016

[15] Matthijs Hollemans, https://machinethink.net/blog/mobilenet-v2/ (22 April 2018)

P.S. I'm creating Kenyan & Swahili Sign Language image databases. If this sounds like something you'd want to take part in, clone the repo, create a branch, and commit your hand-signal images. Thanks again! Cheers!
