The attention mechanism in computer-vision deep learning is comparable to selective visual attention in humans: its core goal is to select, from a large quantity of information, the data most relevant to the current task. Attention mechanisms are now used extensively in computer vision and have produced outstanding results. By improving the extraction of target features from specific regions of an image, an attention mechanism enriches the target feature information and improves detection accuracy to a certain extent.
Coordinate attention mechanism
Coordinate Attention (CA) is a novel channel attention mechanism designed by Hou et al.[15]. CA embeds position information into channel attention by factorizing the channel attention into two one-dimensional feature encodings along the height and width directions, and then re-aggregating the features along these two spatial directions to produce direction-aware feature vectors. This avoids the brute-force compression of the feature tensor into a single feature vector through two-dimensional global pooling of spatial information, as is done in the squeeze-and-excitation (SE) channel attention mechanism. In light of CA's strong performance and plug-and-play adaptability in object recognition studies, this study introduces CA for feature extraction in the backbone.
CA takes into account not only channel information but also direction-aware spatial information. The horizontal and vertical attention weights obtained indicate whether focal regions are present in the corresponding rows and columns of the feature maps. This encoding locates the position of the target focus more precisely, thereby enhancing the recognition ability of the model.
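To make this structure concrete, a minimal PyTorch sketch of a coordinate attention block is given below. It follows the general design described by Hou et al.[15] (directional average pooling, a shared 1×1 convolution bottleneck, and two sigmoid-gated attention maps); the reduction ratio and layer names are illustrative assumptions, not the exact implementation used in this paper.

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Minimal coordinate attention block (sketch based on Hou et al.)."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool along width  -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool along height -> (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.SiLU()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                        # (B, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)    # (B, C, W, 1)
        y = torch.cat([x_h, x_w], dim=2)            # joint encoding of both directions
        y = self.act(self.bn1(self.conv1(y)))
        x_h, x_w = torch.split(y, [h, w], dim=2)
        x_w = x_w.permute(0, 1, 3, 2)               # back to (B, mid, 1, W)
        a_h = torch.sigmoid(self.conv_h(x_h))       # attention weights along height
        a_w = torch.sigmoid(self.conv_w(x_w))       # attention weights along width
        return x * a_h * a_w                        # re-weight the input feature map
```

In a C3-style module, such a block would simply be applied to the feature map produced by the bottleneck branch before concatenation.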
Self-attention mechanism in Transformer
Since it was proposed, the Transformer module built around the self-attention mechanism has produced remarkable results in natural language processing (NLP). In 2020, the Transformer structure was applied to vision tasks, and the DETR (DEtection TRansformer) network model was the pioneering effort in target detection[16]. The transformer encoder block improves the capacity to capture diverse local information. Positional embedding, the Encoder, and the Decoder are the three components of the Transformer model. Following the Encoder structure of DETR, this paper introduces multi-head attention into the YOLO backbone: the Transformer block replaces the bottleneck blocks of the C3 structure in the final C3 module of the New CSP-Darknet53 convolutional network. This layer has the largest number of channels and the richest semantic features, allowing it to capture a wealth of global and contextual information.
The self-attention mechanism of the Transformer structure used in this paper is adapted from the Encoder layer of the Transformer module designed for DETR. The two sub-layers, Multi-Head Attention and the Feed-Forward Network (FFN), together with their residual connections, are preserved, but the normalization step of the original structure is omitted.
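For illustration, a simplified PyTorch sketch of such a block (multi-head attention and FFN sub-layers with residual connections and no normalization layer, in the spirit of the C3TR-style design) is shown below; the head count and the learnable positional embedding are assumptions rather than the exact configuration used here.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One encoder layer: multi-head self-attention + FFN, residual links, no normalization."""
    def __init__(self, c, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=c, num_heads=num_heads)
        self.fc1 = nn.Linear(c, c)
        self.fc2 = nn.Linear(c, c)

    def forward(self, x):                 # x: (sequence, batch, channels)
        x = self.attn(x, x, x)[0] + x     # residual around self-attention
        x = self.fc2(self.fc1(x)) + x     # residual around feed-forward network
        return x

class TransformerBlock(nn.Module):
    """Stacked transformer layers that can stand in for the bottlenecks of a C3 module."""
    def __init__(self, c, num_layers=1, num_heads=4):
        super().__init__()
        self.pos_embed = nn.Linear(c, c)  # learnable positional embedding (assumption)
        self.layers = nn.Sequential(*(TransformerLayer(c, num_heads) for _ in range(num_layers)))

    def forward(self, x):                 # x: (B, C, H, W) feature map
        b, c, h, w = x.shape
        p = x.flatten(2).permute(2, 0, 1)         # -> (H*W, B, C) token sequence
        p = p + self.pos_embed(p)                 # add position information
        p = self.layers(p)
        return p.permute(1, 2, 0).reshape(b, c, h, w)
```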
Improvement of loss function based on α-CIoU
The localization loss function used in backpropagation is crucial for the bounding box regression that iteratively updates the target location parameters. Intersection over Union (IoU) loss is the most classic loss function for bounding box regression, but it has obvious drawbacks. For instance, the IoU value alone does not describe how the candidate box and the ground-truth (GT) box are aligned: the same IoU can correspond to candidate boxes that enclose the target very differently. More severely, when IoU = 0 the candidate box and the GT box do not intersect, Loss(IoU) = 1, the gradient of the IoU loss vanishes, and multiple random matchings are needed before an intersection is produced. All of these factors slow model convergence and decrease detection precision. To increase accuracy, the loss function is modified on the basis of IoU by introducing terms for the overlap region, the center-point distance, the width and height of the candidate box, and normalization.
In this paper, following He et al.[17], CIoU is improved by introducing the parameter α to adjust the power of the IoU term and by adding a power regularization term, in line with the general form of α-IoU. α-CIoU is obtained by exponentiating CIoU (Eqn 1), and the corresponding loss function is shown in Eqn 2.
$ \alpha\text{-CIoU}=\text{IoU}^{\alpha}+\dfrac{\rho^{2\alpha}(b,b^{\text{gt}})}{c^{2\alpha}}+(\beta v)^{\alpha} $ (1)

$ L_{\alpha\text{-CIoU}}=1-\text{IoU}^{\alpha}+\dfrac{\rho^{2\alpha}(b,b^{\text{gt}})}{c^{2\alpha}}+(\beta v)^{\alpha} $ (2)

where,
$ \rho(b,b^{\text{gt}}) $ represents the Euclidean distance from the center point of the prediction box to the center point of the target box. $ b $ and $ b^{\text{gt}} $ respectively represent the center points of the two boxes. $ c $ is the diagonal distance of the smallest enclosing rectangle covering the candidate box and the ground-truth box. The shape factor is measured from the respective aspect ratios of the candidate box and the ground-truth box: $ v $ is a consistency parameter measuring the aspect ratio, and $ \beta $ is a positive trade-off parameter. The α-CIoU loss function retains the fundamental properties of IoU-type losses, including non-negativity, identity of indiscernibles, symmetry, and the triangle inequality. In addition, as the model is trained, the α-CIoU position loss continuously decreases toward 0. Owing to adaptive relative loss reweighting and adaptive relative gradient reweighting, the learning rate is continuously adjusted so that simple targets are learned faster over time; when learning challenging targets at a later stage, training speed is improved by increasing the loss and gradient weights of high-IoU targets. As an adjustable parameter, α provides flexibility for achieving varying levels of Bounding Box (BBox) regression accuracy when training the target model. According to previous experiments, the effect of α is not overly sensitive to the choice of model or dataset. When 0 < α < 1, the loss and gradient weights of high-IoU targets are reduced, so the final target localization effect is poor. When α > 1, the relative loss and gradient weights of high-IoU targets increase, so that these targets receive more attention and the regression gradient at high IoU grows, thereby speeding up training and improving BBox regression accuracy. Experiments indicate that when α is set to 3, the performance on multiple datasets is consistently good[17]; therefore, α is set to 3 in this paper.
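As a concrete reading of Eqn (2), a hedged PyTorch sketch of the α-CIoU loss for axis-aligned boxes is given below; the (x1, y1, x2, y2) box format, the eps constant, and the function name are assumptions made for the example.

```python
import math
import torch

def alpha_ciou_loss(pred, target, alpha=3.0, eps=1e-7):
    """L_alpha-CIoU = 1 - IoU^a + (rho^2/c^2)^a + (beta*v)^a (Eqn 2); boxes as (x1, y1, x2, y2)."""
    # Intersection and union
    x1 = torch.max(pred[..., 0], target[..., 0])
    y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2])
    y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    w1, h1 = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w2, h2 = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union

    # Squared centre distance rho^2 and squared diagonal c^2 of the smallest enclosing box
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    rho2 = ((pred[..., 0] + pred[..., 2] - target[..., 0] - target[..., 2]) ** 2 +
            (pred[..., 1] + pred[..., 3] - target[..., 1] - target[..., 3]) ** 2) / 4

    # Aspect-ratio consistency v and trade-off parameter beta, as in CIoU
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    with torch.no_grad():
        beta = v / (1 - iou + v + eps)

    return 1 - iou ** alpha + (rho2 / c2) ** alpha + (beta * v) ** alpha
```

Setting alpha=1 recovers the ordinary CIoU loss, which is a convenient sanity check for the implementation.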
Improvement of training strategy based on transfer learning
In this paper, the initial training weights are taken from the YOLOv5s model pretrained on the MS COCO (Microsoft Common Objects in Context) dataset. Although MS COCO contains no smoke or fire samples, both tasks identify targets in an image by learning labeled image targets, and the early image processing steps are similar. This approach is homogeneous transfer learning from the perspective of the source and target domains, inductive transfer learning from the perspective of the label setting, and feature-based transfer learning from the perspective of the transfer method. Employing this strategy increases the generalization of the model across the 80 object categories of MS COCO, drastically reduces the time and labor cost of data labeling, generates a rough model for fire target detection, and lays the groundwork for automatic labeling.
At the same time, the rough model is used to run inference on a large number of unlabeled photos and to record the class and location of each target; each predicted rectangular box is then reviewed and corrected to collect new fire and smoke data for the original rough model. The rough target recognition model is retrained, and this cycle is iterated to obtain progressively more precise fire target recognition models. Using instance-based transfer learning to improve the ability to identify fire and smoke, the size of the training dataset is gradually increased while the cost of each individual training round is decreased. Similarly, the instance-based training approach can be used to further strengthen fire recognition in a specialized context, in accordance with the scenario's characteristics.
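A minimal sketch of this semi-automatic labeling loop is shown below, assuming the rough model is loaded through the public Ultralytics YOLOv5 hub interface and that its normalized predictions are written out as YOLO-format label files for later manual correction; the weight path, confidence threshold, and folder names are illustrative assumptions.

```python
from pathlib import Path
import torch

# Load the current "rough" fire/smoke model (weight path is an assumption)
model = torch.hub.load('ultralytics/yolov5', 'custom', path='rough_fire_model.pt')
model.conf = 0.4  # keep only fairly confident pseudo-labels

image_dir, label_dir = Path('unlabeled_images'), Path('pseudo_labels')
label_dir.mkdir(exist_ok=True)

for img_path in sorted(image_dir.glob('*.jpg')):
    results = model(str(img_path))
    # xywhn: per-image tensor of (x_center, y_center, w, h, conf, class), normalized to [0, 1]
    rows = results.xywhn[0].tolist()
    lines = [f"{int(r[5])} {r[0]:.6f} {r[1]:.6f} {r[2]:.6f} {r[3]:.6f}" for r in rows]
    (label_dir / (img_path.stem + '.txt')).write_text('\n'.join(lines))

# The generated .txt files are then reviewed and corrected before retraining the rough model.
```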
This study conducts all experiments on a Dell Precision 7920 Tower server (Desktop-9PVCQ4). The server is equipped with an Intel Xeon(R) Gold 5218 processor, an NVIDIA GeForce RTX 1650 graphics processing unit, and 32 GB of memory. The Python 3.9, PyTorch 1.9.0, and CUDA 1.8.0 environments are configured with Anaconda and the PyCharm IDE. Neural network model development is supported by the PyTorch framework.
YOLOv5 and five optimized algorithms, namely YOLOv5 + CAC3, YOLOv5 + TRC3, YOLOv5 + αCIOU, YOLOv5 + CA + αCIOU, and YOLOv5 + TR + αCIOU, are compared in terms of detection accuracy and speed. The YOLOv5 algorithm is YOLOv5's lightweight base model. YOLOv5 + CAC3 enhances the backbone of the base model by embedding the coordinate attention (CA) mechanism into the C3 module. YOLOv5 + TRC3 enhances the backbone of the base model by embedding the self-attention mechanism of the Transformer structure into the C3 module. YOLOv5 + αCIOU is the base model with Alpha-CIOU as the location loss function. YOLOv5 + CA + αCIOU optimizes the base model by employing both the CAC3 module with the coordinate attention mechanism and the Alpha-CIOU location loss function. YOLOv5 + TR + αCIOU optimizes the base model by employing both the TRC3 module with the self-attention mechanism of the Transformer structure and the Alpha-CIOU location loss function.
Dataset
Dataset source
According to the objective of the experiments, the dataset is divided into three parts: a training dataset (train), a validation dataset (val), and a test dataset (test). The training dataset is used to train the model, the validation dataset is used to evaluate performance during training, and the test dataset is used to evaluate the model's efficacy, precision, and generalizability.
The training dataset (1,971 images) used in this article is collected from experiments, public datasets, and Internet photographs and videos of fire and smoke. All photos are randomly split into a training dataset and a validation dataset at a ratio of 8:2, with 1,577 images assigned to the training dataset and 394 to the validation dataset. The dataset consists of fire and smoke photos with various shapes, distances, and interfering objects, captured from various burning objects (buildings, automobiles) and scenarios. The photographs depict both indoor and outdoor scenes, such as homes, offices, and factories, as well as mountains, forests, and roads.
The test dataset is BoWFire from Chino et al.[18], one of the most reputable open-source datasets for fire detection. BoWFire contains 226 images, comprising 119 fire-related images and 107 non-fire images, in which 325 fire targets and 153 smoke targets are labeled.
Data labeling
Target detection with a deep learning network requires prior annotation of the training datasets to provide the ground-truth (GT) boxes. The labeling tool LabelImg is used to manually label the image dataset. During labeling, uniformity and precision of the labels must be ensured.
Targets are selected using horizontal bounding boxes (HBB) in the YOLO dataset format. The upper-left corner of the original image is (0,0), the horizontal direction is the X-axis, the vertical direction is the Y-axis, and normalization places the lower-right corner at (1,1) (as shown in Fig. 1). Each label line contains five variables (class, bx, by, W, H), where bx and by are the normalized coordinates of the HBB center point, and W and H are the normalized width (difference between the left and right edges) and height (difference between the top and bottom edges) of the box. After labeling is complete, a .txt file is created to store the image's category and location information.
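As a simple illustration of this normalization (not part of the authors' labeling pipeline), a pixel-space bounding box can be converted into a YOLO-format label line as follows; the class index, box coordinates, and image size are made-up example values.

```python
def to_yolo_label(cls_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space HBB to the normalized (class, bx, by, W, H) YOLO format."""
    bx = ((x_min + x_max) / 2) / img_w   # normalized centre x
    by = ((y_min + y_max) / 2) / img_h   # normalized centre y
    w = (x_max - x_min) / img_w          # normalized width
    h = (y_max - y_min) / img_h          # normalized height
    return f"{cls_id} {bx:.6f} {by:.6f} {w:.6f} {h:.6f}"

# Example: a fire box (class 0) of 200 x 150 px at (320, 240) in a 1280 x 720 image
print(to_yolo_label(0, 320, 240, 520, 390, 1280, 720))
```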
Evaluation index
Precision (P) is the proportion of predicted positives that are actually positive. It reflects the model's ability to avoid misidentifying negative samples as positive. Recall (R) is the proportion of real targets that are correctly predicted out of the total number of real targets. It represents the ability to find the real targets.
$ P=\dfrac{\text{TP}}{\text{TP}+\text{FP}} $ (3)

$ R=\dfrac{\text{TP}}{\text{TP}+\text{FN}} $ (4)

where TP refers to true positive, FP denotes false positive, FN is false negative, and TN is true negative (as shown in Table 1). TP + FP is the number of all predicted boxes, and TP + FN is the number of all real targets. The confusion matrix represents the relationship between the predicted positive/negative and the true positive/negative of each sample[19−21].
Table 1. Confusion matrix.
                               True value: Positive (real target)    True value: Negative (non-target)
Predicted value: Positive      True Positive (TP)                    False Positive (FP)
Predicted value: Negative      False Negative (FN)                   True Negative (TN)

F-measure is the weighted harmonic mean of P and R, providing a single score that balances precision and recall. In general, R is inversely related to P, so it is difficult to obtain high R and high P simultaneously. The higher the F value, the more effective the detection.
$ F_{\beta}=\dfrac{\left(1+\beta^{2}\right)\times P\times R}{\beta^{2}\times P+R} $ (5)

$ \beta=1 $ is normally used, i.e.,

$ F_{1}=\dfrac{2\times P\times R}{P+R} $ (6)

Average Precision (AP) is the area under the precision-recall (P-R) curve, which reflects the overall performance of the detection model and overcomes the single-point limitation of P, R, and the F-measure. The closer AP is to 1, the better the detection effect.
$ \text{AP}={\int}_{0}^{1}P\left(r\right)dr $ (7)

Mean average precision (mAP) is the average of the APs over the various classes and reflects the overall performance of a multi-category detection model.
$ \text{mAP}=\dfrac{\sum \text{AP}}{\text{Num}\left(\text{class}\right)} $ (8)

mAP@0.5 is the mAP at an IoU threshold of 0.5, and mAP@0.5:0.95 is the mean mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05[22−24].
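To make Eqns (3)−(8) concrete, a minimal sketch that computes P, R, F1, and AP (by numerically integrating a precision-recall curve) is given below; it assumes detections have already been matched to ground truth at a fixed IoU threshold, and the padding of the curve end points is a simplifying assumption rather than the exact evaluation protocol used here.

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0       # Eqn (3)
    r = tp / (tp + fn) if tp + fn else 0.0       # Eqn (4)
    f1 = 2 * p * r / (p + r) if p + r else 0.0   # Eqn (6)
    return p, r, f1

def average_precision(precisions, recalls):
    """Eqn (7): area under the P-R curve, integrated over recall (trapezoid rule)."""
    order = np.argsort(recalls)
    # Pad the curve so it spans recall 0..1 (simplifying assumption)
    r = np.concatenate(([0.0], np.asarray(recalls)[order], [1.0]))
    p = np.concatenate(([1.0], np.asarray(precisions)[order], [0.0]))
    return float(np.trapz(p, r))

print(precision_recall_f1(tp=86, fp=14, fn=40))              # toy numbers
print(average_precision([0.9, 0.8, 0.7], [0.2, 0.5, 0.7]))   # toy P-R points
```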
The detection rate is measured in frames per second (FPS), which shows the number of frames (images) that the target recognition model identifies each second. When the detection rate is greater than 30 FPS, it is deemed to have reached the real-time detection level, as the video processing rate is typically at least 30 FPS.
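A minimal way to estimate this rate for a trained model (a sketch, not the measurement code used in this paper) is to time repeated forward passes:

```python
import time
import torch

def measure_fps(model, img_size=640, n_runs=100, device='cuda'):
    """Estimate detection rate as processed frames per second of wall-clock time."""
    device = device if torch.cuda.is_available() else 'cpu'
    model = model.to(device).eval()
    dummy = torch.zeros(1, 3, img_size, img_size, device=device)
    with torch.no_grad():
        for _ in range(10):                      # warm-up iterations
            model(dummy)
        if device == 'cuda':
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            model(dummy)
        if device == 'cuda':
            torch.cuda.synchronize()
    return n_runs / (time.perf_counter() - start)
```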
The comprehensive detection capability of each algorithm is measured using the five indices described previously, together with the model size. Table 2 displays the results. To visualize the effect of each algorithm, histograms (Figs 2 & 3) are drawn from the precision and recall statistics in Table 2, and the precision-recall curves of the six fire detection algorithms are shown in Fig. 4. The YOLOv5 + TRC3 algorithm achieves the highest fire and overall average detection precision, as well as the second-highest smoke detection precision. YOLOv5 + TR + αCIOU achieves the best precision in smoke detection and the second-highest precision in fire and overall average detection. This indicates that the self-attention mechanism based on the Transformer structure can improve detection precision to some degree. The average recall rate of fire detection of the YOLOv5 + TR + αCIOU algorithm is the highest, at 68.5%, clearly outperforming the other algorithms. In addition, the average recall rate of fire detection of the optimized algorithms is significantly improved through the use of the improved Alpha-CIOU loss function.
Table 2. Experimental results based on YOLOv5 and 5 optimization algorithms.
Model                  Class   P       R       mAP@0.5   F1     FPS (frames/s)   Weight (MB)
YOLOv5                 All     0.778   0.540   0.641     0.64   64.1             14.5
                       Fire    0.859   0.655   0.764
                       Smoke   0.696   0.426   0.518
YOLOv5 + CAC3          All     0.776   0.585   0.653     0.66   72.5             13.8
                       Fire    0.829   0.688   0.774
                       Smoke   0.722   0.481   0.531
YOLOv5 + TRC3          All     0.855   0.581   0.697     0.69   54.6             14.5
                       Fire    0.903   0.699   0.797
                       Smoke   0.806   0.463   0.597
YOLOv5 + αCIOU         All     0.774   0.583   0.651     0.66   61.3             14.5
                       Fire    0.818   0.667   0.765
                       Smoke   0.729   0.500   0.538
YOLOv5 + CA + αCIOU    All     0.727   0.614   0.673     0.67   60.6             13.8
                       Fire    0.832   0.710   0.794
                       Smoke   0.622   0.519   0.553
YOLOv5 + TR + αCIOU    All     0.839   0.685   0.724     0.70   58.8             14.5
                       Fire    0.860   0.710   0.806
                       Smoke   0.818   0.500   0.641

Figure 5 demonstrates that YOLOv5 + TR + αCIOU achieves the highest mAP (0.724), which is considerably greater than that of the other algorithms. In addition, the mAP of the algorithms employing the Alpha-CIOU location loss function is markedly enhanced. This demonstrates that the Alpha-CIOU function helps the algorithm further balance precision and recall, hence directly enhancing its detection capability.
As illustrated in Figs 6 & 7, each of the five optimized algorithms improves the F1 of fire detection. The comparison reveals that, apart from the TRC3 module, the enhancement from the other algorithms is less notable. The two algorithms with the largest F1 both employ the attention mechanism based on the Transformer structure, demonstrating the superiority of this attention mechanism.
The maximum mAP and F1 are attained by the YOLOv5 + TR + αCIOU algorithm, which adopts both the attention mechanism based on the Transformer structure to improve the backbone network and Alpha-CIOU to improve the location loss function. Precision P and recall R are significantly improved by applying the self-attention module based on the Transformer structure. In addition, the results suggest that the Alpha-CIOU-based location loss function improves the recall rate.
Figure 8 visualizes the fire detection results of the various algorithms, allowing the detection effects on different types of targets to be perceived directly. The results show that YOLOv5 has a markedly weaker ability to recognize smoke. Adding the attention mechanism increases the recall of smoke targets and enhances the detection effect.
Figure 9 shows the capacity of the six algorithms to recognize a single overturned flame. The results indicate that all algorithms can effectively recognize a single, simple flame target. However, both the YOLOv5 algorithm and the YOLOv5 + CAC3 algorithm misidentify clouds as smoke. This also demonstrates that enlarging the receptive field of the features and supplementing the feature information via the attention mechanism can further improve the discrimination of target objects.
Figure 10 demonstrates the capacity of the six algorithms to recognize various fire and smoke targets. The optimization algorithms using Alpha-CIOU effectively improve the target recall rate, in line with the conclusions drawn from the experimental comparison. Both the YOLOv5 and YOLOv5 + CAC3 algorithms fail to detect the smoke in the middle of the image. With Alpha-CIOU, however, all targets are recognized (regardless of target size and type), and the detection boxes delineate the fire targets more completely.
Because the pixel-level features of smoke are less distinct than those of flame, the precision and recall for smoke are much lower than those for flame, as demonstrated by the preceding experiments. Furthermore, smoke varies in appearance and has no fixed shape, and the smoke concentration and type of combustible can significantly change its color characteristics. Smoke detection can therefore only be used as an a priori cue for rapid fire detection: detected smoke cannot by itself serve as a criterion for determining that a fire has occurred, which requires additional manual investigation and verification.
The self-attention mechanism based on the Transformer structure has a considerable impact on improving target detection precision, as demonstrated by the experimental findings above. Updating the location loss function to Alpha-CIOU increases the recall rate and, to some extent, ensures the detection of a variety of targets. By combining the self-attention mechanism of the Transformer structure with the Alpha-CIOU location loss function, the algorithm proposed in this research increases target detection capability, but at the expense of computation and speed. By comparison, the coordinate attention (CA) module makes the model lighter, and the addition of Alpha-CIOU retains a relatively high level of detection capability. To balance the trade-off between detection accuracy and speed, different algorithms can be chosen for real applications according to their suitability.
Detection effect of the YOLOv5 + TR + αCIOU algorithm on fire images
Following the comparison between YOLOv5 and the five optimized algorithms, the YOLOv5 + TR + αCIOU algorithm demonstrates the best performance. On the BoWFire test dataset, the precision P reaches 83.9%, the recall rate reaches 68.5%, the mAP@0.5 is 0.724, and F1 reaches its maximum value of 0.70 at a confidence threshold of 0.214. Thus, YOLOv5 + TR + αCIOU is finally adopted in this paper. Figure 11 depicts the detection ability of the YOLOv5 + TR + αCIOU algorithm on difficult photos from the BoWFire dataset. The difficulties include objects at different distances in the same image, overturned objects, and the presence of smoke-like interfering substances such as cloud, fog, and water mist. Figure 12 depicts the effects of fire detection in various scenarios.
Figure 12. Fire detection effects of the YOLOv5 + TR + αCIOU algorithm for different types of scenarios.
Detection effect of the YOLOv5 + TR + αCIOU algorithm on fire video
To further investigate the algorithm's ability to detect fires in real time, a factory fire surveillance video is acquired from the Internet. Forty-eight seconds of video are converted into 1,476 frames using the Python OpenCV package. The fire and smoke detection results are depicted in Figs 13−15. Manual examination of the video reveals that the fire starts in the eighth second (around frame 229), when the tank begins to release a small amount of smoke. The surveillance footage, however, shows virtually no visible change, and neither the human eye nor the YOLOv5 + TR + αCIOU algorithm can detect the smoke leaking at this moment.
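A hedged sketch of this frame-extraction step with OpenCV is shown below; the video file name and output folder are assumptions.

```python
import cv2
from pathlib import Path

out_dir = Path('video_frames')
out_dir.mkdir(exist_ok=True)

cap = cv2.VideoCapture('factory_fire.mp4')   # assumed file name
fps = cap.get(cv2.CAP_PROP_FPS)              # ~30.75 fps for 1,476 frames in 48 s
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:                                # end of video
        break
    cv2.imwrite(str(out_dir / f'frame_{idx:05d}.jpg'), frame)
    idx += 1
cap.release()
print(f'extracted {idx} frames at {fps:.2f} fps')
```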
The flame starts to appear in the 9th second, and the algorithm also detects the fire in the 9th second (frame 279, as shown in Fig. 14), which shows that the YOLOv5 + TR + αCIOU algorithm can detect the fire in time. From the 9th second the fire gradually intensifies, and by the 14th second the exterior layer of the tank is completely engulfed; at this stage the fire has reached full combustion and is no longer spreading as swiftly as before. The video shows that at the 23rd second an employee discovers the fire and begins extinguishing it with a fire extinguisher.
In the fully developed stage of the factory fire shown in Fig. 15, the employee is unable to extinguish the fire with conventional fire extinguishers. Black smoke develops at the 26-second mark, indicating the production of unburned solid particles. Owing to the excess fuel and insufficient oxygen supply, the fire transitions from oxygen-rich combustion to fuel-rich combustion.
In summary, the YOLOv5 + TR + αCIOU algorithm recognizes smoke in the 8th second and fire in the 9th second. The confidence of the fire target reaches 95% in the 9th second, the prediction box encloses the fire entirely, and the fire is identified. The results show the feasibility and promise of real-time fire detection based on the YOLOv5 + TR + αCIOU algorithm.
The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.
About this article
Cite this article
Shao Z, Lu S, Shi X, Yang D, Wang Z. 2023. Fire detection methods based on an optimized YOLOv5 algorithm. Emergency Management Science and Technology 3:11 doi: 10.48130/EMST-2023-0011
Fire detection methods based on an optimized YOLOv5 algorithm
- Received: 23 May 2023
- Accepted: 15 September 2023
- Published online: 27 September 2023
Abstract: Computer vision technology has broad application prospects in the field of intelligent fire detection, with the benefits of accuracy, timeliness, visibility, adjustability, and multi-scene adaptability. Traditional computer vision algorithms suffer from false detections, missed detections, poor precision, and slow detection speed. In this paper, the efficient and lightweight YOLOv5s model is used to detect fire flames and smoke. The attention mechanism is embedded into the C3 module to enhance the backbone network and maximize the algorithm's suppression of invalid feature data. Alpha-CIOU is adopted to improve the location loss function and target detection. At the same time, the concept of transfer learning is used to realize semi-automatic data annotation, which reduces training expenses in terms of manpower and time. Comparative experiments on six distinct fire detection algorithms (YOLOv5 and five optimized algorithms) are carried out. The results indicate that the self-attention mechanism based on the Transformer structure has a substantial impact on enhancing target detection precision. The improved location loss function based on Alpha-CIOU helps enhance the detection recall rate. The average recall rate of fire detection of the YOLOv5 + TR + αCIOU algorithm is the highest, at 68.5%, clearly outperforming the other algorithms. Based on surveillance video, this optimized algorithm is used to detect a fire in a factory, and the fire is detected in the 9th second, when it starts to appear. The results demonstrate the algorithm's viability for real-time fire detection.
Key words:
- Fire detection
- YOLOv5
- Target detection
- Attention mechanism
- Alpha-CIOU