
Query-Specific Distance and Hybrid Tracking Model for Video Object Retrieval

C.A. Ghuge, Sachin D. Ruikar and V. Chandra Prakash
Published/Copyright: November 17, 2016

Abstract

In the area of modern intelligent systems, retrieving objects from video is still a challenging task because retrieval is hampered by object confusion, similar appearance among objects, pose variation, small object size, and interactions among multiple objects. To overcome these challenges, video objects are retrieved based on the trajectory points of the multiple moving objects. However, when an object is occluded, the trajectory points calculated from it are considerably distorted. To address these problems, we propose a query-specific distance and hybrid tracking model for video object retrieval. To verify the performance of the proposed method, five videos were collected from the CAVIAR dataset. The proposed tracking process was applied to these five videos, and the performance was analysed using parameters such as precision, recall, and f-measure. The results show that the proposed hybrid model attained a higher f-measure of 76.7% than existing tracking models such as the nearest neighbourhood algorithmic model and the spatial-exponential weighted moving average model.

1 Introduction

Due to the visual manifestations of objects [3, 9, 11, 21, 25], such as illumination, position, and occlusions [26], object recognition in an image database is one of the major challenges for video object retrieval methods. Moreover, object retrieval from video is easily affected by camera motion and the movement of the objects. Because of occlusion and motion, the shape and size of objects change easily [27], which makes it difficult to predict the structure of the corresponding object accurately from the video. Accordingly, significant differences occur in the structure of the same object at different times [10]. Video objects can then be recognised by their shape using content-based video object retrieval [2]. However, matching techniques that rely on similarity measures and shape features to identify the user-relevant object remain a difficult problem for many researchers [13, 24].

In recent times, many video object retrieval methods have been developed by various researchers; these are summarised in the literature review. Conventional video object detection algorithms [11] characterise spatially interconnected objects through their neighbourhood trajectories [4, 5]. Video object retrieval based on trajectory extraction provides an accurate prediction of moving objects from the video, so effective retrieval depends mainly on the travelling path of the corresponding objects. While extracting objects from the video, matching the trajectory path to the query input raises a number of problems, such as object motion and similar appearances. Moreover, the trajectory path in the video depends on timing information, which introduces ambiguity. For example, if two persons travel along a similar spatial route at different speeds, specifying the trajectory path from the query input may produce improper results. Many researchers have dealt with video object retrieval using different tracking methods [1, 7, 9, 11, 14, 15, 16, 19, 20].

The major purpose of this research is to retrieve objects from video using trajectory points. Here, the trajectory path of the object is tracked using the proposed hybrid model, which combines the neighbourhood search algorithm (NSA) [28] and spatial-exponential weighted moving average (spatial-EWMA) [22] models. The proposed spatial-EWMA-based tracking mechanism considers the spatial location to track a particular object, and the NSA model uses pixel intensities to track objects from the video. The proposed tracking model is then applied to the video retrieval system [17]. Overall, the input videos are passed to a feature extraction step that constructs the feature library. The features capture two kinds of information: the first relates to the pixel contents of the objects, and the second relates to the movement behaviour of the objects. These two levels of information are stored in the feature library. Then, for a video query, the features are matched against the feature library to find the videos relevant to the user. The main contributions of this paper are as follows.

Contribution 1: In this paper, a hybrid model is used to track video objects effectively, with the tracking performed by two models: the NSA model and the spatial-EWMA model. In the NSA model, pixel-based information is collected to track multiple objects in the video; in the spatial-EWMA model, the video objects are tracked using location-based information. The results of the NSA and spatial-EWMA models are then integrated to generate the final tracked object results. Thus, the hybrid model tracks multiple objects in the video with high precision.

Contribution 2: The trajectory points of fast-moving objects do not group properly during object retrieval. To overcome this problem, the query-specific distance (QSD) is proposed, which calculates the distance between the trajectory points of the query image and those of the tracked object image to retrieve objects from the video.

The rest of the paper is organised as follows: Section 2 presents a literature review and the problem statement of the video object retrieval process. Then, the proposed technique for video object retrieval using spatial-EWMA-based trajectory identification is described in Section 3. Experimental setup and performance analysis are shown in Section 4. Finally, in Section 5, concluding remarks are given.

2 Literature Review

In this section, we review several video object retrieval methods and their significance. Many researchers have dealt with video object retrieval using different tracking methods. Lai and Yang [20] developed a video retrieval system in which, to increase the success rate, all objects in the database are pre-processed to identify potential moving objects and their associated motion trajectories. However, tracking efficiency degrades under lighting variations and occlusions. To obtain a more precise pixel-level foreground segmentation, Cao et al. [7] developed a shortest path algorithm. However, the shortest path algorithm does not perform well in some vision applications such as video retrieval. Arroyo et al. [1] developed a tracking method based on the linear sum assignment problem (LSAP), which provides an efficient tracking model for identifying people in video frames over time. However, the LSAP method does not consider processing time, and it can manage only a reduced number of cameras in real time. To detect and classify video objects simultaneously from a video clip, spatial-temporal sampling methods are used.

Chuang et al. [11] developed spatial-temporal sampling methods, whose video object detection performance degrades when multiple video objects must be detected and the data size is large. Moreover, Gong and Caldas [16] proposed a contextual reasoning-based video interpretation method, which is affected by parameters such as extensional and intentional inference, as well as causal, spatial, and temporal inference. Gómez-Romero et al. [15] proposed an ontology-based contextual reasoning video interpretation method [12], which is used to obtain a high-level interpretation of the scenario. To provide more flexibility through a non-linear transformation, Gómez-Conde and Olivieri [14] developed two methods: the differential geometric trajectory cloud method (DGCT) and kernel principal component analysis (KPCA). However, both KPCA and DGCT are difficult to apply to arbitrarily complex human actions in a video shot. Finally, Cheng and Hwang [9] proposed retrieving video objects using Kalman filtering [23], which offers high tracking accuracy and computational simplicity. However, Kalman filtering is only applicable to linear state transitions. To overcome the above drawbacks, we present a technique for video object retrieval using spatial-EWMA-based trajectory identification in this paper.

2.1 Problem Definition

In video retrieval, visual information normally plays a very important role, especially when there is no textual description of the target object. On the other hand, the video retrieval process must deal with complex object appearances, deformations, occlusions, and lighting, and thus may not provide satisfactory results.

In addition, other difficulties to be faced in retrieval arise as a result of the imperfect functioning of video cameras, the wide range of possible scenes, and the necessity of combining data acquired by different sensors.

Another challenge in video object retrieval is the key-object representation of the video object, which should signify the temporal boundaries among key objects.

In Ref. [20], video object retrieval is carried out after extracting the trajectory points using trajectory clustering. The grouping of trajectory points may pose a challenging issue if the background of the videos is not properly extracted due to noise or the presence of fast-moving objects. This affects the trajectory clustering heavily.

Also, the appearance model utilised in Ref. [20] does not use any machine learning mechanism for finding user-relevant videos, even though machine learning offers a straightforward way to capture the user's intention from the user's prior behaviour.

Basically, the main challenges of the video object retrieval process come from several frequent conditions, such as similar appearance among objects, pose variation, small object size, and interactions among multiple objects, which make it difficult to retrieve the user-relevant object from the video. Another challenge in video object retrieval relates to the trajectory points: grouping trajectory points becomes problematic if the background of the videos is not properly extracted due to noise or the presence of fast-moving objects.

3 Proposed Methodology: Design and Development of a Technique for Video Object Retrieval Using Spatial-EWMA-Based Trajectory Identification

The steps involved in our proposed video object retrieval process can be described as follows. (i) First, read the input video. (ii) Then, extract the object using the nearest neighbourhood algorithm with respect to texture pixels and neighbour pixels. (iii) Apply the spatial-EWMA-based trajectory identification method to the extracted frame to track the object. (iv) Integrate the results of both the NSA and spatial-EWMA-based methods to form the hybrid tracking model. (v) Then, the user supplies a query image to retrieve objects from the video. (vi) Calculate the QSD between the hybrid model output and the tracked video object related to the query input. (vii) Finally, the retrieved object result from the input video is obtained.

3.1 Finding the Objects from Key Frame Using the Nearest Neighbourhood Algorithm

First, the key frame is extracted from the input video. After extracting the key frame, the objects present within the frame should be detected. The object areas are detected using the nearest neighbourhood algorithm, which segments the object area from the key frame. The main advantage of the neighbourhood search algorithm is that it compensates for noisy pixels, which reduces the computational overhead. Basically, a large set of data is applied to feature extraction, in which the large amount of data is reduced before being applied to the processing algorithm. Before extracting features, the input video is read and frames are extracted as v_t, where v_t denotes the tth frame of the input video, represented as follows:

(1) \quad V = \{ v_t \mid 1 \le t \le n \},

where V is the input video and v_t is its tth frame. The pixel values of each input frame can be expressed as follows:

(2) \quad v_t = \{ v_t^{pq} \mid 1 \le p \le N,\ 1 \le q \le M \},

where p indexes the pixel columns, varying from 1 to N, and q indexes the pixel rows, varying from 1 to M. The information collected from the pixels is used to predict the object from the input video. However, pixel-based prediction alone is not always sufficient for accurate prediction. To improve detection performance, a location-based method is used together with the pixel-based method in this paper.
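Equations (1) and (2) simply index the video as a set of pixel grids. The sketch below reads a video file into that representation; it assumes Python with OpenCV (the paper itself reports a Matlab implementation, so this is only an illustrative equivalent). Frame t becomes an M x N array whose element [q-1, p-1] corresponds to v_t^{pq}.

```python
import cv2

def read_frames(path, max_frames=None):
    """Read a video into the frame set V = {v_t | 1 <= t <= n} (Eq. 1).

    Each returned frame is an M x N grayscale array of pixel values
    v_t^{pq} (Eq. 2), with rows indexed by q and columns by p.
    """
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        if max_frames is not None and len(frames) >= max_frames:
            break
    cap.release()
    return frames
```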

3.1.1 Nearest Neighbourhood Algorithm

This section presents the nearest neighbourhood search algorithm [28] used to track multiple objects. The steps are listed below; a code sketch follows the list.

  1. Using the template function, the reference point is fixed in the input key frame, as shown below:

    (3) \quad R_k^t \leftarrow T_f(l, m), \quad 1 < k < O,

    where T_f is the template function, R_k^t is the reference point of the key frame for the kth object at frame t, and O is the number of objects.

  2. Based on the above key reference point, the reference point of the (t+1)th frame is extracted, which can be represented as follows:

    (4) \quad R_k^{t+1}(l, m) = f(R_k),

    where R_k^{t+1} is the reference point of the kth object at the (t+1)th frame, and f(R_k) is the key reference point function.

  3. Calculate the Euclidean distance between the key frame and extracted new frame based on the pixel representation.

    (5) \quad ED(R_k^t, R_k^{t+1}) = \sqrt{ \sum_{i=1}^{n} \left( R_k^t - R_k^{t+1}(l, m) \right)^2 },

    where ED(R_k^t, R_k^{t+1}) is the Euclidean distance between the reference point of the key frame R_k^t and the extracted reference point of the new frame R_k^{t+1}.

  4. Then, new candidate locations of the reference point are generated by increasing or decreasing the coordinates of the initially located reference point, based on the key reference point, as expressed below:

    (6) \quad r_k^{t+1}(l, m) = \{ l \pm i,\ m \pm j \},

    where (i, j) denotes the amount by which the reference point is increased or decreased.

  5. Based on the candidate r_k^{t+1}(l, m), a new region is extracted from the input image, and the Euclidean distance between the input image and the extracted region is calculated.

  6. Repeat the above two steps until the predefined search threshold is reached. Finally, the reference point with the minimum Euclidean distance is taken as the final region of the detected image:

    (7) \quad R_k^{t+1}(l, m) = \arg\min_{i=1,\dots,D} D_i,

    where D_i denotes the Euclidean distance of the ith detected candidate region. Correspondingly, based on the extracted reference points, the detected image is tracked through all frames of the input video, which can be represented as follows:

    (8) \quad T_k^p = \{ R_k^t, R_k^{t+1}, \dots, R_k^{t+x} \}, \quad \text{where } (t+x) \le l.

    Equation (8) gives the tracking path of the kth object from the tth frame to the (t+x)th frame, where T_k^p denotes the tracking path of the kth object and the value of t+x depends on the length of the tracking path l.

    (9) \quad Dis(p, q) = ED\{ R_k^t, R_k^{t+1}(l, m) \} \quad \forall k.
  7. Based on the length of the tracking path, the object is retrieved from the input frame. Then, the above process is continued for all the objects present in the video.
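The following sketch condenses steps 1-6 above for one object and one frame transition, then chains the steps into the path of Eq. (8). It assumes grayscale frames held as NumPy arrays; the window half-size `win` and search radius `search` are illustrative parameters, not values from the paper.

```python
import numpy as np

def nsa_track_step(prev_frame, next_frame, ref, win=8, search=5):
    """One nearest-neighbourhood search step (steps 1-6 above).

    `ref` = (l, m) is the reference point R_k^t of object k in the
    previous frame. Candidate points (l +/- i, m +/- j), per Eq. (6),
    are scored by the Euclidean distance (Eq. 5) between the pixel
    patch around the reference point and the patch around each
    candidate in the next frame; the minimiser is R_k^{t+1} (Eq. 7).
    Assumes `ref` lies at least win + search pixels inside the frame.
    """
    l, m = ref
    template = prev_frame[l - win:l + win, m - win:m + win].astype(float)
    best, best_dist = ref, np.inf
    for i in range(-search, search + 1):        # candidate offsets
        for j in range(-search, search + 1):
            li, mj = l + i, m + j
            patch = next_frame[li - win:li + win, mj - win:mj + win].astype(float)
            if patch.shape != template.shape:   # skip out-of-frame candidates
                continue
            dist = np.sqrt(np.sum((template - patch) ** 2))
            if dist < best_dist:
                best, best_dist = (li, mj), dist
    return best

def nsa_track(frames, ref):
    """Track one object through all frames, giving T_k^p (Eq. 8)."""
    path = [ref]
    for prev, nxt in zip(frames, frames[1:]):
        ref = nsa_track_step(prev, nxt, ref)
        path.append(ref)
    return path
```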

3.2 Spatial-EWMA-Based Trajectory Identification Using EWMA

This section describes spatial-EWMA-based trajectory identification [22], in which the object is tracked based on its location. The object is extracted as x, y coordinates from the key frame, and the matching object is tracked in the next frame based on the direction of those coordinates. The location of the same object is then tracked in the third frame based on the distance between its locations in the second and first frames. Using spatial-EWMA-based trajectory identification, the object present in the video is tracked accurately. In the spatial object tracking model, the centre point of the object is used for the calculation, as shown below:

(10) \quad z_k^{t+1} = \alpha r_k^t + (1 - \alpha) z_k^t,
(11) \quad \alpha = \frac{2}{N + 1},

where r_k^t denotes the kth object in the tth frame, N is the number of periods, and z_k^{t+1} is the output of tracking the kth object in the (t+1)th frame using the EWMA. The spatial-EWMA-based trajectory identification is then applied to all objects. The tracking of the (t+1)th frame is based on the result of the tth frame; similarly, the tracking of the (t+2)th frame is based on the result of the (t+1)th frame. This process continues for the length of the tracking path, which terminates at frame t+x.

(12) \quad T_k^E = \{ z_k^t, z_k^{t+1}, \dots, z_k^{t+x} \},

where T_k^E is the tracking path of the kth object, and z_k^t and z_k^{t+1} are the tracked points at the tth and (t+1)th frames, respectively.
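A minimal sketch of the recurrence in Eqs. (10)-(12), smoothing a sequence of observed centre points for one object; the number of periods N = 5 is an illustrative assumption, not a value from the paper.

```python
import numpy as np

def ewma_track(centres, N=5):
    """Spatial-EWMA trajectory smoothing (Eqs. 10-12).

    `centres` is the sequence of observed centre points r_k^t of one
    object, given as (x, y) pairs. alpha = 2 / (N + 1) follows Eq. (11).
    """
    alpha = 2.0 / (N + 1)                       # Eq. (11)
    z = np.asarray(centres[0], dtype=float)     # initialise z_k^t
    path = [z.copy()]
    for r in centres[1:]:
        # Eq. (10): z_{t+1} = alpha * r_t + (1 - alpha) * z_t
        z = alpha * np.asarray(r, dtype=float) + (1 - alpha) * z
        path.append(z.copy())
    return path                                 # T_k^E (Eq. 12)
```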

3.3 Hybrid Model for Video Object Tracking

Basically, neither the neighbourhood model nor the spatial-EWMA-based tracking model on its own achieves tracking as good as the hybrid model. In this section, we integrate the two as a hybrid model, which demonstrates better tracking performance. The major advantage of the NSA model is that it can easily track an object using its colour and shape information; its major drawback is that accurate tracking is not possible when multiple moving objects occlude one another. This drawback is easily handled by the spatial model, which tracks the two-dimensional (2D) position of objects by considering the 2D locations in the preceding frames. The spatial tracking method is more robust and well suited to non-normally distributed characteristics; its drawback is that it does not consider the visual appearance of the objects. The reason for selecting these two models is that the drawback of each is the strength of the other, so integrating them lets the advantages of both be reflected in the tracking, where improvement can be expected.

Moreover, the hybrid model is used to overcome the various multiple object tracking limitations, such as frequent occlusions, parallel manifestation of objects, and interaction between multiple objects. The tracking path based on the hybrid model is represented as follows:

(13) \quad T_k = \left[ C\{ R_k^{t+1} \} + \{ z_k^{t+1} \} \right] \quad \forall k,

where R_k^{t+1} is the reference point selected by the nearest neighbourhood algorithm, C{R_k^{t+1}} is the centre point of the object selected from the (t+1)th frame, and z_k^{t+1} is the centre point of the object obtained using the spatial-EWMA-based tracking method. The final result of the hybrid tracking model is a set of 2D vector representations.
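A sketch of the fusion in Eq. (13) for one object and one frame. The paper states only that the two results are integrated into one 2D point; this sketch assumes a convex (by default equal-weight) combination of the two estimates, and the weight `w` is a hypothetical parameter, not a value from the paper.

```python
import numpy as np

def hybrid_point(nsa_centre, ewma_point, w=0.5):
    """Fuse the two trackers' 2D estimates for one object (Eq. 13).

    `nsa_centre` is C{R_k^{t+1}} from the nearest neighbourhood search
    and `ewma_point` is z_k^{t+1} from the spatial-EWMA model.
    The equal weighting w = 0.5 is an assumption.
    """
    a = np.asarray(nsa_centre, dtype=float)
    b = np.asarray(ewma_point, dtype=float)
    return w * a + (1.0 - w) * b
```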

3.4 Video Object Retrieval Using QSD

This section presents the QSD calculation for the video object retrieval process (Figure 1). After tracking the objects in the video, the user query is applied to retrieve objects using trajectory points. The trajectory points of the user query and the tracked object are not always the same; this occurs, for instance, when two objects travel along the same spatial route at different speeds. Because of this, calculating the distance between the query input and the tracked object is one of the challenging tasks in video object retrieval. To overcome this challenge, the QSD [18] calculation is proposed in this paper.

Figure 1: Block Diagram Representation of the Proposed Video Object Retrieval Methodology.

3.4.1 Query-Specific Distance

This section presents the video object retrieval method based on the proposed QSD model, in which the QSD is applied to the feature library. The feature library contains the details of the tracked objects collected from the proposed hybrid model results. First, the search path is hand-drawn by the user; this is called the query path. Figure 2 shows an example QSD calculation, in which the query vector is compared against the tracked object vector. Both the query vector and the tracked object vector are two-dimensional, but their sizes differ, as shown below:

Figure 2: Example Calculation of QSD.

(14) \quad Q_{xy},\ \{ 1 \le x \le u,\ 1 \le y \le 2 \}; \quad T_{ry}^k,\ \{ 1 \le r \le l,\ 1 \le y \le 2 \},

where u and l are the sizes of the query vector and the tracked object vector, respectively. The tracked paths of the objects are then compared with the user query image.

Basically, the sizes of the query vector and the tracked object vector are not always the same. Therefore, the distance calculation between the query image and the tracked image is handled as in Figure 2. For example, suppose the 2D query image vector has size three and the tracked object vector has size five. To bring the query image and the tracked image to the same size, the Euclidean distance is calculated between them based on the trajectory points. A new 2D vector, named P_{xy}^k, is then generated from this Euclidean distance calculation between the query image and the tracked object. The size of the new vector P_{xy}^k equals the query vector size, as represented below:

(15) \quad P_{xy}^k,\ \{ 1 \le x \le u,\ 1 \le y \le 2 \},
(16) \quad P_{xy}^k = \arg\min_{r} \left( \sqrt{ \sum_{y=1}^{2} \left( T_{ry}^k - Q_{xy} \right)^2 } \right), \quad \{ 1 \le r \le l \},

Then, the minimum value of the new vector is selected for further calculation with the query vector. Finally, the targeted object is retrieved based on the QSD calculation between the query image and the new vector image, which can be expressed as follows:

(17) \quad QSD = \sum_{x=1}^{u} \sqrt{ \sum_{y=1}^{2} \left( Q_{xy} - P_{xy}^k \right)^2 },

where Q_{xy} is the query image drawn by the user and P_{xy}^k is the new matrix derived from the Euclidean distance calculation between the query image and the tracked paths of the objects. Finally, based on the QSD results, the object R_{xy} is retrieved accurately. Figure 3 shows the algorithmic description of the proposed method.
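A minimal sketch of the QSD computation in Eqs. (14)-(17), assuming the query and tracked paths are given as NumPy arrays of 2D points. In retrieval, the object whose tracked path yields the smallest QSD against the query path would be returned.

```python
import numpy as np

def qsd(query, tracked):
    """Query-specific distance between a query path and a tracked path.

    `query` is the u x 2 matrix Q of hand-drawn trajectory points and
    `tracked` the l x 2 matrix T^k of one object's tracked points,
    with u != l in general (Eq. 14). For every query point, the
    nearest tracked point is selected (Eqs. 15-16), giving the u x 2
    matrix P^k; the QSD is then the summed Euclidean distance between
    Q and P^k (Eq. 17).
    """
    Q = np.asarray(query, dtype=float)      # shape (u, 2)
    T = np.asarray(tracked, dtype=float)    # shape (l, 2)
    # Pairwise distances between each query point and each tracked point
    d = np.linalg.norm(Q[:, None, :] - T[None, :, :], axis=2)   # (u, l)
    P = T[np.argmin(d, axis=1)]             # nearest tracked point per query point
    return float(np.sum(np.linalg.norm(Q - P, axis=1)))         # Eq. (17)
```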

Figure 3: Algorithmic Description of the Proposed Method.

4 Results and Discussion

4.1 Experimental Setup

This section describes the experimental setup for the proposed video object tracking model, which is implemented in Matlab 8.3 (R2014a) on a system with an Intel processor, 4 GB RAM, and a 64-bit operating system. The proposed algorithm is evaluated on the CAVIAR dataset [8].

Dataset description: The CAVIAR dataset contains a number of video clips covering various situations, such as oneshop, shopping, threepersons, walkby shop, and walking. The clips were recorded with a wide-angle camera lens and compressed using MPEG2. Most of the resulting files are between 6 and 12 MB in size, with some up to 21 MB [8]. The proposed video object retrieval process is analysed on five videos collected from the CAVIAR dataset, including OneStopMoveEnter1front, EnterExitCrossingPaths2front, ThreePastShop1cor, and WalkByShop1cor. These videos were selected to analyse the robustness of the algorithm against varied backgrounds, fast-moving objects, varied object poses, and variable object sizes.

Evaluation metrics: In this paper, four metrics are used to evaluate system performance: multiple object tracking precision (MOTP) [6], precision, recall, and f-measure. MOTP measures how precisely the tracker estimates object positions: it is the total position error over all matched object-hypothesis pairs divided by the total number of matches made across all frames. Using MOTP, the tracker's ability to estimate the exact positions of the detected persons can be assessed.

(18) \quad \mathrm{MOTP} = \frac{ \sum_{i,t} d_t^i }{ \sum_t M_t },

where d_t^i is the distance between the ith matched object and its corresponding hypothesis at time t, and M_t is the number of matches made at time t. Precision is a performance measure based on the fraction of retrieved objects that are relevant; its value is proportional to the intersection of the relevant and retrieved sets, and it considers all retrieved objects. Recall, in contrast, considers all relevant objects. The f-measure is the harmonic mean of precision and recall, described as follows:

(19) \quad \mathrm{Precision} = \frac{TP}{TP + FP},
(20) \quad \mathrm{Recall} = \frac{TP}{TP + FN},
(21) \quad F = \frac{ 2 \times (\mathrm{Precision} \times \mathrm{Recall}) }{ \mathrm{Precision} + \mathrm{Recall} },

where true positive (TP) is correctly identified, false positive (FP) is incorrectly identified, true negative (TN) is correctly rejected, and false negative (FN) is incorrectly rejected.
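As a compact reference for Eqs. (18)-(21), the following sketch computes the four metrics from raw counts; the argument names and the usage values are placeholders, not numbers from the paper.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and f-measure from raw counts (Eqs. 19-21)."""
    precision = tp / (tp + fp)                                 # Eq. (19)
    recall = tp / (tp + fn)                                    # Eq. (20)
    f_measure = 2 * precision * recall / (precision + recall)  # Eq. (21)
    return precision, recall, f_measure

def motp(distances, matches_per_frame):
    """MOTP (Eq. 18): total position error of all matched
    object-hypothesis pairs divided by the total number of matches.

    `distances` holds every d_t^i; `matches_per_frame` holds M_t.
    """
    return sum(distances) / sum(matches_per_frame)

# Example: precision_recall_f(75, 25, 20) -> (0.75, 0.789..., 0.769...)
```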

4.2 Experimental Results

This section presents the experimental tests on the CAVIAR dataset, one of the openly available datasets for video surveillance applications. To illustrate the tracking evaluation, image sequences were extracted from the five videos of the CAVIAR dataset. The results in Figure 4 demonstrate the performance of the hybrid tracking model combining the spatial-EWMA and NSA methods. Figure 4A shows the tracked path of every person in the first frame of video 1 using the proposed hybrid model; Figure 4B shows the query path drawn by the user on that frame; and Figure 4C shows the retrieved path of the walking persons based on the query input. Similarly, Figure 4D shows the tracked paths of persons at the shopping mall entrance in video 2, Figure 4E the user-supplied query path, and Figure 4F the retrieved path. Figure 4G shows the tracked paths of multiple persons in the first frame of video 3, Figure 4H the user query, and Figure 4I the persons retrieved from among them based on the query input. Figure 4J shows the tracked paths at the shopping mall entrance in video 4, with the corresponding query and retrieved paths in Figure 4K and L. Finally, the tracked path, query path, and retrieved path of persons along the corridor in video 5 are shown in Figure 4M-O.

Figure 4: Tracking Results Using the Hybrid Model. First frame of video 1: (A) Paths tracked, (B) Query path, (C) Retrieved path. First frame of video 2: (D) Paths tracked, (E) Query path, (F) Retrieved path. First frame of video 3: (G) Path tracked, (H) Query path, (I) Retrieved path. First frame of video 4: (J) Path tracked, (K) Query path, (L) Retrieved path. First frame of video 5: (M) Path tracked, (N) Query path and (O) Retrieved path.

4.3 Performance Analysis of Object Tracking

This section presents the performance analysis of object tracking using the proposed hybrid model. The MOTP of the proposed hybrid model is compared with that of the NSA and EWMA models. Since the precision of the tracked objects is computed from the MOTP results, MOTP analysis is used here to assess the performance of the object tracking models.

Figure 5 shows the performance analysis of the proposed hybrid model compared with the two existing models, NSA and EWMA. As the number of objects in the video increases, maintaining tracking performance becomes challenging. Figure 5A shows the object tracking performance in video 1. With two objects, the MOTP of the NSA model, EWMA model, and hybrid model is 74%, 76%, and 79%, respectively. When the number of objects increases from two to four, the MOTP of the NSA model drops from 74% to 71%, that of the EWMA model from 76% to 75%, and that of the hybrid model from 79% to 76%. However, the reduction is smaller for the hybrid model than for the existing NSA and EWMA models. When the number of objects is further increased to eight, the MOTP of both the hybrid and NSA models is 71%, while the EWMA model achieves 70%.

Figure 5: Performance Analysis of Object Tracking in (A) Video 1 and (B) Video 2.

Figure 5B shows the MOTP analysis of the hybrid, NSA, and EWMA models on video 2. Initially, only one person appears on the screen; at that point, the MOTP is highest for the NSA model (75.4%), while the hybrid and EWMA models attain 74% and 75.1%, respectively. When the number of objects increases to two, the hybrid model attains a good MOTP of 74.9%, indicating high tracking performance, while the NSA and EWMA models attain 74% and 71%, respectively. When the number of objects increases to three, the hybrid model's performance drops slightly from 74.9% to 74.6%, while the EWMA and NSA models attain 69% and 71%, respectively.

The MOTP of the proposed model is also compared with the NSA and EWMA models on video 3. Figure 6A shows the tracking performance analysis of video 3, in which the first frame is considered for evaluation. As the number of objects increases, the object tracking performance degrades gradually. With four objects, the MOTP of the hybrid model is 74%, while the EWMA and NSA models attain 70.1% and 73%, respectively; the EWMA model thus performs noticeably worse than the other two. When the number of objects increases to six, the EWMA model achieves a lower MOTP of 70%, while the hybrid and NSA models attain 71.3% and 71.1%, respectively.

Figure 6: Performance Analysis of Object Tracking in (A) Video 3 and (B) Video 4.

Figure 6B shows the MOTP analysis of the object tracking models on video 4. When the number of objects increases from two to three, the hybrid, NSA, and EWMA models attain 78%, 76%, and 74%, respectively; with three objects, the hybrid model provides the best tracking performance. When the number of objects increases to six, the NSA and EWMA models both maintain a tracking performance of 71%, while the hybrid model attains 74%. When the number of objects increases to seven, the hybrid and NSA models perform nearly the same, and the EWMA model reaches 70% for more than six objects.

In Figure 7, the tracking performance of the models is analysed on video 5 using the MOTP measure. When the number of objects increases to three, the hybrid tracking model attains the best performance: the NSA and EWMA models reach 74% and 73%, respectively, while the hybrid model attains 75%. When the number of objects increases from three to four, the NSA and EWMA models achieve 72% and 73%, respectively, while the hybrid model attains 74%. From the results in Figure 7, we conclude that the tracking performance of the proposed hybrid model is higher than that of the other tracking models, NSA and EWMA.

Figure 7: Performance Analysis of Object Tracking in Video 5.

4.4 Performance Analysis of Video Object Retrieval

This section presents the performance analysis of video object retrieval using three parameters: precision, recall, and f-measure. The performance of the proposed model is compared with the NSA model, EWMA model, colour feature-based model [6], and trajectory clustering-based model [20]. After the tracked paths of the video objects are found, the proposed QSD calculation is applied to retrieve the video objects, and the retrieval performance is analysed in terms of precision, recall, and f-measure.

Figure 8 (A, C, and E) shows the performance analysis of tracked object retrieval in video 1. Here, the object retrieval precision of the hybrid model and the NSA model is the same across all movements. When the number of relevant paths increases to two, the precision of the hybrid model is 75.1%, the same as that of the NSA model, while the accuracy of the objects retrieved with the EWMA model is much lower. The recall parameter is then evaluated for the various tracking models; as the number of relevant paths increases, the recall increases roughly linearly. With two relevant paths, the recall of the hybrid model is 80%, while the NSA and EWMA models attain 79% and 74%, respectively, so the recall of the hybrid model is the highest. Finally, precision and recall are combined into the f-measure to assess object retrieval performance. When the number of retrieved paths increases from two to three, the f-measure of the hybrid model is 76%; with two retrieved paths, the f-measure of the NSA and EWMA models is 74% and 72%, respectively. Overall, the proposed model outperforms the existing models, namely the NSA model, EWMA model, colour feature-based model [6], and trajectory clustering-based model [20], in terms of precision, recall, and f-measure.

Figure 8: Performance Analysis of Video Object Retrieval for Video 1 and Video 2. (A) Video 1-Precision, (B) Video 2-Precision, (C) Video 1-Recall, (D) Video 2-Recall, (E) Video 1-F-measure, (F) Video 2-F-measure.

The performance analysis of tracked object retrieval in video 2 is shown in Figure 8B, D, and F. When the number of relevant paths increases to two, the precision of the NSA model drops from 76% to 75%, while the precision of the hybrid and EWMA models remains the same for all relevant paths. The recall parameter is then calculated for all the retrieval models: with three relevant paths, the recall of the hybrid, NSA, and EWMA models is 82%, 74%, and 75%, respectively, so the proposed hybrid model achieves the better retrieval performance. The f-measure of the hybrid model is also higher than that of the other tracking models: when the number of relevant paths increases from two to three, the f-measure of the hybrid model reaches 76%, while the NSA and EWMA models attain 75% and 74%, respectively.

Figure 9 (A, C, and E) shows the performance analysis of video object retrieval in the third video using precision, recall, and f-measure. The precision of the hybrid model is the highest for all relevant paths compared with the NSA, EWMA, colour feature-based [6], and trajectory clustering-based [20] models.

Figure 9: Performance Analysis of Video Object Retrieval for Video 3 and Video 4. (A) Video 3-Precision, (B) Video 4-Precision, (C) Video 3-Recall, (D) Video 4-Recall, (E) Video 3-F-measure and (F) Video 4-F-measure.

With two relevant paths, the precision of the hybrid model is 74%, while the NSA and EWMA models attain 73% and 70%, respectively; in terms of precision, the proposed hybrid model therefore achieves the better retrieval performance. The recall is then evaluated for the tracking models: the NSA and EWMA models attain 72% and 71%, respectively, while the recall of the hybrid model is considerably higher at 82%. From this analysis, the hybrid model achieves better performance than the existing NSA, EWMA, colour feature-based [6], and trajectory clustering-based [20] models. The f-measure comparison between the existing tracking algorithms and the proposed hybrid model leads to the same conclusion: the proposed hybrid model achieves the better tracking performance on video 3.

The performance analysis of video object retrieval in video 4 is shown in Figure 9B, D, and F. With three relevant paths, the precision of the EWMA model is 70%, while the hybrid model attains 71%, the same as the NSA model. The recall of the hybrid, EWMA, and NSA models is 78%, 72%, and 73%, respectively, from which we conclude that the proposed hybrid model provides the better retrieval performance. The f-measure is then used to identify the better retrieval model: with two relevant paths, the f-measure of the hybrid, EWMA, and NSA models is 74%, 73%, and 70%, respectively.

Figure 10 shows the performance analysis of video object retrieval in video 5 using precision, recall, and f-measure. Precision is measured first for all three tracking algorithms. The precision of the proposed hybrid model decreases from 76% to 75% when the number of relevant paths changes from one to two, while the NSA and EWMA models attain 72.8% and 72.4%, respectively. Precision alone is not enough to measure performance, so recall is used to measure the coverage of relevant results: with three relevant paths, the recall of the hybrid, NSA, and EWMA models is 74.8%, 72%, and 74.8%, respectively. Finally, the f-measure is used to identify the better retrieval model: with three relevant paths, the f-measure of the hybrid model is 75%, which is higher than that of the other models, namely the NSA model, EWMA model, colour feature-based model [6], and trajectory clustering-based model [20].

Figure 10: Performance Analysis of Video Object Retrieval for Video 5. (A) Video 5-Precision, (B) Video 5-Recall and (C) Video 5-F-measure.

Table 1 shows the average performance of the proposed hybrid model, NSA model, EWMA model, colour feature-based model [6], and trajectory clustering-based model [20]. The table shows that the proposed hybrid model attains the maximum precision of 0.7517, higher than that of the existing models. Similarly, the proposed hybrid model obtains the maximum recall of 0.8106 and the maximum f-measure of 0.7670. The table thus confirms that the proposed hybrid model outperforms the existing models.

Table 1: Average Performance of the Algorithms.

Model                                      Precision   Recall    f-Measure
NSA model                                  0.7413      0.7732    0.7434
EWMA model                                 0.7413      0.7748    0.7581
Colour feature-based model [6]             0.7413      0.7740    0.7507
Trajectory clustering-based model [20]     0.7465      0.7927    0.7625
Proposed hybrid model                      0.7517      0.8106    0.7670

Bold values in the table indicate better performance.

5 Conclusion

In this paper, we have presented a technique for video object retrieval using QSD and spatial-EWMA-based trajectory identification, in which objects are retrieved based on trajectory points. First, the input video objects are tracked using the proposed hybrid model, in which objects are extracted in two ways: with the NSA model and with the spatial-EWMA model. Retrieval is then carried out using the proposed QSD, which calculates the distance between the trajectory points of the query image vector and the tracked object vector. Finally, the proposed hybrid model is compared with existing algorithms, namely the NSA and EWMA models, using precision, recall, and f-measure. The results show that the proposed hybrid model retrieves video objects effectively, with an f-measure of 76.7%. In future work, we intend to apply our method to identify arbitrarily complex human actions in full-length feature films.

Bibliography

[1] R. Arroyo, J. J. Yebes, L. M. Bergasa, I. G. Daza and J. Almazan, Expert video-surveillance system for real-time detection of suspicious behaviors in shopping malls, Expert Syst. Appl. 42 (2015), 7991–8005. doi:10.1016/j.eswa.2015.06.016.

[2] I. Biederman, Recognition-by-components: a theory of human image understanding, Psychol. Rev. 94 (1987), 115–147. doi:10.1037/0033-295X.94.2.115.

[3] W. Brendel and S. Todorovic, Video object segmentation by tracking regions, in: Proceedings of IEEE Conference on Computer Vision, vol. 28, pp. 778–785, 2009. doi:10.1109/ICCV.2009.5459242.

[4] T. Brox and J. Malik, Large displacement optical flow: descriptor matching in variational motion estimation, IEEE Trans. Pattern Anal. Mach. Intell. 33 (2011), 500–513. doi:10.1109/TPAMI.2010.143.

[5] T. Brox and J. Malik, Object segmentation by long term analysis of point trajectories, in: Proceedings of Computer Vision – ECCV, vol. 6315, pp. 282–295, 2010.

[6] Z. Cai, Y. Liang, H. Hu and W. Luo, Offline video object retrieval method based on color features, Comput. Intell. Intell. Syst. 575 (2016), 495–505. doi:10.1007/978-981-10-0356-1_53.

[7] X. Cao, F. Wang, B. Zhang, H. Fu and C. Li, Unsupervised pixel-level video foreground object segmentation via shortest path algorithm, Neurocomputing 172 (2016), 235–243. doi:10.1016/j.neucom.2014.12.105.

[8] CAVIAR Test Case Scenarios, available at: http://homepages.inf.ed.ac.uk/rbf/CAVIARDATA1/, accessed 25 February 2016.

[9] H. Y. Cheng and J. N. Hwang, Integrated video object tracking with applications in trajectory-based event detection, Vis. Commun. Image Represent. 22 (2011), 673–685. doi:10.1016/j.jvcir.2011.07.001.

[10] F. Chevalier, M. Delest and J. Domenger, A heuristic for the retrieval of objects in low resolution video, in: Proceedings of the International Workshop on Content-Based Multimedia Indexing, pp. 144–151, 2007. doi:10.1109/CBMI.2007.385404.

[11] C. H. Chuang, S. C. Cheng, C. C. Chang and P. P. Chen, Model-based approach to spatial-temporal sampling of video clips for video object detection by classification, Vis. Commun. Image Represent. 25 (2014), 1018–1030. doi:10.1016/j.jvcir.2014.02.014.

[12] H. Diez-Rodriguez, G. Morales-Luna and J. O. Olmedo-Aguirre, Ontology-based knowledge retrieval, in: Proceedings of the International Conference on Artificial Intelligence, pp. 23–28, 2008. doi:10.1109/MICAI.2008.25.

[13] B. Erol and F. Kossentini, Shape-based retrieval of video objects, IEEE Trans. Multimed. 7 (2005), 179. doi:10.1109/TMM.2004.840607.

[14] I. Gómez-Conde and D. N. Olivieri, A KPCA spatio-temporal differential geometric trajectory cloud classifier for recognizing human actions in a CBVR system, Expert Syst. Appl. 42 (2015), 5472–5490. doi:10.1016/j.eswa.2015.03.010.

[15] J. Gómez-Romero, M. A. Patricio, J. García and J. M. Molina, Ontology-based context representation and reasoning for object tracking and scene interpretation in video, Expert Syst. Appl. 38 (2011), 7494–7510. doi:10.1016/j.eswa.2010.12.118.

[16] J. Gong and C. H. Caldas, An object recognition, tracking, and contextual reasoning-based video interpretation method for rapid productivity analysis of construction operations, Autom. Construct. 20 (2011), 1211–1226. doi:10.1016/j.autcon.2011.05.005.

[17] R. Hu and J. Collomosse, Motion-sketch based video retrieval using a Trellis Levenshtein distance, in: Proceedings of the International Conference on Pattern Recognition, pp. 121–124, 2010. doi:10.1109/ICPR.2010.38.

[18] Y. Jing, M. Covell, D. Tsai and J. M. Rehg, Learning query-specific distance functions for large-scale web image search, IEEE Trans. Multimed. 15 (2013), 2022–2034. doi:10.1109/TMM.2013.2279663.

[19] L. Kratz and K. Nishino, Tracking pedestrians using local spatio-temporal motion patterns in extremely crowded scenes, IEEE Trans. Pattern Anal. Mach. Intell. 34 (2012), 987–1002. doi:10.1109/TPAMI.2011.173.

[20] Y. H. Lai and C. K. Yang, Video object retrieval by trajectory and appearance, IEEE Trans. Circ. Syst. Video Technol. 25 (2015), 1026–103. doi:10.1109/TCSVT.2014.2358022.

[21] W. N. Lie and W. C. Hsiao, Content-based video retrieval based on object motion trajectory, in: Proceedings of Multimedia Signal Processing, pp. 237–240, 2002.

[22] K. Liu, B. Liu, E. Blasch, D. Shen, Z. Wang, H. Ling and G. Chen, A cloud infrastructure for target detection and tracking using audio and video fusion, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2015. doi:10.1109/CVPRW.2015.7301299.

[23] A. H. Mazinan, A. Amir-Latifi and M. F. Kazemi, A knowledge-based objects tracking algorithm in color video using Kalman filter approach, in: Proceedings of the International Conference on Information Retrieval & Knowledge Management, pp. 50–53, 2012. doi:10.1109/InfRKM.2012.6205034.

[24] M. Safar, C. Shahabi and X. Sun, Image retrieval by shape: a comparative study, in: Proceedings of the IEEE International Conference on Multimedia and Exposition, vol. 1, pp. 141–144, 2000.

[25] J. Sivic and A. Zisserman, Efficient visual search of videos cast as text retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2009), 591–606. doi:10.1109/TPAMI.2008.111.

[26] J. Sivic, F. Schaffalitzky and A. Zisserman, Efficient object retrieval from videos, in: Proceedings of the 12th European Conference on Signal Processing, pp. 1737–1740, 2004.

[27] L. F. Teixeira and L. Corte-Real, Video object matching across multiple independent views using local descriptors and adaptive learning, Pattern Recognit. Lett. 30 (2009), 157–167. doi:10.1016/j.patrec.2008.04.001.

[28] P. Turaga and R. Chellappa, Nearest-neighbor search algorithms on non-Euclidean manifolds for computer vision applications, in: Proceedings of the Seventh Indian Conference on Computer Vision, Graphics and Image Processing, pp. 282–289, 2010. doi:10.1145/1924559.1924597.

Received: 2016-6-22
Published Online: 2016-11-17
Published in Print: 2018-3-28

©2018 Walter de Gruyter GmbH, Berlin/Boston

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
