Article Open Access

Visualization of large-scale user association feature data based on a nonlinear dimensionality reduction method

  • Linlin Nong
Published/Copyright: August 8, 2025

Abstract

The prosperity of data science and the booming growth of the internet industry have made the analysis and processing of large-scale user-related feature data a particularly important issue in modern society. Traditional linear dimensionality reduction methods often fail to extract intuitive information or discover the intrinsic connections and patterns in complex high-dimensional data. This study proposes a large-scale user association feature data (LSUAFD) visualization study based on nonlinear dimensionality reduction methods to address the complexity and high dimensionality of such data. First, an LSUAFD visualization platform based on the nonlinear dimensionality reduction method is constructed. Next, the nonlinear dimensionality reduction algorithms are integrated into the platform, and finally the platform's functional modules and data dimensionality reduction modules are tested. The results indicated that combining principal component analysis with nonlinear dimensionality reduction algorithms (t-SNE and LargeVis), together with interactive parameter adjustment and visualization in the front end, yielded excellent analytical performance on the selected datasets. Furthermore, the method could stably and accurately detect outliers within the datasets. The errors were primarily distributed between −0.4 and 0.4, indicating a high level of precision in the interpretation of data distributions.

1 Introduction

With the advent of the big data era, obtaining useful information from massive, high-dimensional user association feature data has become a research focus for many scholars [1]. Large-scale user association feature data (LSUAFD) are highly complex and highly correlated, and they encode many latent user behaviors and consumption patterns [2]. However, the processing and parsing of LSUAFD face enormous challenges [3]. Traditional linear dimensionality reduction methods have low computational complexity and relatively simple processing pipelines, and they have been widely applied [4]. However, as the volume and complexity of data increase, these methods often fail to capture the important information in high-dimensional data (HDD), resulting in poor performance on LSUAFD [5]. This study proposes a visualization study of LSUAFD based on the nonlinear dimensionality reduction (NLDR) method, addressing the high dimensionality and processing complexity of LSUAFD. The study uses principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and LargeVis as the main dimensionality reduction methods. PCA is used to quickly process large amounts of data, while t-SNE and LargeVis are advanced NLDR algorithms that can accurately reveal the intrinsic correlations of HDD while maintaining the data structure. These methods are integrated into an LSUAFD visualization platform to enhance the visualization and analysis of HDD. The purpose is to uncover the knowledge and information behind these high-dimensional, massive data more effectively and efficiently, and to advance LSUAFD visualization processing technology.

The innovation of the research is fivefold. First, a modular architecture improves the maintainability and scalability of the data processing flow and allows performance optimization for specific requirements. Second, parallel computing technology improves the computational efficiency on large-scale datasets by integrating several dimensionality reduction algorithms (PCA, t-SNE, and LargeVis) and optimizing algorithm combinations through a dynamic parameter adjustment mechanism. Third, an algorithm selection function based on data features dynamically selects the optimal algorithm. Fourth, by combining the advantages of nonlinear and linear dimensionality reduction, a collaborative application method for the comprehensive analysis of complex data structures is proposed for the first time. Finally, an interactive visualization interface integrating the SlickGrid plugin and a JS computing module supports real-time parameter adjustment and observation of dimensionality reduction results, enhancing the intuitiveness of data analysis.

The article is organized into five sections. Section 1 introduces LSUAFD and its high-dimensional features in the context of information technology development. Section 2 summarizes the research status of NLDR methods, the acquisition of user association feature data (UAFD) in various fields, and the processing of LSUAFD. Section 3 presents the design of the NLDR-based LSUAFD visualization platform, including the platform architecture and functional module design, as well as the implementation of dimensionality reduction combining PCA, t-SNE, and LargeVis. Section 4 verifies the platform's functionality through the testing of functional modules and data dimensionality reduction (DDR) modules. Section 5 summarizes the research methods and results.

2 Related works

In recent years, because dimensionality reduction can better capture the internal structure and patterns of data, and because of its advantages for complex, high-dimensional, and large-scale data, many scholars have conducted research on DDR. Jia et al. reviewed the two main routes of dimensionality reduction, feature selection and feature extraction, analyzed the current mainstream dimensionality reduction algorithms, provided application examples for each algorithm, and evaluated their advantages and disadvantages. The key objectives were to achieve low loss, maintain the essence of the original data, and obtain optimal low-dimensional data during feature dimensionality reduction [6]. Guo et al. put forward a prior correlation graph construction method to model and discover complex relations between data. Applying the prior correlation graph model to two typical data analysis tasks showed that considerable performance could be achieved, providing an efficient solution for numerous data management tasks in AIoT [7]. Uwaeze et al. explored the feasibility of NLDR methods, including locally linear embedding (LLE) and isometric feature mapping (Isomap), for automatically identifying active multiple sclerosis lesions in brain MRI images without gadolinium contrast agents. Multi-parameter MRI datasets, such as FLAIR and T2-weighted images, were reconstructed into a single embedded image through LLE and Isomap and compared with expert-marked lesions. The Dice similarity indices of LLE and Isomap were 0.74 ± 0.1 and 0.78 ± 0.09, respectively, outperforming existing methods and indicating that these methods could serve as clinical decision-making tools [8]. Gunduz proposed a classification system based on sound features extracted from individual recordings, which uses deep features to train a multi-kernel support vector machine classifier. Compared with the results without dimensionality reduction, the accuracy and MCC of this model improved by about 9 and 22%, respectively [9]. Toups et al. explored the role of complex sequence evolution models in explaining gene tree discordance. Using tetrapod mitochondrial genome data, they introduced two such models, a covarion model and a partition model, to improve model fit. Although these models fit the data better than traditional models, gene tree discordance remained significant, indicating that the observed variation might stem from biological factors rather than insufficient model fit [10].

Currently, e-commerce websites are increasingly popular because they are convenient and affordable, and the global online population is expanding rapidly, attracting scholars' attention. Piñero et al. combined the functions of Cytoscape and DisGeNET in the DisGeNET Cytoscape app to visualize disease networks, genes, and the disease enrichment of mutations. The experimental results could facilitate the development of reproducible and scalable analytical workflows based on DisGeNET data [11]. Sarker et al. comprehensively introduced mobile data science and intelligent applications, covering concepts and AI-based modeling, to improve human life in various daily situations, providing reference and guidance for mobile application developers and researchers in this field [12]. Szklarczyk et al. described changes to the STRING text mining system, including a new physical interaction scoring mode and extensive user interface functionality for customizing, expanding, and sharing protein networks. They also explored querying with whole-genome experimental data, including automatic detection of enriched functionality and potential biases in user query data [13]. Choi et al. studied the rates and correlates of telehealth use among adults over 70 years old to understand the importance of internet and device knowledge in promoting telehealth. Telehealth use increased from 4.6% before the pandemic to 21.1%, and use was negatively correlated with older age and low income [14]. Fareed et al. examined health-related internet use (HRIU) among cancer survivors, conducting descriptive and weighted multiple logistic regression analyses of the prevalence, trends, and user profiles of emailing doctors and purchasing drugs online. Among internet-using cancer survivors, the prevalence of all other types of HRIU trended upward year by year [15].

In summary, although there have been many studies on the processing and visualization of LSUAFD, existing methods still have some shortcomings. Firstly, traditional linear dimensionality reduction methods (such as PCA) often have difficulty in capturing the nonlinear structure of data when processing complex HDD, resulting in information loss and insufficient pattern recognition capabilities. Secondly, although existing NLDR methods (such as t-SNE) can reveal the intrinsic structure of data to a certain extent, they are less efficient when processing large-scale data sets and are more sensitive to parameter selection. In addition, most existing studies have failed to fully consider the impact of user attributes (such as gender and age) on data visualization, resulting in insufficient interpretability of visualization results. In response to the above challenges, the study builds a visualization platform that integrates nonlinear and linear dimensionality reduction technologies. PCA is used to achieve efficient dimensionality reduction preprocessing, combined with the nonlinear mapping advantages of t-SNE and LargeVis algorithms, to break through the limitations of traditional methods in analyzing complex data structures. The platform design takes “data preprocessing – feature extraction – multi-algorithm fusion – interactive visualization” as the core process. It realizes full-chain support from data input to knowledge discovery through a modular architecture.

The innovation of this study lies in addressing the complexity and high dimensionality of data processing: by using the NLDR method to construct the LSUAFD visualization platform, data information can be acquired and processed effectively. The contribution of the research lies in the development of a web-based LSUAFD visualization platform, which utilizes the Django framework and a MySQL database for data storage and processing. The platform consists of six functional modules: user management, file management, statistical analysis, graph analysis, DDR, and high-dimensional visualization. The front-end interface integrates the SlickGrid plugin and JS visualization tools, optimizing the display of and interaction with HDD. The platform defines gender and age similarity formulas to weight user attribute similarity and predicts target user ratings based on a comprehensive rating similarity. The DDR module integrates the PCA, t-SNE, and LargeVis algorithms, which reduce data complexity and improve visualization quality through preprocessing, feature selection, and mapping to a low-dimensional space. The module lets users select algorithms and parameters according to their needs, achieving a multi-dimensional visualization of the data.

3 Design of LSUAFD visualization platform based on NLDR method

In today’s information society, the acquisition and processing of LSUAFD has become particularly crucial, revealing user behavior patterns and consumption habits [16]. However, these high-dimensional and complex data processing and interpretation face enormous challenges, especially in data visualization. This study designs an LSUAFD visualization platform based on a combination of NLDR methods (t-SNE and LargeVis) and PCA, applying it to data visualization. The goal is to effectively obtain massive user information to support decision-making and innovative development in e-commerce and other related fields.

3.1 Construction of NLDR-based LSUAFD visualization platform

In the digital age, the emergence of massive data has brought growing challenges of data complexity and multidimensionality. In LSUAFD, these data contain immeasurable information value, but they also pose challenges in data processing and interpretation [17]. This study designs an LSUAFD visualization platform based on a combination of NLDR methods, specifically t-SNE, LargeVis, and PCA. First, PCA performs preliminary processing on the data, reducing its dimensionality and lowering the computational complexity. Next, t-SNE and LargeVis exploit their nonlinear properties to further reduce the dimensionality, extracting richer structural information. This combination not only reduces the computational burden effectively but also maximizes the preservation of data features. Figure 1 shows the basic architecture of the platform.

Figure 1: Platform architecture design.

The platform adopts a B/S architecture, with backend services developed on the Django framework, integrating a MySQL database and a distributed file storage system to support efficient storage and parallel processing of large-scale user data. The front end consists mainly of the SlickGrid data display plugin and JS visualization and computation modules, which support the intuitive display and fast manipulation of HDD. The gender similarity coefficient of users $u$ and $v$ is denoted $S(u,v)$: it equals 1 when the users' genders match and 0 when they differ. The formula for calculating the gender similarity between $u$ and $v$ is shown in Eq. (1).

(1) $S(u,v) = \begin{cases} 1, & S_u = S_v \\ 0, & S_u \neq S_v \end{cases}$

In Eq. (1), $S_u$ represents the gender of $u$ and $S_v$ the gender of $v$. The formula for calculating the age similarity $A(u,v)$ between $u$ and $v$ is shown in Eq. (2).

(2) $A(u,v) = \min\left(1, \dfrac{5}{|A_u - A_v|}\right)$

In Eq. (2), $A_u$ represents the age of $u$ and $A_v$ the age of $v$. Weighting and combining the gender and age similarities yields the user attribute similarity, whose calculation formula is shown in Eq. (3).

(3) $\mathrm{Sim}_{\mathrm{attr}}(u,v) = \alpha S(u,v) + (1 - \alpha) A(u,v)$

In Eq. (3), $\alpha$ represents the weight given to gender and $(1-\alpha)$ the weight given to age, with values in $[0, 1]$. Compared with the $O(N^2)$ computational complexity of Uniform Manifold Approximation and Projection (UMAP) on large-scale data, LargeVis optimizes the time complexity to $O(N \log N)$ through a hierarchical clustering tree structure, which makes it more suitable for the user data in this study. In addition, t-SNE has a clear advantage over isometric mapping (Isomap), because its Student-t fitting effectively alleviates the crowding problem when mapping the high-dimensional space to the low-dimensional space, giving it the edge in visualization quality. In the data preprocessing stage, PCA is selected as the initial dimensionality reduction step because it can quickly reduce the dimension to 50–100 dimensions while maintaining the global structure of the data, providing efficient input for the subsequent NLDR algorithms; on the dataset of this study in particular, PCA's fast processing lets the NLDR algorithms work more efficiently. Further analysis shows that the efficiency of LargeVis is significantly higher than that of the other algorithms, which is closely tied to its graph construction method and the characteristics of the dataset. LargeVis builds a local neighbor graph that, compared with a traditional global graph structure, captures the local structure of the data more effectively and thus significantly improves computational efficiency. The dataset of this study also has strong inter-point correlation, and LargeVis handles these correlations and the sparsity of the dataset well, reducing unnecessary complexity during computation. In contrast, algorithms that construct a global graph consume more computing resources and are therefore less efficient on large-scale datasets. This efficiency directly affects the user feature mapping quality of the visualization platform. The data dimensionality reduction module integrates the PCA, t-SNE, and LargeVis algorithms, supporting the whole pipeline from HDD to low-dimensional visualization.
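To make the attribute-similarity computation of Eqs. (1)–(3) concrete, the following is a minimal Python sketch; the function names and example values are illustrative, not part of the platform's codebase, and it assumes the $\min(1, 5/|A_u - A_v|)$ reading of Eq. (2).

```python
def gender_similarity(s_u, s_v):
    """Eq. (1): 1 when the two gender labels match, 0 otherwise."""
    return 1.0 if s_u == s_v else 0.0


def age_similarity(a_u, a_v):
    """Eq. (2): min(1, 5/|A_u - A_v|); identical ages count as fully similar."""
    diff = abs(a_u - a_v)
    return 1.0 if diff == 0 else min(1.0, 5.0 / diff)


def attribute_similarity(s_u, s_v, a_u, a_v, alpha=0.5):
    """Eq. (3): weighted blend of gender and age similarity, alpha in [0, 1]."""
    return alpha * gender_similarity(s_u, s_v) + (1 - alpha) * age_similarity(a_u, a_v)


# e.g. same gender, ages 25 and 31: 0.6 * 1 + 0.4 * (5 / 6) ≈ 0.93
print(attribute_similarity("F", "F", 25, 31, alpha=0.6))
```

Figure 2 shows the six core functional modules of the platform.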

Figure 2: Functional module design framework.

The platform consists of six main functional modules: user management, file management, statistical analysis, graph analysis, DDR, and high-dimensional visualization. Through this platform, LSUAFD can be computed directly, enabling the visualization of massive user information. Predicting ratings from user interest feature data increases data density and reflects user similarity. The rating similarity between users $u$ and $v$ is expressed in Eq. (4).

(4) $\mathrm{Sim}_{\mathrm{sco}}(u,v) = \dfrac{\sum_{i \in I_{uv}} (R_{u,i} - \bar{R}_u)(R_{v,i} - \bar{R}_v)}{\sqrt{\sum_{i \in I_{uv}} (R_{u,i} - \bar{R}_u)^2} \sqrt{\sum_{i \in I_{uv}} (R_{v,i} - \bar{R}_v)^2}}$

In Eq. (4), $\mathrm{Sim}_{\mathrm{sco}}(u,v)$ is the rating similarity between $u$ and $v$, and $I_{uv}$ is the set of items rated by both users. After obtaining the rating similarity and attribute similarity, a weighted synthesis is performed to calculate the comprehensive similarity, as shown in Eq. (5).

(5) $\mathrm{Sim}(u,v) = \lambda\,\mathrm{Sim}_{\mathrm{attr}}(u,v) + (1 - \lambda)\,\mathrm{Sim}_{\mathrm{sco}}(u,v)$

In Eq. (5), $\mathrm{Sim}_{\mathrm{attr}}(u,v)$ denotes the user attribute similarity and $\mathrm{Sim}_{\mathrm{sco}}(u,v)$ the user rating similarity. $\lambda, (1-\lambda) \in [0,1]$ are the weights of user attributes and ratings in the overall similarity; the larger a weight, the greater its impact on user similarity. With the comprehensive similarity in hand, the rating of target user $u$ is predicted from the ratings of the user's similar neighbors. The predicted rating $P_{u,i}$ of user $u$ for product $i$ is calculated as shown in Eq. (6).

(6) $P_{u,i} = \bar{R}_u + \dfrac{\sum_{v \in S(u)} \mathrm{Sim}(u,v)\,(R_{v,i} - \bar{R}_v)}{\sum_{v \in S(u)} \mathrm{Sim}(u,v)}$

In Eq. (6), $v$ denotes a user similar to $u$, $S(u)$ is the set of similar neighbors of $u$, $\mathrm{Sim}(u,v)$ is the comprehensive similarity between $u$ and $v$, $R_{v,i}$ is $v$'s rating of product $i$, and $\bar{R}_v$ is $v$'s average rating over all products.
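The following Python sketch shows one way to realize Eqs. (4)–(6), assuming ratings are stored as dictionaries mapping item IDs to scores; the helper names are hypothetical rather than the platform's actual API.

```python
import numpy as np


def rating_similarity(r_u, r_v):
    """Eq. (4): Pearson correlation over the items I_uv rated by both users."""
    common = sorted(set(r_u) & set(r_v))
    if len(common) < 2:
        return 0.0
    mu_u, mu_v = np.mean(list(r_u.values())), np.mean(list(r_v.values()))
    du = np.array([r_u[i] for i in common]) - mu_u
    dv = np.array([r_v[i] for i in common]) - mu_v
    denom = np.linalg.norm(du) * np.linalg.norm(dv)
    return float(du @ dv / denom) if denom > 0 else 0.0


def predict_rating(u, item, ratings, sim):
    """Eq. (6): mean rating of u plus similarity-weighted rating deviations
    of neighbors, where sim maps v -> Sim(u, v) from Eq. (5)."""
    mu_u = np.mean(list(ratings[u].values()))
    num = den = 0.0
    for v, s in sim.items():
        if v == u or item not in ratings[v]:
            continue
        mu_v = np.mean(list(ratings[v].values()))
        num += s * (ratings[v][item] - mu_v)
        den += s
    return mu_u + num / den if den else mu_u
```

DDR not only reduces computational complexity and improves algorithm efficiency but also facilitates data visualization and understanding [18]. The basic principle and process of the DDR module are shown in Figure 3.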

Figure 3: The principle and basic flow of nonlinear data dimensionality reduction. (a) Linear dimensionality reduction, (b) nonlinear dimensionality reduction, and (c) nonlinear data dimensionality reduction process.

DDR converts HDD into low-dimensional data while preserving the important features and structural information of the original data; it selects representative features from the original feature set and constructs new features from them. First, the original HDD is preprocessed, an appropriate dimensionality reduction algorithm is chosen, and its parameters are set. Next, using the selected or extracted features, the HDD is mapped into a low-dimensional space. Finally, machine learning algorithms can be applied to the reduced data, or its internal structure can be visualized directly.
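As a minimal sketch of this flow, the snippet below chains normalization, a PCA pre-reduction, and a t-SNE mapping on scikit-learn's bundled digits data, which stands in for the platform's HDD; the platform's own implementation may differ in detail.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler

# Preprocess: scale features to [0, 1].
X, y = load_digits(return_X_y=True)
X = MinMaxScaler().fit_transform(X)

# Reduce: linear pre-reduction with PCA, then a nonlinear 2D mapping.
X_pca = PCA(n_components=30).fit_transform(X)
Y = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_pca)

# Visualize the reduced data directly.
plt.scatter(Y[:, 0], Y[:, 1], c=y, s=5)
plt.title("PCA + t-SNE view of the digits data")
plt.show()
```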

3.2 Implementation of NLDR-based LSUAFD visualization platform

In the data-driven modern society, LSUAFD provides researchers with a rich source of information, but it also brings complexity to processing and parsing. Data visualization as an intuitive explanatory tool has enormous potential and challenges in processing such complex data. Therefore, in this study, the LSUAFD visualization platform combines three different dimensionality reduction algorithms: PCA, t-SNE, and LargeVis. Among them, PCA is used for its efficiency in processing large amounts of data, while t-SNE and LargeVis are advanced NLDR algorithms that can accurately reveal the inherent correlations in HDD while maintaining the data structure [19]. Through the comprehensive application and optimization of these three algorithms, different application scenarios and analysis needs can be satisfied. The basic PCA principle is shown in Figure 4.

Figure 4: Basic principles of PCA.

PCA is a widely used linear dimensionality reduction technique. It maps the original data into a new coordinate system through a coordinate transformation, reducing the data dimension while preserving the key features of the data. The reconstruction error between the original data points $x_i$ and the dimensionality-reduced points $\hat{x}_i$ is expressed in Eq. (7).

(7) $\sum_{i=1}^{m} \left\| \sum_{j=1}^{d'} z_{ij} w_j - x_i \right\|_2^2 = \sum_{i=1}^{m} z_i^{\mathrm{T}} z_i - 2 \sum_{i=1}^{m} z_i^{\mathrm{T}} W^{\mathrm{T}} x_i + \mathrm{const} \propto -\mathrm{tr}\!\left( W^{\mathrm{T}} \left( \sum_{i=1}^{m} x_i x_i^{\mathrm{T}} \right) W \right)$

In Eq. (7), $w_j$ denotes a standard orthonormal basis vector. If some coordinates of the new coordinate system are discarded, reducing the dimension from $d$ to $d'$, the projection of data point $x_i$ in the low-dimensional space is $z_i$, where $z_{ij} = w_j^{\mathrm{T}} x_i$ is the $j$th coordinate of $x_i$ in that space. From the maximum-separability view, maximizing the variance of the projected data points yields Eq. (8).

(8) $\max_{W} \ \mathrm{tr}\!\left( W^{\mathrm{T}} \left( \sum_{i=1}^{m} x_i x_i^{\mathrm{T}} \right) W \right) \quad \text{s.t.} \quad W^{\mathrm{T}} W = I$

In Eq. (8), $W^{\mathrm{T}} x_i$ represents the projection of data point $x_i$ into the low-dimensional space; the problem is solved using the Lagrange multiplier method and eigenvalue decomposition. PCA discards the eigenvectors corresponding to the smallest eigenvalues to achieve the reduction, which also has a denoising effect to some extent.
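A compact numpy sketch of this eigendecomposition route (centering, scatter matrix, top-$d$ eigenvectors) is shown below; it is a didactic restatement of Eqs. (7) and (8), not the platform's production code.

```python
import numpy as np


def pca_project(X, d):
    """Solve Eq. (8): center the data, eigendecompose the scatter matrix,
    keep the top-d eigenvectors W, and return the projections z_i = W^T x_i."""
    Xc = X - X.mean(axis=0)
    scatter = Xc.T @ Xc                      # sum_i x_i x_i^T (after centering)
    vals, vecs = np.linalg.eigh(scatter)     # eigenvalues in ascending order
    W = vecs[:, np.argsort(vals)[::-1][:d]]  # orthonormal basis, W^T W = I
    return Xc @ W


Z = pca_project(np.random.rand(100, 10), 2)
print(Z.shape)  # (100, 2)
```

The t-SNE algorithm is an NLDR algorithm suited to reducing HDD to 2D or 3D for visual display [20]. The basic principle of the t-SNE algorithm is shown in Figure 5.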

Figure 5: Basic principles of the t-SNE method.

The t-SNE algorithm constructs a probability distribution over pairs of high-dimensional data points in which similar points receive high probability and dissimilar points receive low probability. The calculation formula is shown in Eq. (9).

(9) $p_{j|i} = \dfrac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$

In Eq. (9), $p_{j|i}$ is the conditional probability that $x_i$ selects $x_j$ as a neighbor; the bandwidth $\sigma_i$ is set from the perplexity, the algorithm's main parameter. The symmetrized joint probability assigned to a pair of points is given in Eq. (10).

(10) $p_{ij} = \dfrac{p_{j|i} + p_{i|j}}{2n}$

In Eq. (10), $p_{ij}$ is the symmetrized joint probability and $n$ is the number of data points. In the low-dimensional space, t-SNE measures the similarity between the mapped points $y_i, y_j \in \mathbb{R}^{d'}$ as shown in Eq. (11).

(11) $q_{ij} = \dfrac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}$

In Eq. (11), $q_{ij}$ represents the similarity between $y_i$ and $y_j$, the low-dimensional mappings of the corresponding data points. The agreement between the high- and low-dimensional distributions is measured by the KL divergence, whose calculation formula is shown in Eq. (12).

(12) $C = \mathrm{KL}(P \,\|\, Q) = \sum_{i} \sum_{j} p_{ij} \log \dfrac{p_{ij}}{q_{ij}}$

In Eq. (12), $C$ measures the mismatch between the two distributions; t-SNE minimizes this KL divergence through gradient descent.
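For reference, the gradient that this descent follows can be written down directly from Eqs. (11) and (12); the numpy sketch below computes it in the standard graph-Laplacian form and is illustrative only.

```python
import numpy as np


def tsne_gradient(P, Y):
    """Gradient of the KL cost in Eq. (12) w.r.t. the embedding Y, using
    the Student-t similarities q_ij of Eq. (11)."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + d2)
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()                            # q_ij, Eq. (11)
    M = (P - Q) * inv                              # (p_ij - q_ij)(1 + d_ij^2)^-1
    return 4.0 * (np.diag(M.sum(axis=1)) - M) @ Y  # Laplacian form of the gradient


# One gradient-descent step on the embedding:
# Y -= learning_rate * tsne_gradient(P, Y)
```

Turning to LargeVis, the probability of observing an edge $e_{ij}$ between points in the low-dimensional space is modeled as shown in Eq. (13).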

(13) $P(e_{ij} = 1) = f(\|\hat{y}_i - \hat{y}_j\|)$

In Eq. (13), $f(x) = \dfrac{1}{1 + \exp(x^2)}$ is the probability function applied across the entire dataset. The calculation formula for the optimization objective is shown in Eq. (14).

(14) $O = \sum_{(i,j) \in E} w_{ij} \log p(e_{ij} = 1) + \sum_{(i,j) \notin E} \gamma \log\left(1 - p(e_{ij} = 1)\right)$

In Eq. (14), $w_{ij}$ is the weight of edge $(i,j)$ and $\gamma$ is the weight of the negative-sample edges. With this formulation, the LargeVis algorithm achieves better time complexity than t-SNE, providing an effective dimensionality reduction method for the visualization of large-scale HDD.
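The sketch below evaluates this objective under stated assumptions: edges arrive as (i, j) index pairs with weights, the probability function is the $f$ of Eq. (13), and negative edges are drawn by uniform sampling, mirroring the negative-sampling strategy LargeVis uses; the log-likelihood form follows the original LargeVis formulation.

```python
import numpy as np


def largevis_objective(edges, weights, Y, gamma=7.0, n_neg=5, seed=0):
    """Eq. (14): log-likelihood of observed edges plus gamma-weighted
    log-probabilities of sampled negative edges."""
    rng = np.random.default_rng(seed)
    edge_set = set(map(tuple, edges))

    def f(i, j):  # edge probability, Eq. (13)
        return 1.0 / (1.0 + np.exp(np.sum((Y[i] - Y[j]) ** 2)))

    obj = 0.0
    for (i, j), w in zip(edges, weights):
        obj += w * np.log(f(i, j))                # attraction: observed edges
        for k in rng.integers(0, len(Y), n_neg):  # repulsion: negative samples
            k = int(k)
            if k != i and (i, k) not in edge_set:
                obj += gamma * np.log(1.0 - f(i, k))
    return obj
```

The process of the DDR module based on the three dimensionality reduction algorithms is shown in Figure 6.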

Figure 6: Data dimensionality reduction module process.

The DDR module process shown in Figure 6 mainly consists of three parts: algorithm selection, parameter adjustment, and visualization. PCA, t-SNE, and LargeVis algorithms are used for dimensionality reduction of HDD.

4 Testing of LSUAFD visualization platform based on NLDR method

In the era of HDD, providing a reliable and effective data visualization platform has become crucial. To verify the performance, accuracy, and reliability of the platform, module functional testing and DDR module testing are conducted. The platform's performance and stability in practical applications are verified to explore the impact of NLDR technology on visualization quality and to analyze the internal relationships of the data, driving further optimization of the NLDR method and facilitating deep analysis of large-scale user data visualization.

This study utilizes multiple publicly available datasets, including the MNIST handwritten digit dataset, the iris dataset from the UCI machine learning library, and the food nutrition component dataset on Kaggle. The MNIST dataset contains approximately 70,000 28 × 28 pixel handwritten digit images, suitable for dimensionality reduction and visualization research. The Iris dataset contains 150 samples with features including sepal length, sepal width, petal length, and petal width, suitable for testing classification and clustering algorithms. The food nutrition dataset covers detailed nutritional information of nearly 2,000 types of food, suitable for analyzing and comparing the characteristics of different foods.

Before applying NLDR methods, all data undergo strict preprocessing. This includes data cleaning to identify missing and outlier values and ensure data integrity, with mean imputation applied to missing values to maintain the continuity of the dataset. Normalization is applied to all features to eliminate the influence of differing feature scales, mapping feature values to the [0, 1] interval. In addition, feature selection based on correlation analysis removes redundant features and reduces computational complexity. In the Iris dataset, the two most representative features are selected for subsequent analysis based on the correlations between petal and sepal measurements.
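A hedged pandas/scikit-learn sketch of these steps (mean imputation, [0, 1] scaling, correlation-based pruning) follows; it assumes an all-numeric DataFrame, and the 0.9 correlation threshold is illustrative rather than prescribed by the study.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def preprocess(df, corr_threshold=0.9):
    """Mean-impute missing values, scale features to [0, 1], then drop one
    feature of every highly correlated pair (threshold is illustrative)."""
    df = df.fillna(df.mean(numeric_only=True))
    scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
    corr = scaled.corr().abs()
    drop = [col for i, col in enumerate(corr.columns)
            if (corr.iloc[:i, i] > corr_threshold).any()]
    return scaled.drop(columns=drop)
```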

4.1 Module functional testing

The experimental data need to be preprocessed. KNN interpolation handles missing values during data cleaning, and outliers are identified and corrected with the IQR method. Z-score normalization then scales the feature values to the [−1, 1] interval. Next, based on a variance threshold (retaining features with variance > 0.5) and correlation coefficient analysis (eliminating redundant features with |r| > 0.9), the MNIST dataset is reduced from 784 to 216 dimensions and the protein dataset from 103 to 68. The parameters of the t-SNE algorithm are determined by grid search: perplexity in [15, 50], learning rate in [100, 500], and iterations in [500, 2,000]. After cross-validation, the optimal parameter combination for the MNIST dataset is perplexity = 251, learning rate = 0.5, and iter = 1,215, at which point the KL divergence is stable at 0.82 ± 0.05. The LargeVis algorithm uses an adaptive learning rate strategy, with the initial learning rate set to 10 and decayed by 5% every 500 iterations, ultimately achieving a dimensionality reduction difference of 0.12 on the protein dataset.
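The grid search described above can be sketched with scikit-learn's TSNE as a stand-in, scoring each candidate by the final KL divergence (exposed as kl_divergence_); the grid values below are sampled from the stated ranges rather than exhaustive.

```python
import itertools

import numpy as np
from sklearn.manifold import TSNE


def tune_tsne(X):
    """Return (KL divergence, parameters) of the best grid candidate."""
    best_kl, best_params = np.inf, None
    grid = itertools.product([15, 30, 50],        # perplexity
                             [100, 300, 500],     # learning rate
                             [500, 1000, 2000])   # iterations
    for perp, lr, iters in grid:
        model = TSNE(n_components=2, perplexity=perp, learning_rate=lr,
                     n_iter=iters,                # max_iter in scikit-learn >= 1.5
                     method="barnes_hut", random_state=0)
        model.fit_transform(X)
        if model.kl_divergence_ < best_kl:
            best_kl, best_params = model.kl_divergence_, (perp, lr, iters)
    return best_kl, best_params
```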

To verify the module functionality of the LSUAFD visualization platform designed in this study, three specific datasets are used for testing and validation. The first is the MNIST handwritten digit database in CSV format, containing 784 features and 1 label, with 10,000 data items in total. The second is the protein structure and function set, with 103 features and 14 labels, totaling 1,500 data items. The third is the food nutrition diversity set, with 14 features and 1 label, totaling 7,637 items. Each dataset is used to demonstrate the platform's ability to perform NLDR. The t-SNE algorithm is run with a perplexity of 30, 1,000 iterations, and a learning rate of 200 to balance local and global information; given t-SNE's $O(N^2)$ computational complexity, the Barnes-Hut algorithm is used for approximate calculation, significantly improving processing efficiency. The LargeVis algorithm is configured with 15 nearest neighbors (K), a 0.1 learning rate, and 1,000 iterations (num_iter). These parameters are optimized to ensure dimensionality reduction quality and improve computational efficiency; LargeVis's $O(N \log N)$ complexity makes it suitable for large-scale datasets. The testing environment used in the experiment is shown in Table 1.

Table 1

Test environment hardware and software configuration parameters

Hardware configuration
CPU: Intel Core(TM) i5-4210U @ 1.70 GHz (up to 2.40 GHz)
Memory: 4 GB RAM

Software configuration
Operating system: Windows 10 (64-bit)
Browser: Chrome 60.0.3112.113
Database: MySQL 5.7.17
Python: Python 3.5.2

In Table 1, the test machine has an Intel Core i5-4210U CPU and 4 GB of RAM, running 64-bit Windows 10, with Chrome 60.0.3112.113 as the browser, MySQL 5.7.17 as the database, and Python 3.5.2. The role of these components in executing the software and data processing tasks is discussed further below. The test results of the data density distribution function are shown in Figure 7.

Figure 7: Data density distribution functional testing.

In Figure 7, there is a clear variation pattern between information carrying capacity and data density. In the 0–10 and 60–90 intervals of information carrying capacity, the data density is very high, exceeding the 1,000 threshold in both. However, when the information carrying capacity is between 10 and 50, the data density falls below that of the other intervals, staying under 400. This volatility of data density highlights the impact of information carrying capacity on data generation. The results of box-plot distribution functional analysis tests on odd-numbered datasets, such as the MNIST handwritten digit database, and even-numbered datasets, such as the protein structure and function dataset, are shown in Figure 8.

Figure 8: Box-plot distribution functional analysis tests on different datasets. (a) Odd-numbered datasets and (b) even-numbered datasets.

In Figure 8(a), the platform performs well on the odd-numbered datasets and accurately identifies their outliers. The errors are mainly distributed between −0.4 and 0.4, indicating that the interpretation of the distribution is quite accurate. In Figure 8(b), for the even-numbered datasets, the platform's analytical capability is relatively weak: the error distribution is uneven, with errors mainly between −0.3 and 0.5. User satisfaction with the visualization platform is shown in Figure 9.

Figure 9: User satisfaction with the visualization platform.

In Figure 9, there is a clear correlation between users' increasing use of the visualization platform and their satisfaction. As usage grows, individuals with a positive satisfaction index remain in the majority, while the number of users with a negative satisfaction index gradually decreases. Among the specific satisfaction indices, satisfied users generally score 0.29, while dissatisfied users mostly score −0.32; problems encountered during use, and functions that are difficult to understand or operate, account for the lower satisfaction. These results reflect the relationship between the platform's effectiveness and user satisfaction, providing valuable information for future improvement and optimization.

4.2 DDR module testing

In the LSUAFD visualization platform, the DDR module implements the PCA, t-SNE, and LargeVis algorithms. To verify the platform's dimensionality reduction capability, these methods are applied to the three datasets, and the reduced data are presented in a 2D view. The protein structure and function dataset is tested to confirm that the platform can reduce the dimensionality of diverse data types. The dimensionality reduction algorithms and data parameters are shown in Table 2.

Table 2

Dimensionality reduction algorithm and data parameters

Serial number, dataset, and per-algorithm parameters:

1. MNIST handwritten digits dataset. PCA: variance retention 21%; t-SNE: perp 251, learning rate 0.5, iter 1,215; LargeVis: learning rate 10, trees 55, neigh 120, others default.
2. Protein structure and function dataset. PCA: variance retention 18.2%; t-SNE: perp 15, learning rate 15, iter 1,258; LargeVis: learning rate 5, trees 35, neigh 90, others default.
3. Food nutrient content dataset. PCA: variance retention 99.1%; t-SNE: perp 25, learning rate 15, iter 356; LargeVis: learning rate 7, trees 40, neigh 70, others default.

In Table 2, the PCA algorithm demonstrates superior dimensionality reduction performance in food nutrition data, with a variance retention rate of 99.1%. However, the three DDR methods are not ideal for dimensionality reduction of protein datasets. The t-SNE algorithm and LargeVis algorithm highlight the non-linear properties of the data. The LargeVis algorithm significantly improves time efficiency. The dimensionality reduction results of the protein structure and function dataset are shown in Figure 10.

Figure 10: Dimensionality reduction of the protein structure and function dataset. (a) PCA algorithm, (b) t-SNE algorithm, and (c) LargeVis algorithm.

Figure 10 shows the effects of the three algorithms on the dimensionality reduction of the protein structure and function dataset. The horizontal axis is the normalized data distribution range, from 0 to 1. The vertical axis is the difference degree, which measures the structural difference between the reduced data and the original data; the smaller the value, the closer the reduction is to the original structure. The LargeVis algorithm performs best: its reduction is more complete and clear, with a difference of only 0.12, showing higher accuracy. Both PCA and t-SNE fall short. In Figure 10(a), the PCA reduction is incomplete, with a difference of 0.53; in Figure 10(b), the t-SNE result has missing regions, with a difference of 0.34. LargeVis's superior performance in dimensionality reduction, especially in maintaining data integrity and clarity, is therefore particularly evident. The results of the scatter matrix graph for the group and categories dimension data are shown in Figure 11.

Figure 11: Scatter matrix graphs of four-dimensional data. (a) Dimensionality reduction effect of the group dataset, (b) visualization of the group dataset, (c) visualization of the calcium dataset, and (d) dimensionality reduction effect of the calcium dataset.

In Figure 11, there is a small difference in dimensionality reduction and visualization effects between the group and categories dimensions of the data. In Figure 11(a) and (d), the group dataset shows a denser distribution density after dimensionality reduction, reaching 0.63. The scales dataset is relatively sparse after dimensionality reduction, but has rich information, with a distribution density of up to 0.78. In Figure 11(b) and (c), the visualization effect of the categories dataset is better than that of the group dataset, achieving a visualization accuracy of 89%. Therefore, the platform has good dimensionality reduction and visualization capabilities when processing different types of datasets. The performance evaluation comparison of dimensionality reduction algorithms is shown in Table 3.

Table 3

Comparison of performance evaluation of dimensionality reduction algorithms

Indicators/algorithms PCA t-SNE LargeVis PaCMAP [21] ISOMAP Algorithm proposed by the research
Accuracy (%) 74.32 81.76 94.87 96.62 84.51 97.38
F1 score 0.69 0.79 0.84 0.88 0.75 0.91
Efficiency (processing time, seconds) 118.24 297.16 92.45 76.34 235.68 63.81
Adaptability score (1–10) 7.3 5.9 8.8 9.1 8.5 9.3

As shown in Table 3, accuracy refers to the proportion of original label information retained after dimensionality reduction, reflecting structure preservation. The F1 score jointly measures the precision and recall of outlier detection; the higher the value, the better the detection. Efficiency is evaluated by the time (in seconds) required to process the MNIST dataset (10,000 samples). The adaptability score rates, on a scale of 1–10, the algorithm's generalization across data types (linear/nonlinear) and sample sizes (1–100k). The proposed algorithm achieves an accuracy of 97.38%, significantly higher than LargeVis's 94.87%, PaCMAP's 96.62%, ISOMAP's 84.51%, PCA's 74.32%, and t-SNE's 81.76%. In outlier detection, the proposed algorithm's F1 score of 0.91 leads PaCMAP's 0.88, LargeVis's 0.84, t-SNE's 0.79, ISOMAP's 0.75, and PCA's 0.69. In the efficiency evaluation, the proposed algorithm performs best with a processing time of 63.81 s, outperforming PaCMAP's 76.34 s, LargeVis's 92.45 s, PCA's 118.24 s, ISOMAP's 235.68 s, and t-SNE's 297.16 s. In adaptability, the proposed algorithm ranks first with 9.3, followed by PaCMAP with 9.1 and LargeVis with 8.8; ISOMAP, PCA, and t-SNE score 8.5, 7.3, and 5.9, respectively. These results indicate that the proposed algorithm excels on several key performance indicators, particularly accuracy, outlier detection, and processing efficiency. To verify the performance of UMAP in dimensionality reduction tasks, this study also compares UMAP experimentally with PCA, t-SNE, and LargeVis. Using the MNIST handwritten digit, protein structure and function, and food nutrition component datasets, the accuracy, running time, and completeness of the reduced data distribution of each algorithm are evaluated. The experimental results are shown in Table 4.

Table 4

Performance evaluation of multidimensional scale analysis algorithm

Algorithm Accuracy (%) F1 Score Runtime (seconds) Data distribution completeness
PCA 74.32 0.69 118.24 Moderate
t-SNE 81.76 0.79 297.16 Good
LargeVis 94.87 0.84 92.45 Excellent
UMAP 96.22 0.89 76.12 Excellent
Our algorithm 97.38 0.91 63.81 Outstanding

Table 4 shows that UMAP outperforms t-SNE in runtime and in the integrity of the reduced data distribution, performs similarly to LargeVis, and is slightly inferior to the optimized algorithm proposed in this study. Incorporating UMAP further validates the flexibility and diversity of the platform. In the comparative experiment, 30 users are recruited to perform data visualization tasks on different platforms. Based on user feedback and experimental data, several indicators are evaluated, as shown in Table 5.

Table 5

Data visualization tools performance comparison

Tool User-friendliness score Interactivity score Data processing time (seconds) Visualization quality score
Our platform 9.2 9.5 12.4 4.8
Tableau 8.5 8.3 15.6 4.5
Power BI 8.0 8.0 14.2 4.2
D3.js 7.5 7.2 20.1 4.0

Table 5 shows that the research platform outperforms other tools in terms of user friendliness and interactivity ratings, and significantly reduces data processing time, indicating its advantages in practical use. In addition, in terms of visualization quality, this research platform has also received high ratings, further supporting its effectiveness in complex data visualization. To more comprehensively evaluate the performance of the proposed algorithm and other dimensionality reduction methods, indicators such as recall, precision, and ROC-AUC on the validation set are calculated, as shown in Table 6.

Table 6

Performance evaluation metrics on the validation set

Algorithm Recall (%) Precision (%) ROC-AUC
Proposed algorithm 95.2 96.4 0.983
LargeVis 93.5 94.2 0.971
PaCMAP 94.0 95.0 0.975
ISOMAP 88.5 89.0 0.950
PCA 85.0 86.0 0.920
t-SNE 87.0 88.0 0.935

From Table 6, the proposed algorithm achieves a recall of 95.2% on the validation set, indicating high sensitivity in detecting true outliers. Its precision of 96.4% further illustrates the accuracy of its outlier identification, with a very low false positive rate, and its ROC-AUC of 0.983 highlights a near-perfect ability to distinguish normal from abnormal data. In contrast, LargeVis achieves a recall of 93.5%, a precision of 94.2%, and a ROC-AUC of 0.971, while PaCMAP achieves a recall of 94.0%, a precision of 95.0%, and a ROC-AUC of 0.975. ISOMAP, PCA, and t-SNE also perform reasonably, with recall and precision between 85.0 and 89.0% and ROC-AUC values between 0.920 and 0.950. The proposed algorithm therefore outperforms these methods on recall, precision, and ROC-AUC, showing superior anomaly detection and data discrimination.
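For reproducibility, these metrics can be computed directly from binary outlier labels and detector scores; the arrays below are illustrative placeholders, not the study's validation data.

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Illustrative labels and scores; in practice they come from the DDR
# module's outlier detection on the validation set.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.10, 0.35, 0.80, 0.92, 0.22, 0.67, 0.48, 0.73]
y_pred = [int(s > 0.5) for s in y_score]

print("recall   :", recall_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))
```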

In this study, based on the evaluation of the performance of the dimensionality reduction algorithm, the sensitivity of parameter selection to the results is further analyzed. In particular, the number of neighbors in the LargeVis algorithm and the learning rate in the t-SNE algorithm are discussed in depth. The experimental results show that the increase in the number of neighbors in the LargeVis algorithm helps to better capture the local structure of the data. However, when the number exceeds 100, the performance improvement is limited, and the calculation time increases significantly, indicating that there is an optimal value for the number of neighbors. For the t-SNE algorithm, the adjustment of the learning rate has a significant impact on the convergence speed of the algorithm and the stability of the results. A moderate learning rate (about 20) can strike a balance between ensuring the convergence speed of the algorithm and the accuracy of the results. In summary, parameter selection has a significant impact on the performance of the algorithm, and reasonable parameter settings are crucial to maintaining algorithm performance and computational efficiency. Therefore, in practical applications, the parameters need to be carefully adjusted and optimized according to the specific dataset and application scenario.

Overall, domain impact analysis reveals the potential applications of visualization platforms in multiple fields such as market analysis, bioinformatics, and social network analysis. In market analysis, the platform helps businesses identify market trends and consumer preferences and optimize market strategies by processing consumer behavior and sales data. In the field of bioinformatics, the platform’s dimensionality reduction display of gene expression and protein interaction networks has promoted a deeper understanding of disease mechanisms and accelerated the discovery of biomarkers. In terms of social network analysis, platforms can reveal the characteristics of network structure and information dissemination paths, providing support for identifying key influential nodes and optimizing information dissemination strategies. Interdisciplinary applications have further expanded the practicality of the platform, such as combining market and social network analysis to provide a more comprehensive marketing perspective. Technological advancements, including algorithm optimization and enhanced computing power, have provided possibilities for processing large-scale datasets, while driving research and practice in related fields.

5 Conclusion

In today's large-scale data environment, the rapid growth of information and features makes UAFD highly complex, and powerful data visualization tools are needed to understand its correlations and patterns. This study proposed a visualization study of LSUAFD based on the NLDR method to address this challenge, aiming to deepen the understanding of large-scale data processing and visualization and to provide a meaningful theoretical basis for the fields of data science and information visualization. Experiments showed that for even-numbered datasets the platform's analytical ability was relatively weak, with an uneven error distribution mainly between −0.3 and 0.5; not all datasets performed consistently on the platform, and the characteristics of specific datasets may affect analysis accuracy. The LargeVis algorithm stood out among the compared algorithms, with a more complete and clear dimensionality reduction effect and a difference of only 0.12, demonstrating high accuracy. Both PCA and t-SNE had shortcomings: PCA's reduction was incomplete, with a difference of 0.53, and t-SNE's results had missing regions, with a difference of 0.34. LargeVis was superior in dimensionality reduction, especially in maintaining data integrity and clarity. The resulting NLDR-based LSUAFD visualization platform is therefore highly interactive and user-friendly.

Although this study has made progress in building an NLDR visualization platform, it has limitations. First, the datasets all come from public repositories, which may introduce sample selection bias; for example, MNIST mainly contains American handwriting styles, which may affect the generalization ability of the model. Second, the results are based only on the MNIST, Iris, and food nutrition datasets, and generalizability to other data types, such as time series or text, has not been verified; since different data structures may yield different dimensionality reduction effects, future research should extend to more data types to evaluate the generalizability of the platform. In addition, this study compared the PCA, t-SNE, and LargeVis algorithms in depth but did not fully cover other emerging dimensionality reduction techniques such as UMAP, which may limit the comprehensiveness of the conclusions. The experimental verification mainly targets structured numerical data (MNIST images, protein features, and food nutrients); generalization to unstructured data such as text (e.g., IMDb comments) and medical images (e.g., MRI scans) still needs verification, because the high-dimensional sparsity of text data (vocabularies exceeding 100,000 terms) may defeat t-SNE's perplexity tuning, and the spatial correlation of medical images requires convolutional feature extraction.
Therefore, future research will explore the combination of domain-specific preprocessing (such as text vectorization, image feature engineering) with the platform algorithm to expand its scope of application.

  1. Funding information: The author states no funding is involved.

  2. Author contributions: The author has accepted responsibility for the entire content of this manuscript and approved its submission.

  3. Conflict of interest: The author states no conflict of interest.

  4. Data availability statement: All data generated or analyzed during this study are included in this published article.

References

[1] Zebari R, Abdulazeez A, Zeebaree D, Zebari D, Saeed J. A comprehensive review of dimensionality reduction techniques for feature selection and feature extraction. J Appl Sci Technol Trends. 2020;1(1):56–70. doi:10.38094/jastt1224.

[2] Jia W, Sun M, Lian J, Hou S. Feature dimensionality reduction: a review. Complex Intell Syst. 2022;8(3):2663–93.

[3] Fresca S, Manzoni A. POD-DL-ROM: Enhancing deep learning-based reduced order models for nonlinear parametrized PDEs by proper orthogonal decomposition. Comput Methods Appl Mech Eng. 2022;388:114181. doi:10.1016/j.cma.2021.114181.

[4] Jarman HK, Marques MD, McLean SA, Slater A, Paxton SJ. Motivations for social media use: Associations with social media engagement and body satisfaction and well-being among adolescents. J Youth Adolesc. 2021;50(12):2279–93. doi:10.1007/s10964-020-01390-z.

[5] Smarandache F. Plithogeny, plithogenic set, logic, probability and statistics: a short review. J Comput Cognit Eng. 2022;1(2):47–50. doi:10.47852/bonviewJCCE2202191.

[6] Jia W, Sun M, Lian J, Hou S. Feature dimensionality reduction: a review. Complex Intell Syst. 2022;8(3):2663–93. doi:10.1007/s40747-021-00637-x.

[7] Guo T, Yu K, Aloqaily M, Wan S. Constructing a prior-dependent graph for data clustering and dimension reduction in the edge of AIoT. Future Gener Comput Syst. 2022;128(5):381–94. doi:10.1016/j.future.2021.09.044.

[8] Uwaeze J, Narayana PA, Kamali A, Braverman V, Jacobs MA, Akhbardeh A. Automatic active lesion tracking in multiple sclerosis using unsupervised machine learning. Diagnostics. 2024;14(6):632–46. doi:10.3390/diagnostics14060632.

[9] Gunduz H. An efficient dimensionality reduction method using filter-based feature selection and variational autoencoders on Parkinson's disease classification. Biomed Signal Process Control. 2021;66(4):102452. doi:10.1016/j.bspc.2021.102452.

[10] Toups BS, Thomson RC, Brown JM. Complex models of sequence evolution improve fit, but not gene tree discordance, for tetrapod mitogenomes. Syst Biol. 2025;74(1):86–100. doi:10.1093/sysbio/syae056.

[11] Piñero J, Saüch J, Sanz F, Furlong LI. The DisGeNET cytoscape app: Exploring and visualizing disease genomics data. Comput Struct Biotechnol J. 2021;19(10):2960–7. doi:10.1016/j.csbj.2021.05.015.

[12] Sarker IH, Hoque MM, Uddin MK, Alsanoosy T. Mobile data science and intelligent apps: concepts, AI-based modeling and research directions. Mobile Netw Appl. 2021;26(10):285–303. doi:10.1007/s11036-020-01650-z.

[13] Szklarczyk D, Gable AL, Nastou KC, Lyon D, Kirsch R, Pyysalo S, et al. The STRING database in 2021: customizable protein–protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res. 2021;49(D1):D605–12. doi:10.1093/nar/gkaa1074.

[14] Choi NG, DiNitto DM, Marti CN, Choi BY. Telehealth use among older adults during COVID-19: Associations with sociodemographic and health characteristics, technology device ownership, and technology learning. J Appl Gerontol. 2022;41(3):600–9. doi:10.1177/07334648211047347.

[15] Fareed N, Swoboda CM, Jonnalagadda P, Huerta TR. Persistent digital divide in health-related internet use among cancer survivors: findings from the Health Information National Trends Survey, 2003–2018. J Cancer Surviv. 2021;15:87–98. doi:10.1007/s11764-020-00913-8.

[16] Touzé C, Vizzaccaro A, Thomas O. Model order reduction methods for geometrically nonlinear structures: a review of nonlinear techniques. Nonlinear Dyn. 2021;105(2):1141–90. doi:10.1007/s11071-021-06693-9.

[17] Sadiq MT, Yu X, Yuan Z. Exploiting dimensionality reduction and neural network techniques for the development of expert brain–computer interfaces. Expert Syst Appl. 2021;164(2):114031. doi:10.1016/j.eswa.2020.114031.

[18] Campos-Castillo C, Anthony D. Racial and ethnic differences in self-reported telehealth use during the COVID-19 pandemic: a secondary analysis of a US survey of internet users from late March. J Am Med Inform Assoc. 2021;28(1):119–25. doi:10.1093/jamia/ocaa221.

[19] Devassy BM, George S, Nussbaum P. Unsupervised clustering of hyperspectral paper data using t-SNE. J Imaging. 2020;6(5):29–41. doi:10.3390/jimaging6050029.

[20] Stockwell S, Stubbs B, Jackson SE, Fisher A, Yang L, Smith L. Internet use, social isolation and loneliness in older adults. Ageing Soc. 2021;41(12):2723–46. doi:10.1017/S0144686X20000550.

[21] Wang Y, Huang H, Rudin C, Shaposhnik Y. Understanding how dimension reduction tools work: an empirical approach to deciphering t-SNE, UMAP, TriMAP, and PaCMAP for data visualization. J Mach Learn Res. 2021;22(201):1–73.

Received: 2025-02-06
Revised: 2025-04-11
Accepted: 2025-04-27
Published Online: 2025-08-08

© 2025 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
