An Intelligent System for Identifying Influential Words in Real-Estate Classifieds

Sherief Abdallah

doi:10.1515/jisys-2016-0100

Article Open Access

Published/Copyright: November 12, 2016

From the journal Journal of Intelligent Systems Volume 27 Issue 2

Abstract

This paper focuses on the problem of quantifying how certain words in a text affect, positively or negatively, some numeric signal. These words can lead to important decisions for significant applications such as E-commerce. For example, consider the corpus of real-estate classifieds, which we developed as a case study. Each classified has a description of a real-estate property, along with simple features such as the location and the number of bedrooms. The problem then is to identify which keywords influence the price of the property. Such identification is complicated due to the existence of simple features (numeric and nominal attributes) that also affect the price. In this research, we propose a two-stage regression model to solve this problem. To assess our contribution, we analyze, as a case study, four corpora of real-estate classifieds. The analysis shows that our model predicts the price of a real-estate unit more accurately using the accompanying text, compared to the prediction relying only on simple features. We also demonstrate the capability of our model to annotate (automatically) words that affect the price positively or negatively.

Keywords: Text mining; data mining; real estate; regression; predictive analytics

MSC 2010: 68T05 Learning and adaptive systems; 68U15 Text processing

1 Introduction

Regression is an important data-mining task, where the goal is to predict some numeric value associated with an object using the object's features (also called attributes). The numeric value to be predicted is usually called the target attribute. For example, we may wish to predict the price of a real-estate property based on the property's features, such as the number of bedrooms, the number of bathrooms, and its location. We may also wish to predict the probability that a patient would suffer from a certain complication based on the patient’s features, such as age, occupation, and the occurrence of the disease in the family.

The majority of the previous work that studied the regression task focused on simple data, where all the attributes are numeric [9, 19, 33]. This paper focuses on regression tasks where the data has a mixture of simple attributes (either numeric or nominal) and textual attributes. We are particularly interested in the problem of quantifying how certain words affect, positively or negatively, the regression task. For example, consider the corpus of real-estate classifieds, which we use as our case study. In this corpus, an object represents a real-estate property and has a mix of simple features (such as “number_of_bedrooms”) and textual features (such as “title” and “description”). The problem then is to identify which keywords influence the price of the property. As another example, consider a dataset of patient data, where an object represents a patient and again has a mix of simple features (such as “age” and “gender”) and textual features (such as “physician report”). The goal in this case is to identify which keywords in the “physician report” influence the risk of some disease.

In order to identify which keywords affect the regression positively or negatively, we need to be able to isolate the effect of the textual features from the effect of the remaining simple features. We propose in this paper a two-stage regression model. The first stage of our model uses only simple features of an object to make an initial prediction of the target attribute. The second stage uses the textual features of the same object to refine the initial prediction. For example, consider the real-estate classifieds data where the goal is to predict the price of a real-estate unit. The initial stage would learn to predict the price using only the simple features such as the size of the unit and its location. The second stage will learn to refine the initial prediction using features from the classified text. The decomposition of the regression into two (sequential) stages achieves the following benefits.

Different regression model for each stage: This is important because while linear regression (LR) is convenient to use for the second stage to highlight which keywords are (not) important, LR is not necessarily the best regression model for Stage 1 (which uses simple features).

Intuitive semantics: With such decomposition, the semantics of the LR weights for the second stage are clear: explain, using textual features, why the actual value of the target attribute for a given object is lower than (or larger than) the expected value, as predicted by the regression model in Stage 1 (which relied only on simple attributes).

We conducted the analysis on four corpora of real-estate classifieds. ^[1] We show that our proposed approach improves the prediction of a property price, compared to simple features alone. We also illustrate how the two-stage model can be used to identify keywords that affect the price negatively or positively. Table 1 shows an example classified that was automatically highlighted (annotated) using our proposed system. For instance, the words and represent two neighborhoods (in Dubai, United Arab Emirates), with the being the much more prime location. Also, the words are examples of words that affect the price positively as they reflect desirable features.

Table 1:

Sample Classified with Important Words Highlighted.

To summarize, the contributions of this paper are as follows:

Proposing a two-stage regression model that exploits the textual features to improve the prediction when the data contain a mix of simple and textual attributes.
Showing how our proposed model can be used to annotate and highlight keywords that affect the prediction positively or negatively (therefore producing the first system for automatically recognizing and highlighting important textual terms in real-estate classifieds).
Collecting four corpora of real-estate classifieds and using the four datasets to evaluate our proposed model against three different regression models.

The rest of the paper is organized as follows. The next section explains our two-stage regression model and illustrates how it can be used for identifying important keywords. We then describe the domain we use as our case study, real-estate classifieds. This is followed by the evaluation and analysis of our proposed approach using the collected data. We then discuss the related work and conclude our paper.

2 Two-Stage Regression Model with Text Mining

Figure 1 illustrates the process we propose to analyze mixed data. The main idea is to use a two-stage regression model. The first stage (Stage 1 regression model in the figure) attempts to predict the target attribute using only simple features. The second stage attempts to predict the remaining difference between the actual value of the target attribute and the initial prediction from Stage 1 using textual features.

Figure 1:

The Data Mining Process We Used to Build the Two-Stage Regression Model.

The first stage attempts to predict the target attribute using only the simple features. The second stage attempts to predict the difference between the actual value of the target attribute and the initial prediction from Stage 1.

The following subsections describe the different components of our proposed approach in further detail. ^[2]

2.1 Stage 1: Regression Using Simple Features

The first stage in our model uses simple features to predict the target attribute. The first step is preprocessing the data, which is dependent on the particular dataset. We describe the preprocessing we used for the real-estate data in Section 3. The second step of Stage 1 is building a regression model using the simple features. We have considered three regression models in our evaluation that are commonly used: LR, artificial neural networks (ANNs) [13], and support vector machine regression (SVMR) [26]. We will describe LR in further detail when we explain Stage 2 in a following section, because it is important to understand how we highlight important keywords. However, for the other two regression models, we refer the interested reader to the corresponding citations.

2.2 Regression Decomposition

Before building the model for Stage 2, the predicted target attribute in Stage 1 is calculated using the fitted model of Stage 1. The difference between the actual value of the target attribute and the value predicted by Stage 1 is calculated. This difference becomes the target attribute for Stage 2.
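
The decomposition step can be sketched in a few lines of Python. This is only an illustration: the function name `stage2_targets` and the sample prices are assumptions made here, not part of the original system.

```python
def stage2_targets(actual, stage1_predicted):
    """Return the residuals (actual minus Stage 1 prediction), one per object."""
    if len(actual) != len(stage1_predicted):
        raise ValueError("actual and stage1_predicted must have the same length")
    return [a - p for a, p in zip(actual, stage1_predicted)]


# Hypothetical prices (AED) and Stage 1 predictions for three classifieds.
actual_prices = [80000, 120000, 95000]
stage1_prices = [85000, 110000, 95000]
residuals = stage2_targets(actual_prices, stage1_prices)
```

A negative residual means Stage 1 overestimated the price, so Stage 2 has to explain a downward correction from the text; a positive residual means the opposite.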

2.3 Stage 2: Using Text Mining to Improve Prediction

The purpose of applying text mining in our model is to discover the effect of the hidden information in the textual features. The first step in Stage 2 is tokenizing the text, that is, splitting it into a sequence of tokens (words). Tokens that are less than four characters in length are removed (such as “a” and “is”). Then, stop-word tokens are removed using a list of English stop-words. A stop-word is a word that has little distinguishing power, such as “they” and “from”. This is followed by the generation of n-Gram terms. An n-Gram term is a series of n successive tokens. For example, the set of 1-Gram terms is the set of the original tokens. The set of 2-Gram terms contains sequences of two tokens, and so forth.
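
A minimal sketch of these preprocessing steps is given below. It makes several assumptions that the text does not specify: tokens are taken to be runs of ASCII letters, the stop-word list is a tiny illustrative one (the study uses the list shipped with RapidMiner), and n-Gram terms are joined with an underscore. The names `tokenize`, `ngrams`, and `STOP_WORDS` are hypothetical.

```python
import re

# Tiny illustrative stop-word list (assumption); not the list used in the study.
STOP_WORDS = {"they", "from", "with", "this", "that", "have"}


def tokenize(text, stop_words=STOP_WORDS, min_length=4):
    """Lowercase, split on non-letters, then drop short tokens and stop-words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_length and t not in stop_words]


def ngrams(tokens, n):
    """Return every run of n successive tokens, joined by an underscore."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("Stunning villa with a golf view, far from the main road")
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
```

Because n-Gram terms are generated after the filtering steps, following the order described above, a 2-Gram may join two words that were separated by a removed token in the original text.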

To avoid terms that are too rare (possibly outliers) or too common (and hence of little help in distinguishing objects), all terms that appear in less than T_min or more than T_max of all objects are removed. The final step is computing the term frequency-inverse document frequency (TF-IDF) [34] for each term (in each object). The TF-IDF counts how many times a particular term (n-Gram) i appears in the text of object j, which is then inversely weighted by how common the term is across different objects:

$$\text{TF-IDF}(i, j) = \text{TF}(i, j) \cdot \text{IDF}(i),$$

where $\text{TF}(i, j) = \frac{F(i, j)}{|F|}$, $F(i, j)$ is the number of occurrences of term $i$ in object $j$ (which is a classified in our case study), $|F|$ is the length of the $F$ vector ($F(i, j)\ \forall i$), $\text{IDF}(i) = \log \frac{N}{N_i}$, $N$ is the number of objects (classifieds), and $N_i$ is the number of objects that contain term $i$.
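
The definition above can be sketched as follows. Several choices here are assumptions, since the text leaves them open: |F| is read as the Euclidean length of the count vector, the logarithm is natural, T_min and T_max are given as fractions of all objects, and the frequency filter is applied before normalization. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents, t_min=0.0, t_max=1.0):
    """documents: list of term lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # N_i: number of documents that contain term i at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    # Keep terms whose share of documents lies within [t_min, t_max].
    kept = {t for t, n_i in doc_freq.items() if t_min <= n_i / n_docs <= t_max}

    vectors = []
    for doc in documents:
        counts = Counter(t for t in doc if t in kept)          # F(i, j)
        norm = math.sqrt(sum(c * c for c in counts.values()))  # |F| (assumed Euclidean)
        vectors.append({
            t: (c / norm) * math.log(n_docs / doc_freq[t])     # TF(i, j) * IDF(i)
            for t, c in counts.items()
        })
    return vectors


docs = [["palm", "view"], ["palm", "deal"], ["palm", "view", "view"]]
vectors = tf_idf(docs)
filtered = tf_idf(docs, t_max=0.9)
```

In this toy corpus the term "palm" occurs in every document, so its IDF is log(1) = 0 and it carries no weight; with `t_max=0.9` it is dropped altogether, which is the purpose of the T_max threshold.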

By the end of this step, the text of each object is converted to a numerical vector representation. Each feature in the vector corresponds to a term, where the value is the TF-IDF of the term. The features are then filtered to reduce their numbers. We removed features that are highly correlated with one another (keeping only one feature from each set of correlated features). Two features are considered correlated if the correlation between them exceeds 0.99. The final step of Stage 2 is building the regression model. Here, we use the LR model because it assigns a weight for each term. LR follows the simple model for predicting the target attribute of an object j:

$$\text{Predicted value}(j) = \sum_i w_i \, a_{ij},$$

where $w_i$ is the weight corresponding to feature $i$ and $a_{ij}$ is the value of feature $i$ for object $j$. The learning algorithm finds the best values for the weights that minimize the (squared) error (between the actual value of the target attribute and the predicted value). The weight, therefore, reflects the feature’s contribution to the prediction. For example, when a feature has a positive weight, it means that the feature works toward increasing the predicted value. Similarly, having a feature with negative weight has the effect of decreasing the predicted value. We show in the following section how this intuition helps in identifying important keywords.
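
To make the role of the weights concrete, the sketch below fits the weights of the model above by ordinary least squares, solving the normal equations with Gauss-Jordan elimination. It is an illustration under stated assumptions rather than the implementation used in the study (which relies on RapidMiner): there is no intercept term, matching the formula above, and the function name `fit_linear_weights` and the sample data are hypothetical.

```python
def fit_linear_weights(rows, targets):
    """Least-squares weights w for targets[j] ~ sum_i w[i] * rows[j][i] (no intercept)."""
    n = len(rows[0])
    # Normal equations (A^T A) w = A^T r, stored as an augmented matrix.
    aug = [
        [sum(row[p] * row[q] for row in rows) for q in range(n)]
        + [sum(row[p] * t for row, t in zip(rows, targets))]
        for p in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Hypothetical feature values for two terms and the Stage 2 targets (residuals).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
residuals = [5000.0, -2000.0, 3000.0]
weights = fit_linear_weights(features, residuals)
```

Here the first term receives a positive weight and the second a negative one. The `ValueError` branch also shows why near-duplicate features are filtered out beforehand: perfectly correlated columns make the normal equations singular, so the weights are no longer uniquely determined.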

2.4 Applying the Model

Figure 2 illustrates how the generated two-stage regression model can be used to predict the value of the target attribute. Each stage predicts a component of the value, and the two components are summed to generate the final prediction.

Figure 2:

The Data Mining Process to Apply the Two-Stage Regression Model.

Each stage predicts a component of the value, and the two components are summed to generate the final prediction.
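
Applying the fitted model can be sketched as below, assuming the Stage 1 prediction has already been computed by whichever regression model was chosen for that stage. The function name `predict_two_stage`, the weights, and the TF-IDF values are hypothetical.

```python
def predict_two_stage(stage1_prediction, tfidf_vector, stage2_weights):
    """Sum the Stage 1 prediction and the Stage 2 (text-based) correction."""
    text_component = sum(
        stage2_weights.get(term, 0.0) * value for term, value in tfidf_vector.items()
    )
    return stage1_prediction + text_component


# Hypothetical Stage 2 weights and TF-IDF values for one classified.
weights = {"palm": 40000.0, "deal": -15000.0}
vector = {"palm": 0.5, "deal": 0.2, "tower": 0.1}
price = predict_two_stage(90000.0, vector, weights)
```

Terms without a learned weight (such as "tower" here) contribute nothing, so the final prediction differs from the Stage 1 value only through words the Stage 2 model found informative.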

As we also show in Table 1, the weights of the LR model of Stage 2 can be used to highlight important keywords. Each word is assigned three color components: red, green, and blue. By default, all color components are assigned the value of zero (black color). If a word has a non-zero weight in the LR model of Stage 2, then the word color is changed as follows. If the weight is positive, then the weight is assigned to the blue color component. If the weight is negative, then the absolute value of the weight is assigned to the red color component.
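
The coloring rule can be sketched as follows. The text does not say how a weight is mapped to a displayable color value, so the scaling used here (dividing by the largest absolute weight and mapping to the range 0 to 255) is an assumption, as is the function name `word_color`.

```python
def word_color(weight, max_abs_weight):
    """Map a Stage 2 weight to a (red, green, blue) triple with components in 0..255."""
    if weight == 0 or max_abs_weight == 0:
        return (0, 0, 0)  # default: black
    intensity = round(255 * min(abs(weight) / max_abs_weight, 1.0))
    if weight > 0:
        return (0, 0, intensity)  # positive weight goes to the blue component
    return (intensity, 0, 0)      # negative weight goes to the red component


# Hypothetical weights for three words of one classified.
weights = {"palm": 40000.0, "deal": -10000.0, "tower": 0.0}
largest = max(abs(w) for w in weights.values())
colors = {word: word_color(w, largest) for word, w in weights.items()}
```

With this scaling, the most influential word is drawn in the most saturated color, and words with small weights stay close to black.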

3 Case Study: Real-Estate Classifieds

Real-estate classifieds constitute an integral part of the real-estate market. A real-estate classified provides a concise description of a real-estate unit that is available either for rent or sale. Traditionally, such classifieds were publicized using printed materials such as newspapers or dedicated classifieds periodicals. The Internet revolutionized the classifieds business, as with other advertisement sectors. The large reduction in price, coupled with efficient and quick search service, removed the restriction on the size of classifieds (from a maximum of a few tens of words to practically unlimited textual description) and allowed the inclusion of multimedia images and videos. This revolution can be verified through the valuation of Zillow, one of the biggest websites that specializes in posting real-estate classifieds, at 50 billion USD. ^[3] Although it was feared at some point that such a new trend in advertising properties might threaten the profitability of traditional brokerage companies, many brokers have adapted their operation to exploit what the web has to offer and have integrated web technology into their process [7].

Nowadays, many large brokerage companies develop their own websites to list their properties, in addition to listing their properties on third-party web portals. In the United Arab Emirates, where the real-estate market constitutes 12.5% of the GDP, ^[4] several major website portals target real-estate classifieds. Examples include Gulf News Ads (www.gnads4u.com), Dubizzle (www.dubbizle.com), Bayut (www.bayut.com), and Property Finder (www.propertyfinder.ae). Furthermore, major real-estate brokering companies list classifieds on their own websites, such as Better Homes (www.bhomes.com) and Hamptons (www.hamptons.ae).

Despite the importance of real-estate classifieds, there has been little work that analyzed such data. In particular, most of the previous work (that applied data-mining techniques to real-estate data) focused on simple attributes such as number of bedrooms, area, location, etc. [12] (a broader view of related work is given in Section 5). For example, the Zillow website provides the Zestimate service, which attempts to automatically valuate a real-estate unit. The service relies on the history of previously sold houses, using only simple features [16].

However, for classifieds, the unstructured and ungrammatical ^[5] textual attributes (the classified title and description) are important components of a classified that should not be ignored, as they may encapsulate less common features (such as “remodeled house”, “near X hotel”). While many websites try to capture less common and ad hoc features in the form of a checklist that ad posters can mark, it is hard to account for all possible features in such a manner, particularly rare features. Also, some ad hoc features may appear suddenly over time (e.g. due to the opening of a new park or a hotel).

Two key problems are faced by humans posting real-estate classifieds:

What is the fair price of a real-estate unit that should be advertised in a classified?
How to write an effective title and description of a classified?

Answering these questions will be valuable for real-estate brokers and home owners who directly post classifieds. Predicting a more accurate price, for a real-estate unit, that takes into account the textual unstructured data (not just the structured data) will prevent the stakeholder from overestimating or underestimating the price. Furthermore, by identifying the important keywords, the stakeholder can refine the unit description and title to better reflect the price being asked.

We extracted our data from major websites that post online residential real-estate classifieds in the United Arab Emirates. The data was collected ^[6] over the period from 17th of September to 6th of October 2015, using our own web crawler. ^[7] The collected data contained both apartments (flats) and villas (houses) that are offered for either sale or rent. A total of 20,600 records were extracted. Table 2 illustrates the extracted features.

Table 2:

Features of the Real-Estate Classifieds Data Set.

Name	Type	Description
Title	Text	Title of the classified.
Description	Text	Description of the property.
Beds	Integer	Number of bedrooms in a property
Baths	Integer	Number of bathrooms in a property
Size	Integer	Built area of the property (in square feet)
Location	Nominal	Each nominal value represents a neighborhood
Price	Integer	The renting or selling price of the property

The extracted data was then cleaned as follows. Duplicate classifieds (identical title and description) were removed, leaving 19,644 records. Then, outliers were removed as follows. First, the average rent/price per bedroom was calculated by dividing Price by Beds+1 (the added one was needed for studio apartments, which have Beds=0). Records with an unreasonably low or high price were removed according to the following conditions. For the rental properties, we only accepted the properties that satisfy the condition 100,000≥Average rent≥10,000. ^[8] The properties that are up for sale needed to satisfy the condition 1,000,000≥Average price≥100,000. This resulted in a further reduction in the number of records to 16,008, divided as follows:

Apartments for rent: 8192 records;
Houses for rent: 2105 records;
Apartments for sale: 4599 records;
Houses for sale: 1112 records.
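
The cleaning rules described above can be sketched as follows. Records are represented here as dictionaries with the hypothetical keys `title`, `description`, `price`, and `beds`; the function names are also assumptions made for illustration.

```python
def is_reasonable(price, beds, for_rent):
    """Check the per-bedroom price bounds for rental or sale listings."""
    average = price / (beds + 1)  # beds == 0 denotes a studio apartment
    low, high = (10_000, 100_000) if for_rent else (100_000, 1_000_000)
    return low <= average <= high


def clean(records, for_rent):
    """Drop duplicate (title, description) pairs, then drop price outliers."""
    seen, kept = set(), []
    for rec in records:
        key = (rec["title"], rec["description"])
        if key in seen:
            continue
        seen.add(key)
        if is_reasonable(rec["price"], rec["beds"], for_rent):
            kept.append(rec)
    return kept


# Hypothetical rental listings.
records = [
    {"title": "Studio in Marina", "description": "sea view", "price": 45000, "beds": 0},
    {"title": "Studio in Marina", "description": "sea view", "price": 45000, "beds": 0},
    {"title": "Two bedroom flat", "description": "cheap", "price": 15000, "beds": 2},
    {"title": "Villa", "description": "golf", "price": 250000, "beds": 4},
]
kept = clean(records, for_rent=True)
```

In this example the second record is dropped as a duplicate, and the third is dropped because its average rent per bedroom (5,000) falls below the lower bound.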

Then, the location attribute, which is nominal, was converted to numerical attributes. This was done by creating a set of binary attributes, one binary attribute for each nominal value of the location. For example, “Dubai Marina” is a possible value of the location attribute. After this step, we have a binary feature “loc_dubai_marina”, which equals 1 if and only if the location attribute equals “Dubai Marina”; otherwise, it equals 0. Finally, all textual features were converted to lowercase.
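
A minimal sketch of this conversion is shown below. The feature naming follows the “loc_dubai_marina” example above; the function names and the sample locations are hypothetical.

```python
def location_feature_name(location):
    """Turn a location such as 'Dubai Marina' into 'loc_dubai_marina'."""
    return "loc_" + "_".join(location.lower().split())


def one_hot_locations(locations):
    """Return one {binary_feature: 0 or 1} dict per record."""
    names = sorted({location_feature_name(loc) for loc in locations})
    return [
        {name: int(location_feature_name(loc) == name) for name in names}
        for loc in locations
    ]


rows = one_hot_locations(["Dubai Marina", "Downtown", "Dubai Marina"])
```

Each record ends up with exactly one binary location feature set to 1, so the nominal attribute can be used by regression models that accept only numeric inputs.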

4 Analysis

To evaluate our approach, we computed two performance metrics: the root mean squared error (RMSE), and the correlation between the predicted price and the actual price. These two measures were computed (for each of the four datasets) using 10-fold cross-validation to compare our two-stage regression model (which incorporated text mining) against the traditional (one-stage) regression model (which relied only on simple features). The comparison was done using three different regression models for the first stage, while LR was used for the second stage. For simplicity, only 1-Gram was used, with T_max=30% and T_min=3%. The list of English stop-words is provided by RapidMiner. ^[9]
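
The two metrics and the fold construction can be sketched as follows. This is an illustration only: the correlation is assumed to be the Pearson coefficient, and the folds are assumed to be contiguous and unshuffled, neither of which is specified in the text. All function names are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def correlation(actual, predicted):
    """Pearson correlation (returns 0.0 if either sequence is constant)."""
    n = len(actual)
    mean_a, mean_p = sum(actual) / n, sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_a == 0 or var_p == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_p)


def k_fold_indices(n, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds over n records."""
    for fold in range(k):
        test = list(range(fold * n // k, (fold + 1) * n // k))
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test


folds = list(k_fold_indices(25, k=10))
```

In each fold, both stages would be fitted on the training indices only and the metrics computed on the held-out indices, so that the Stage 2 weights never see the records they are evaluated on.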

Figure 3 summarizes our results for the four datasets and the three regression models of Stage 1. In all cases, the RMSE of the two-stage model is lower than the RMSE of the one-stage model. Furthermore, and except for the case of using LR in the Selling House data set, the confidence interval of the two-stage model does not include the mean RMSE of the one-stage model. In the remainder of this section, we discuss in more detail the results achieved when different regression models are used for Stage 1.

Figure 3:

Summary of Results for the Four Data Sets (Selling House, Selling Apartment, Renting House, Renting Apartment) and the Three Regression Models of Stage 1 (LR, ANN, and SVMR).

Our approach (two-stage model) achieved lower (better) RMSE across all data sets, regardless of the underlying regression model in Stage 1.

Tables 3 and 4 summarize the results when LR was used for the first stage. The RMSE is reduced and the correlation increased. It is worth noting that the RMSE is higher and the correlation is lower for houses than for apartments. This is due to the larger variety of houses even within the same neighborhood, in addition to the smaller number of house classifieds (compared to apartments). Furthermore, the RMSE for selling (an apartment or a house) is significantly higher than the RMSE for renting. The RMSE is sensitive to the scale of the target attribute we are trying to predict. The annual rent is much less than the selling price, hence the significant difference in RMSE, regardless of the dataset size. For example, a one-bedroom unit may have an annual rent of 80,000 AED while being sold for 1,000,000 AED or more. Nevertheless, RMSE is suitable for comparing the relative performance of different techniques on the same dataset (rather than across datasets), which is our goal.

Table 3:

RMSE When LR Was Used in Stage 1 of Our Two-Stage Regression Model.

Dataset	RMSE w/o text mining	RMSE with text mining
Renting apartment	36,737±1046	32,047.045±1352.944
Renting house	61,752±2814	57,687.191±3176.731
Selling apartment	465,525±8399	415,685.369±18,292.775
Selling house	735,377±29,075	727,541.031±78,869.986

Table 4:

Correlation (between Predicted and Actual Price) When LR Was Used in the First Stage of Our Two-Stage Regression Model.

Dataset	Corr. w/o text mining	Corr. with text mining
Renting apartment	0.838±0.012	0.881±0.008
Renting house	0.845±0.021	0.871±0.017
Selling apartment	0.856±0.013	0.889±0.011
Selling house	0.809±0.038	0.817±0.051

Tables 5 and 6 show similar results when we used ANN [13] for Stage 1. The tables show again clear and statistically significant improvements when using our two-stage approach against the traditional one-stage approach (that also uses ANN). The RMSE is reduced, and the correlation increased in all four datasets. It is worth noting that across all datasets, the performance of ANN is worse than LR; however, the use of text mining in Stage 2 still improved performance.

Table 5:

RMSE when ANN Regression Was Used in the First Stage of Our Two-Stage Regression Model.

Dataset	RMSE w/o text mining	RMSE with text mining
Renting apartment	44,648.549±6795.781	34,843.054±5676.464
Renting house	85,353.454±26,446.908	69,535.376±6847.498
Selling apartment	579,014.731±120,081.461	443,262.976±24,856.718
Selling house	1,193,741.188±568,449.993	920,461.846±144,612.999

Table 6:

Correlation (between Predicted and Actual Price) When ANN Regression Was Used in the First Stage of Our Two-Stage Regression Model.

Dataset	Corr. w/o text mining	Corr. with text mining
Renting apartment	0.811±0.034	0.871±0.027
Renting house	0.714±0.244	0.816±0.040
Selling apartment	0.835±0.024	0.878±0.016
Selling house	0.490±0.339	0.684±0.110

Tables 7 and 8 show similar results when we used SVMR [26] for the first stage of our approach. Similar to the previous regression models we evaluated, the tables show clear improvements when using our two-stage approach against the traditional one-stage approach (that also uses SVMR). Interestingly, using the traditional one-stage approach, SVMR is slightly worse than LR. However, using our two-stage approach and exploiting the textual features of ads, SVMR results in the lowest RMSE in all four datasets with a clear advantage over LR for the Selling House dataset.

Table 7:

RMSE When SVMR Was Used in the First Stage of our Two-Stage Regression Model.

Dataset	RMSE w/o text mining	RMSE with text mining
Renting apartment	37,550.226±1814.974	32,003.126±1676.801
Renting house	62,682.050±4487.318	57,377.165±3531.160
Selling apartment	467,869.878±13,612.336	415,388.299±18,916.617
Selling house	746,897.386±46,691.263	688,058.699±66,977.813

Table 8:

Correlation (between Predicted and Actual Price) when SVMR Was Used in the First Stage of Our Two-Stage Regression Model.

Dataset	Corr. w/o text mining	Corr. with text mining
Renting apartment	0.833±0.014	0.880±0.009
Renting house	0.843±0.02	0.871±0.018
Selling apartment	0.855±0.013	0.889±0.011
Selling house	0.802±0.038	0.829±0.050

To understand why text mining reduced the RMSE, we investigated the LR model in Stage 2 to identify which words affected the price positively or negatively. Table 9 lists a sample of the discovered important words. A few of the words that affected the price were related to the location. While the location attribute in the original dataset ^[10] did specify the location of the unit in a structured manner, only one location was allowed. Textual descriptions, on the other hand, frequently mention nearby prime locations and landmarks. For example, the Palm islands and Downtown are prime locations in Dubai, and therefore affect the price positively in several of the datasets. The Burj refers to Burj Khalifa, the tallest artificial structure in the world, which is a prime landmark with luxurious real estate surrounding it. The Sports city is a new real-estate project that is not fully developed yet (hence the reduction in rent). Most of the words that affected the price negatively made sense, such as deal, offer, road, price, and sale. Interestingly, some of the words affected the price negatively despite representing positive sentiment, such as spacious and nice. This is in contrast to words like amazing, stunning, and beautiful that also represent positive sentiment but affected the price positively. A possible explanation is that words with strong positive sentiment reflect strong confidence in the worthiness of the property and therefore affect the price positively. On the other hand, words with weak positive sentiment reflect weaker confidence in the actual worth of the real-estate property and hence affect the price negatively.

Table 9:

Sample of Words (Uni-gram and Bi-gram) that Affect the Price of a Real-Estate Classified (Either Positively or Negatively).

Dataset	Positive words	Negative words
Renting apartment	Study, suite, palm, amazing	Sale, sports, nice, deal
Renting house	Downtown, palm, luxurious, stunning	Partial, deal, offer, village
Selling apartment	Burj, quality, study, beautiful	Road, covered, cluster, plot
Selling house	Proud, finishing, golf, views	Hotel, fully, price, spacious

5 Related Work

A variety of machine-learning techniques were proposed to predict numerical (continuous) variables such as cost and time. For example, support vector regression [33] and neural networks [30] were used to predict travel time in transportation. Case-based reasoning was used to estimate the cost of road pavement projects [6]. Neural networks and support vector machines were used to predict electricity prices [32]. All the above approaches, however, focused on numerical attributes, did not handle textual features, and used only a single-stage model. While an early regression model did support a mixture of numeric and categorical attributes [8], the model assumed fixed uniform weights for all categorical attributes (unlike our proposed approach). For the classification task (as opposed to regression, the focus of our work), there has been recent work that handled mixed data (of categorical and numerical attributes) [23]. This includes an interesting recent work that proposed support-vector-oriented instance selection to improve text classification [29]. Another recent work used text mining to analyze and classify code structures in Android malware [28].

Our proposed two-stage model can be viewed as an ensemble regression technique. Ensemble learning is a process that builds a set of models derived from the same dataset and then integrates these models in some way to obtain the final prediction [19, 21]. However, unlike our proposed model, previous work in ensemble regression assumed only one type of model in any ensemble. For example, the ensemble contains only decision trees. The restriction of one model type per ensemble was necessary to allow the generation of an unbounded number of models. Our two-stage model, on the other hand, allows the model in Stage 1 to be completely different from the model in Stage 2. Furthermore, the vast majority of ensemble methods obtained different fitted models (of the same type) by either training on different samples of the data or focusing on different features [19]. Our two-stage model uses different target outputs for the two regression models. An ensemble method that is worth mentioning is iterative bagging [3]. In iterative bagging, the target output of an iteration is the error of the previous iteration (the difference between the target output and the prediction of the previous iteration). This is similar to the manner in which we devise the target output of Stage 2. However, iterative bagging used the same features and the same model type for all iterations. In contrast, our model uses a different set of features and a different model for each stage.

Due to the importance of valuating a real-estate property, there has been extensive research on automatic valuation [2, 4, 15, 18, 24, 25]. Most of that work used hedonic models, which assumed the price can be predicted from a combination (usually linear) of the property’s structured features [10], such as the number of bedrooms and the location. Traditional hedonic models were based on human expertise, where the model parameters were hand-coded by experts, unlike our proposed method here, which uses data mining. There has been growing literature on the use of data-mining techniques to analyze real-estate data [5, 11, 12, 14, 17]; however, most of the previous work focused only on simple features and ignored textual features. We review a sample of these works in the remainder of this section.

One of the early works [17] used decision tree and neural network techniques to predict the sale price of a house. The analysis used data with 15 numerical features that represent the characteristics of houses plus a categorical feature that corresponds to the address. The dataset consisted of 1000 records that were collected from house sales transactions in Miami, United States. Unlike our work, the analysis focused only on properties for sale (did not include rentals), used only simple features (no text mining), and relied on a much smaller dataset (compared to our 15,000+ records). A broader analysis was conducted in Ref. [31], covering 295,787 transactions from four cities in the United States. Again, only numerical features were used (although from a more extensive set of almost 200 features) and no textual features were used (also, despite attempting to predict the price, no performance criterion was reported). A more recent work [12] proposed ANFIS (Adaptive Neuro Fuzzy Inference System) and tested the system using 360 records of past property sales in the Midwestern United States. The dataset had 14 numerical features, and again no textual feature was used.

Another research paper [11] focused on studying the prediction of prices of apartments in a city in Macedonia. Among the three data-mining techniques that were applied to a dataset of 1200 sales transactions, logistic regression (very similar to LR) was found to be superior in prediction accuracy to decision tree and neural network techniques. Like the other earlier mentioned papers, there was no use of textual data. Some work attempted to add structure to unstructured and ungrammatical data. However, the work required domain knowledge to build a reference structure (model) that can be used to extract the corresponding features [20]. The work also was not applied to real-estate data. Our proposed approach does not require deep domain knowledge (aside from simple data cleansing, the whole process is automated).

The only work we know of that analyzed real-estate classifieds is very recent [27]. That work used textual features along with simple features in a single stage and therefore did not offer the flexibility and the semantic meaning of our two-stage model. It also did not provide a clear method for automatically highlighting important words.

6 Conclusion and Future Work

In this paper, we proposed a two-stage regression model that uses text mining to analyze data with a mix of simple and textual attributes. We showed how the proposed model can be used to highlight keywords that affect the prediction positively or negatively. To verify the approach, we used real-estate classifieds as a case study and collected four datasets. Three different regression models were evaluated against our two-stage regression model. The results confirmed that our model reduced the RMSE of price prediction (compared to one-stage models that use only simple features) across the four datasets and against the three different regression models. Furthermore, we introduced the first system for automatically recognizing and highlighting important textual terms in real-estate classifieds.
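As an illustration of the general idea summarized above, the following minimal Python sketch fits a first stage on a simple feature and then uses the text to explain what the first stage missed. It is a sketch under stated assumptions, not the implementation evaluated in this paper: the toy classifieds, the single simple feature (number of bedrooms), and the choice of scoring each word by the mean first-stage residual of the ads that contain it are all hypothetical simplifications introduced here.

```python
from math import sqrt

# Hypothetical toy classifieds: (bedrooms, description, price).
ADS = [
    (1, "cozy studio near metro", 50.0),
    (1, "studio with sea view", 70.0),
    (2, "spacious flat sea view", 110.0),
    (2, "flat near metro", 90.0),
    (3, "villa with sea view", 150.0),
    (3, "villa near highway", 130.0),
]


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def rmse(errors):
    """Root mean squared error of a list of prediction errors."""
    return sqrt(sum(e * e for e in errors) / len(errors))


# Stage 1: predict the price from the simple feature only.
beds = [ad[0] for ad in ADS]
prices = [ad[2] for ad in ADS]
a, b = fit_line(beds, prices)
residuals = [p - (a + b * x) for x, p in zip(beds, prices)]


def word_effects(ads, stage1_residuals):
    """Stage 2 (assumed form): mean stage-1 residual per word.

    A positive score means ads containing the word tend to be priced
    above what the simple features alone predict.
    """
    totals, counts = {}, {}
    for (_, text, _), r in zip(ads, stage1_residuals):
        for w in set(text.split()):
            totals[w] = totals.get(w, 0.0) + r
            counts[w] = counts.get(w, 0) + 1
    return {w: totals[w] / counts[w] for w in totals}


effects = word_effects(ADS, residuals)


def predict(bedrooms, text):
    """Two-stage prediction: stage-1 estimate plus average word effect."""
    known = [w for w in set(text.split()) if w in effects]
    adjustment = sum(effects[w] for w in known) / len(known) if known else 0.0
    return a + b * bedrooms + adjustment


two_stage_errors = [p - predict(x, t) for x, t, p in ADS]
print("stage-1 RMSE:  ", rmse(residuals))
print("two-stage RMSE:", rmse(two_stage_errors))
for word, score in sorted(effects.items(), key=lambda kv: kv[1]):
    print(f"{word:10s} {score:+.1f}")
```

Both RMSE values here are computed on the same toy ads used for fitting, so they only illustrate the mechanics. Sorting the words by score gives a simple way to highlight terms associated with higher or lower prices once the simple features have been accounted for.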

There are several interesting future directions that we are pursuing. We are currently working on extending our analysis to other datasets. This includes real-estate data from other continents (Europe and America). This also includes medical diagnosis datasets, where our approach can identify important keywords (in the textual diagnosis) that are correlated with the severity of some disease. We are also considering the integration of our system with a named-entity recognition component [1], particularly for identifying locations in ungrammatical text (to improve accuracy).

Acknowledgment

This work was supported in part by the British University in Dubai grant INF017.

Bibliography

[1] S. Abdallah, K. F. Shaalan and M. Shoaib, Integrating rule-based system with classification for Arabic named entity recognition, in: Computational Linguistics and Intelligent Text Processing – 13th International Conference, CICLing 2012, New Delhi, India, March 11–17, 2012, Proceedings, Part I, pp. 311–322, 2012. doi:10.1007/978-3-642-28604-9_26

[2] S. C. Bourassa, E. Cantoni and M. Hoesli, Predicting house prices with spatial dependence: a comparison of alternative methods, J. Real Estate Res. 32 (2010), 139–159. doi:10.1080/10835547.2010.12091276

[3] L. Breiman, Using iterated bagging to debias regressions, Mach. Learn. 45 (2001), 261–277 (English). doi:10.1023/A:1017934522171

[4] B. Case, J. Clapp, R. Dubin and M. Rodriguez, Modeling spatial and temporal house price patterns: a comparison of four models, J. Real Estate Finance Econ. 29 (2004), 167–191. doi:10.1023/B:REAL.0000035309.60607.53

[5] T. H. Chen and C. W. Chen, Application of data mining to the spatial heterogeneity of foreclosed mortgages, Expert Syst. Appl. 37 (2010), 993–997. doi:10.1016/j.eswa.2009.05.076

[6] J. S. Chou, Web-based CBR system applied to early cost budgeting for pavement maintenance project, Expert Syst. Appl. 36 (2009), 2947–2960. doi:10.1016/j.eswa.2008.01.025

[7] K. Crowston and R. T. Wigand, Real estate war in cyberspace: an emerging electronic market?, Int. J. Electron. Markets 9 (1999), 1–8. doi:10.1080/101967899359229

[8] C. M. Cuadras and C. Arenas, A distance based regression model for prediction with mixed data, Commun. Stat. Theor. Methods 19 (1990), 2261–2279. doi:10.1080/03610929008830319

[9] L. Duan and L. D. Xu, Business intelligence for enterprise systems: a survey, IEEE Trans. Indust. Inform. 8 (2012), 679–687. doi:10.1109/TII.2012.2188804

[10] J. Frew and G. Jud, Estimating the value of apartment buildings, J. Real Estate Res. 25 (2003), 77–86. doi:10.1080/10835547.2003.12091101

[11] Z. Gacovski, J. Kolic, R. Dukova and M. Markovski, Data mining application for real estate valuation in the city of Skopje, in: ICT Innovations 2012, Web Proceedings ISSN 1857–7288 (2012), 537–538.

[12] J. Guan, J. Zurada and A. S. Levitan, An adaptive neuro-fuzzy inference system based approach to real estate property assessment, J. Real Estate Res. 30 (2008), 395–422. doi:10.1080/10835547.2008.12091225

[13] L. K. Hansen and P. Salamon, Neural network ensembles, IEEE Trans. Pattern Anal. Mach. Intell. 12 (1990), 993–1001. doi:10.1109/34.58871

[14] M. Helbich, W. Brunauer, J. Hagenauer and M. Leitner, Data-driven regionalization of housing markets, Ann. Assoc. Am. Geogr. 103 (2013), 871–889. doi:10.1080/00045608.2012.707587

[15] M. Helbich, A. Jochem, W. Mücke and B. Höfle, Boosting the predictive accuracy of urban hedonic house price models through airborne laser scanning, Comput. Environ. Urban Syst. 39 (2013), 81–92. doi:10.1016/j.compenvurbsys.2013.01.001

[16] S. B. Humphries, D. Xiang, J. L. Burstein, Y. Bun and J. A. Ultis, Automatically determining a current value for a home, March 20 2012, US Patent 8,140,421.

[17] R. D. Jaen, Data mining: an empirical application in real estate valuation, in: Proceedings of the Fifteenth International Florida Artificial Intelligence Research Society Conference (FLAIRS), S. M. Haller, G. Simmons, eds., pp. 314–317, AAAI Press, 2002.

[18] S. McGreal and P. T. de La Paz, An analysis of factors influencing accuracy in the valuation of residential properties in Spain, J. Property Res. 29 (2012), 1–24. doi:10.1080/09599916.2011.589531

[19] J. Mendes-Moreira, C. Soares, A. M. Jorge and J. F. De Sousa, Ensemble approaches for regression: a survey, ACM Comput. Surv. (CSUR) 45 (2012), 10. doi:10.1145/2379776.2379786

[20] M. Michelson and C. A. Knoblock, Creating relational data from unstructured and ungrammatical data sources, J. Artif. Intell. Res. (JAIR) 31 (2008), 543–590. doi:10.1613/jair.2409

[21] D. Opitz and R. Maclin, Popular ensemble methods: an empirical study, J. Artif. Intell. Res. 11 (1999), 169–198. doi:10.1613/jair.614

[22] M. S. Pera, R. Qumsiyeh and Y. K. Ng, Web-based closed-domain data extraction on online advertisements, Inform. Syst. 38 (2013), 183–197. doi:10.1016/j.is.2012.07.006

[23] M. T. Rezvan, A. Z. Hamadani and A. Shalbafzadeh, Case-based reasoning for classification in the mixed data sets employing the compound distance methods, Eng. Appl. Artif. Intell. 26 (2013), 2001–2009. doi:10.1016/j.engappai.2013.07.014

[24] P. Rossini, Accuracy issues for automated and artificial intelligent residential valuation systems, in: Proceedings of the International Real Estate Society Conference, Kuala Lumpur, 26–30 January, 1999.

[25] P. Rossini, Using expert systems and artificial intelligence for real estate forecasting, in: Sixth Annual Pacific-Rim Real Estate Society Conference, Sydney, Australia, pp. 24–27, 2000.

[26] S. K. Shevade, S. S. Keerthi, C. Bhattacharyya and K. R. K. Murthy, Improvements to the SMO algorithm for SVM regression, IEEE Trans. Neural Netw. 11 (2000), 1188–1193. doi:10.1109/72.870050

[27] D. Stevens, Predicting real estate price using text mining, Master’s thesis, Tilburg University School of Humanities, the Netherlands, 2014.

[28] G. Suarez-Tangil, J. E. Tapiador, P. Peris-Lopez and J. Blasco, Dendroid: a text mining approach to analyzing and classifying code structures in Android malware families, Expert Syst. Appl. 41 (2014), 1104–1117. doi:10.1016/j.eswa.2013.07.106

[29] C. F. Tsai and C. W. Chang, SVOIS: support vector oriented instance selection for text classification, Inform. Syst. 38 (2013), 1070–1083. doi:10.1016/j.is.2013.05.001

[30] J. W. C. Van Lint, S. P. Hoogendoorn and H. J. van Zuylen, Accurate freeway travel time prediction with state-space neural networks under missing data, Transport. Res. Pt. C Emerg. Technol. 13 (2005), 347–369. doi:10.1016/j.trc.2005.03.001

[31] W. Wedyawati and M. Lu, Mining real estate listings using ORACLE data warehousing and predictive regression, in: Proceedings of the 2004 IEEE International Conference on Information Reuse and Integration, IRI, Las Vegas, Nevada, USA, pp. 296–301, 2004.

[32] R. Weron, Electricity price forecasting: a review of the state-of-the-art with a look into the future, Int. J. Forecast. 30 (2014), 1030–1081. doi:10.1016/j.ijforecast.2014.08.008

[33] C. H. Wu, J. M. Ho and D. T. Lee, Travel-time prediction with support vector regression, IEEE Trans. Intell. Transport. Syst. 5 (2004), 276–281. doi:10.1109/TITS.2004.837813

[34] H. C. Wu, R. W. P. Luk, K. F. Wong and K. L. Kwok, Interpreting TF-IDF term weights as making relevance decisions, ACM Trans. Inform. Syst. (TOIS) 26 (2008), 13. doi:10.1145/1361684.1361686

Received: 2016-7-4

Published Online: 2016-11-12

Published in Print: 2018-3-28

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

https://doi.org/10.1515/jisys-2016-0100

Creative Commons

BY-NC-ND 3.0