Some Things Considered

Autism Social Media Analysis

2016-10-18T00:00:00+00:00

I haven’t posted here recently, partially due to work that I’ve been doing with social media analysis applied to autism solution providers. A blog related to that work can be found here.

Insights Blog: Realtor Comments and Property Condition

2016-10-17T00:00:00+00:00

Back in July, I added an entry to CoreLogic’s Insights blog looking at realtor comments for distressed listings.

I made a few changes compared to the earlier post, such as adding zip-level control variables within the county. In the previous post, words associated with neighborhoods dominated the bag of words model. With the inclusion of zip dummy variables, the text variables were more likely to pick up property condition or amenity effects.

Insights Blog: Realtor Comments and Hedonic Regression

2016-01-18T00:00:00+00:00

I recently added an entry to CoreLogic’s Insights blog which examined realtor comments and property listings. The analysis was fairly high level, but it provided an introduction to how information contained in listing agent comments could improve house price estimation via hedonic regression. Specifically, I regressed the log of house price on the number of bedrooms, the number of bathrooms, living space, and a bag of words based on the realtor comments. More information can be found on CoreLogic’s Insights blog page.

The R code estimating the regression (using glmnet) and generating word clouds of the most important words can be found here.
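As a rough illustration of the setup (in Python, and separate from the linked R code), the sketch below builds a design matrix from structural features plus binary bag-of-words indicators, with log price as the response. The sample comments, the feature values, the `min_listings` cutoff, and all function names are hypothetical; the penalized regression itself is not shown.

```python
import math
import re
from collections import Counter


def tokenize(comment):
    """Lowercase a comment and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", comment.lower())


def bag_of_words(comments, min_listings=2):
    """Return (vocab, rows): one 0/1 indicator per word per comment.

    Only words appearing in at least `min_listings` comments are kept
    (an assumed cutoff, chosen here just to keep the example small).
    """
    doc_freq = Counter()
    for comment in comments:
        doc_freq.update(set(tokenize(comment)))
    vocab = sorted(w for w, n in doc_freq.items() if n >= min_listings)
    rows = []
    for comment in comments:
        present = set(tokenize(comment))
        rows.append([1 if w in present else 0 for w in vocab])
    return vocab, rows


# Hypothetical listings: (bedrooms, bathrooms, living space in sq ft)
structural = [(3, 2.0, 1500), (2, 1.0, 900), (4, 3.0, 2200)]
prices = [300000, 180000, 450000]
comments = [
    "Granite counters and a new roof",
    "New roof, needs paint",
    "Granite counters, pool",
]

vocab, rows = bag_of_words(comments, min_listings=2)
X = [list(s) + r for s, r in zip(structural, rows)]  # regressors
y = [math.log(p) for p in prices]                    # log of house price
print(vocab)
```

With many listings the word columns quickly outnumber the structural ones, which is where a penalized estimator such as glmnet becomes useful.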

I also created code to identify listings containing a specific word and then print the entire realtor comment containing the word. This was useful in examining the context in which a word was used (e.g. “swamp” was used to describe a cooling method, as in swamp coolers). Code can be found here.
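The idea can be sketched in a few lines of Python. This is not the linked code; the listing IDs, comments, and the `comments_containing` helper are made up for illustration, and the match is assumed to be whole-word and case-insensitive.

```python
import re


def comments_containing(listings, word):
    """Return (listing_id, comment) pairs whose comment contains `word`.

    The match is case-insensitive and on whole words only, so searching
    for "swamp" does not match "swampland".
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [(lid, text) for lid, text in listings if pattern.search(text)]


# Hypothetical sample data
listings = [
    ("A1", "Updated kitchen, new swamp cooler installed last summer."),
    ("A2", "Backs onto a quiet greenbelt; granite counters throughout."),
    ("A3", "Lot borders swampland, sold as-is."),
]

for lid, text in comments_containing(listings, "swamp"):
    print(lid, text)
```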

Insights Blog: Property Tax Overview

2016-01-01T00:00:00+00:00

I recently contributed an entry to CoreLogic’s Insights blog. The blog entry provided a brief overview of property taxes and summarized national property tax delinquency trends over time. The entry wasn’t overly technical, but property tax delinquency is an interesting topic for analysis. We’ve found that property tax delinquency can be used to predict future mortgage delinquency. An inability to pay property taxes can indicate financial stress for the homeowner, which could carry over to difficulty meeting mortgage payment obligations. However, it is important to account for the fact that property tax penalties vary by municipality. Tax penalties include interest on delinquent taxes and, eventually, possible foreclosure on the property. In some cases, the interest penalty may be very low and the time period until foreclosure is initiated may be very long. In such instances, homeowners may strategically elect to enter into some period of delinquency (for example, investors who choose to use funds elsewhere), and the correlation between tax delinquency and future mortgage delinquency would be weak. Analysis that examines the link between tax and mortgage delinquency should account for municipality-specific penalty factors.

Social Network Analysis

2015-12-21T00:00:00+00:00

This page contains an interesting application of social network analysis applied to characters in the Star Wars movies. I’ve had limited exposure to social network analysis, but I think that an interesting application could be examining home appraisal data. When estimating the value of a home that is for sale, appraisers select nearby recent sales to serve as points of reference. Data on the nearby homes used as comps for a subject property could be used to evaluate how the homes are linked. This could provide information on things like neighborhood boundaries. Appraisers who select the comparable properties are familiar with the neighborhood, and may avoid properties that are geographically close but separated by other factors (e.g. school district, railroad tracks, etc.). An examination of properties that are geographically close but not closely related in terms of network analysis could provide insight into neighborhood boundaries. This is something that I’d like to examine in the upcoming year.
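A minimal sketch of that idea, under several assumptions: each appraisal is reduced to a subject property and its list of comps, "linked" is taken to mean "in the same connected component of the comp graph", and coordinates are simple planar units. The property IDs, coordinates, distance threshold, and function names are all hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations
from math import hypot


def build_graph(appraisals):
    """Undirected graph: an edge joins each subject property to its comps."""
    graph = defaultdict(set)
    for subject, comps in appraisals.items():
        graph[subject]  # make sure the subject exists even with no comps
        for comp in comps:
            graph[subject].add(comp)
            graph[comp].add(subject)
    return graph


def components(graph):
    """Label every property with an ID for its connected component (BFS)."""
    label = {}
    for start in graph:
        if start in label:
            continue
        label[start] = start
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in label:
                    label[nbr] = start
                    queue.append(nbr)
    return label


def close_but_unlinked(coords, label, max_dist):
    """Pairs within `max_dist` of each other that sit in different components."""
    pairs = []
    for a, b in combinations(sorted(coords), 2):
        (xa, ya), (xb, yb) = coords[a], coords[b]
        near = hypot(xa - xb, ya - yb) <= max_dist
        if near and label.get(a) != label.get(b):
            pairs.append((a, b))
    return pairs


# Hypothetical data: h1-h3 on one side of the tracks, h4-h5 on the other
coords = {"h1": (0, 0), "h2": (0, 1), "h3": (1, 0), "h4": (2, 0), "h5": (2, 1)}
appraisals = {"h1": ["h2", "h3"], "h4": ["h5"]}

label = components(build_graph(appraisals))
print(close_but_unlinked(coords, label, max_dist=1.0))
```

Here h3 and h4 are neighbors on the map but never appear in the same chain of comps, which is the kind of pair that might mark a boundary. With real data, a softer measure than connected components (for example, path length or community detection) would probably be needed.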

CoreLogic Blogging in 2016

2015-12-11T00:00:00+00:00

One of my goals in 2016 is to contribute on a regular basis to the economic and real estate blog at work. The company doesn’t allocate much time for independent research, but it’s possible to write blog posts that relate to analysis performed as part of the job. For example, I’ve drafted a few posts related to property tax delinquency rates and mortgage servicer performance, which should be published in early 2016. It’s also possible to use ‘nights and weekends’ to write something that is less directly tied to my ‘day job’. Given CoreLogic’s wealth of data, there are a lot of interesting questions that can be investigated. One of the first things that I’d like to write about is using realtor comments in MLS data to improve home price estimation. Hopefully, I will be able to publish a few blog posts on that topic in early 2016. There are several papers that have examined the topic, but it would be interesting to write code to gain hands-on experience and to see how my results compare to the existing literature.

KDD Cup 2015

2015-12-07T00:00:00+00:00

Analytic competitions, such as those hosted by Kaggle, provide opportunities to test and compare different algorithms for building predictive models. A few months ago, I participated in the 2015 KDD Cup with a colleague at CoreLogic.

The KDD Cup is a competition associated with the annual Knowledge Discovery and Data Mining conference. The topic of the 2015 KDD Cup was predicting dropouts in Massive Open Online Courses (MOOCs). Dropout rates in MOOCs are very high, and numerous research papers have investigated reasons for this behavior.

Data were provided by XuetangX, a Chinese MOOC learning platform. The outcome metric was course dropout during the next 10 days. Information was provided about the course and student activity over time. Student information included a record over time of participation in various aspects of the course (discussion forum, quizzes, etc.). A student ID was provided and could be used to link records for a given student across courses. This made it possible to calculate metrics such as student-level completion rates across courses for use as model inputs.
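One way such a cross-course metric could be built is sketched below. It assumes one record per student and course with a 0/1 completion flag, and it excludes the course being predicted so that the feature does not contain its own outcome. That leave-one-out choice, the record layout, and the function name are assumptions for illustration rather than details of the competition data.

```python
from collections import defaultdict


def other_course_completion_rate(enrollments):
    """Map (student_id, course_id) to the student's completion rate elsewhere.

    `enrollments` is a list of (student_id, course_id, completed) tuples with
    completed in {0, 1}. Students with no other courses get None.
    """
    totals = defaultdict(lambda: [0, 0])  # student -> [courses, completions]
    for student, _course, completed in enrollments:
        totals[student][0] += 1
        totals[student][1] += completed

    rates = {}
    for student, course, completed in enrollments:
        n_courses, n_completed = totals[student]
        if n_courses > 1:
            rates[(student, course)] = (n_completed - completed) / (n_courses - 1)
        else:
            rates[(student, course)] = None
    return rates


# Hypothetical records
enrollments = [
    ("s1", "c1", 1),
    ("s1", "c2", 0),
    ("s1", "c3", 1),
    ("s2", "c1", 0),
]
rates = other_course_completion_rate(enrollments)
print(rates)
```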

The analytic team at CoreLogic had previously used the Kaggle bike sharing demand competition as a team-building exercise. In that exercise, I had taken the approach of segmenting the data (by casual vs registered bike user, workday vs non-workday, and season) and then estimating a negative binomial regression for each segment (using the glm.nb function in R). My colleague used a generalized boosted regression model (GBM) in his approach. Blending our models to predict bike demand produced better results than either of the individual approaches.

Given our favorable results in the bike sharing contest, we decided to take a similar approach in the KDD Cup 2015 contest. My colleague once again used GBM to estimate his models. My approach was to segment the data and use logistic regression to estimate the dropout probability. I segmented the data by course, which removed any course-level effects from the estimation process (i.e. there was no course-level variation within a regression dataset). An additional important factor was the number of times a student logged on to the course site. A significant number of students logged on to the course only once. These students exhibited a significantly higher dropout rate than students who logged on multiple times. It was also possible to generate additional explanatory variables (features) for students with multiple logons (e.g. time between logons, number of logons by type of logon). For this reason, I also segmented the data by multiple vs single logon.
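The segmentation can be sketched as follows. The original preprocessing was done in SAS and is not shown in the post, so this Python version is only an illustration: the record layout, the feature names, and the helper functions are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def logon_features(logons):
    """Features for one enrollment, given a list of logon datetimes."""
    times = sorted(logons)
    gaps = [(b - a).total_seconds() / 86400 for a, b in zip(times, times[1:])]
    return {
        "n_logons": len(times),
        # Only defined when there are at least two logons
        "mean_gap_days": sum(gaps) / len(gaps) if gaps else None,
    }


def segment_key(course_id, features):
    """One regression dataset per course and per single/multiple logon group."""
    group = "multiple" if features["n_logons"] > 1 else "single"
    return (course_id, group)


# Hypothetical logon records keyed by (student_id, course_id)
activity = {
    ("s1", "c1"): [datetime(2015, 1, 1), datetime(2015, 1, 3), datetime(2015, 1, 4)],
    ("s2", "c1"): [datetime(2015, 1, 2)],
    ("s3", "c2"): [datetime(2015, 1, 1), datetime(2015, 1, 2)],
}

segments = defaultdict(list)
for (student, course), logons in activity.items():
    feats = logon_features(logons)
    segments[segment_key(course, feats)].append((student, feats))

for key in sorted(segments):
    print(key, segments[key])
```

Each segment would then get its own logistic regression, with gap-based features available only in the "multiple" segments.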

I used the glmnet package in R to estimate a logistic regression with a penalty function. The glmnet function can be used to estimate Lasso regression, ridge regression, as well as a combination of the two. My intention in using glmnet was to use Lasso, or a function of Lasso, for variable reduction.
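For reference, the penalized objective that the glmnet documentation gives for logistic regression can be written as below. The notation (N observations, intercept β₀, coefficients β, penalty weight λ, mixing weight α) follows the package documentation rather than anything in this post. Setting α = 1 gives the lasso, α = 0 gives ridge, and values in between blend the two.

```latex
\min_{\beta_0,\,\beta}\;
  -\frac{1}{N}\sum_{i=1}^{N}
    \Bigl[\, y_i\,(\beta_0 + x_i^{\top}\beta)
           - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr) \Bigr]
  \;+\; \lambda\Bigl[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
                     + \alpha\,\lVert\beta\rVert_1\Bigr]
```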

I’ve uploaded the R code for estimating glmnet here. I did some preprocessing of the data using SAS, so the R code highlights the use of glmnet, rather than being complete code that reads in the raw data and creates features.

There are several points to highlight from the code:

I selected area under the curve (AUC) to evaluate model performance. glmnet allows you to choose from several evaluation metrics. The competition used AUC, so I was able to evaluate my models with the same metric used to judge the competition results.
I used K-fold cross-validation to evaluate model performance. K-fold cross-validation divides the data into K subsets, estimates the model on K-1 of them, validates on the remaining subset, repeats so that each subset serves once as the validation set, and averages the validation results. The package uses the cross-validation results to select the best value of the penalty parameter.
I used a combination of lasso and ridge regression (alpha = 0.5). I also tried alpha = 1 (lasso) and alpha = 0 (ridge), but 0.5 provided the best CV results. My understanding is that the c060 package can perform alpha tuning, but I did not have time to investigate.
K-fold cross-validation results can vary across draws of the data, especially for small datasets. For this reason, the process was repeated 50 times, and the median result was selected.
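The evaluation loop described in these points can be sketched in plain Python. This is not the uploaded R code and it does not call glmnet: the model is a trivial one-feature stand-in, the folds are stratified by class (an assumption, so every fold contains both outcomes), and "median result" is interpreted here as the median of the mean cross-validated AUC across repeats. All names and the synthetic data are hypothetical.

```python
import random
import statistics


def auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(y, k, rng):
    """Shuffle each class separately and deal its members across k folds."""
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        members = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds


def repeated_cv_auc(x, y, fit, predict, k=5, repeats=50, seed=0):
    """Median over `repeats` random splits of the mean K-fold AUC."""
    rng = random.Random(seed)
    results = []
    for _ in range(repeats):
        fold_aucs = []
        for fold in stratified_folds(y, k, rng):
            held_out = set(fold)
            train = [i for i in range(len(y)) if i not in held_out]
            model = fit([x[i] for i in train], [y[i] for i in train])
            scores = [predict(model, x[i]) for i in fold]
            fold_aucs.append(auc([y[i] for i in fold], scores))
        results.append(sum(fold_aucs) / k)
    return statistics.median(results)


def fit_direction(xs, ys):
    """Stand-in model: learn only whether higher x means more positives."""
    pos = [v for v, label in zip(xs, ys) if label == 1]
    neg = [v for v, label in zip(xs, ys) if label == 0]
    return 1.0 if sum(pos) / len(pos) >= sum(neg) / len(neg) else -1.0


def predict_direction(direction, v):
    return direction * v


# Synthetic data: one noisy feature, balanced classes
data_rng = random.Random(1)
y = [i % 2 for i in range(200)]
x = [data_rng.gauss(1.0 if label else 0.0, 1.0) for label in y]
print(repeated_cv_auc(x, y, fit_direction, predict_direction))
```

Taking the median across repeats dampens the effect of any single lucky or unlucky split, which matters most for the smaller segments.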

This approach resulted in AUC metrics that were 0.01 to 0.02 below my colleague’s approach of using GBM. Given the AUC differences, combining the model approaches did not improve results relative to using GBM alone. My guess is that glmnet did not perform as well as GBM because, with glmnet, nonlinearities and variable interactions need to be specified explicitly as model input variables, whereas GBM can find nonlinearities and interactions on its own.

Overall, the competition was a fun and useful exercise. I’m interested in learning more about using Lasso in the presence of collinearity among the features (explanatory variables). When I tried using all of the features as inputs to the glmnet equation, the results were worse than when I selected a subset, removing highly correlated variables. If Lasso can be used as a method of variable selection, it would be interesting to learn how much preprocessing of the data is required.

Setting up

2015-11-30T00:00:00+00:00

I have very little experience working with web pages, but there is a lot of available info about using Jekyll and Poole to create a site on GitHub. I used Creating and Hosting a Personal Site on GitHub by Jonathan McGlone for the initial setup of this site. How I Created a Beautiful and Minimal Blog Using Jekyll, Github Pages, and poole by Joshua Lande also contains useful tips.