Sometimes you may want to transform a machine learning task from a regression problem into a classification problem. For example, suppose your expected target/ground truth consists of continuous values and you want to convert the task into a classification problem. How do you do it?
Look for a cut-off
First things first: look into the literature to see whether an established cut-off already exists. That's the easiest option. Suppose you are trying to predict someone's IQ (just an example): based on the test that was administered, it should be very easy to find a cut-off threshold in the literature. For example, literature related to the test could describe scores above 100 as good and below 100 as poor IQ. You could then assign class 1 to target values above 100 and class 0 to those below. Try classifying after that.
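Applying such a literature-backed threshold is a one-liner in pandas. A minimal sketch (the column name `iq` and the toy scores are made up for illustration):

```python
import pandas as pd

# Hypothetical IQ scores; binarize with the literature-backed cut-off of 100
# (above = class 1, otherwise class 0).
df = pd.DataFrame({'iq': [85, 100, 115, 92, 130]})
df['label'] = (df['iq'] > 100).astype(int)
print(df['label'].tolist())  # [0, 0, 1, 0, 1]
```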
But say there's no exact cut-off such that you can call anything above X class 1 and anything below it class 0. Then what? If there are no inherent classes backed up by the literature, the next thing you could try is splitting on the mean/median/a quantile of the sample you have. For example, take the mean of the continuous ground truth in the training dataset. Using that as a cut-off, mark anything above it as 1 and anything below it as 0. Apply the same cut-off to the test set (remember, we have to have the same distribution on both the training and testing datasets, so we use the same threshold for the test set as well). Try classifying based on it. It works, but it may not be the smartest way to convert the problem.
The choice of mean/median/quantile depends on the distribution of the ground truth and on how many classes you need. For example, if the data is normally distributed and has already been outlier-treated, you might want to go with the mean. If it is skewed and/or you need more than two classes, quantiles are a better option.
You can google “median split” for more ideas about this approach. Also make sure you are doing this with truly continuous values. Using such an approach with ordinal data (such as a Likert scale) doesn’t make sense: you cannot say that a median score of 3.4 on a Likert scale is a cut-off, because it is an ordinal scale.
In Python, you could convert to classes based on quantiles and the mean with something along the lines of:

if cur_ground_truth > final_df['ground_truth'].quantile(0.75):
    quantile = 3
elif cur_ground_truth < final_df['ground_truth'].quantile(0.25):
    quantile = 1
else:
    quantile = 2

mean = int(final_df['ground_truth'].mean())
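The per-row snippet above can also be written as a vectorised pandas operation over the whole column at once. A sketch, assuming a DataFrame `final_df` with a `ground_truth` column (the toy values are made up):

```python
import pandas as pd

# Hypothetical continuous ground truth.
final_df = pd.DataFrame({'ground_truth': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]})

# Quartile split: class 3 above the 75th percentile, class 1 below the
# 25th percentile, class 2 for everything in between.
q25 = final_df['ground_truth'].quantile(0.25)
q75 = final_df['ground_truth'].quantile(0.75)
final_df['quantile_class'] = 2
final_df.loc[final_df['ground_truth'] > q75, 'quantile_class'] = 3
final_df.loc[final_df['ground_truth'] < q25, 'quantile_class'] = 1

# Mean-based binary split on the same column.
cutoff = final_df['ground_truth'].mean()
final_df['mean_class'] = (final_df['ground_truth'] > cutoff).astype(int)
```

On a real problem you would compute `q25`, `q75`, and `cutoff` on the training set only and reuse them on the test set, as discussed above.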
Another approach is to first cluster the dataset into, say, 2 clusters and then use that separation as the class labels: assign class 1 to all data points that fall into one cluster and class 0 to the others. Then try classifying them. Be careful to select proper initial centroids: for example, set one centroid to the highest achievable ground-truth value and the other to the lowest achievable ground-truth value.
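A minimal sketch of this idea using scikit-learn's KMeans, seeding the two centroids at the lowest and highest observed ground-truth values (the toy data and variable names are assumptions, not from the original):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 1-D ground truth; reshaped because KMeans expects 2-D input.
ground_truth = np.array([10.0, 12.0, 11.0, 48.0, 50.0, 52.0]).reshape(-1, 1)

# Seed one centroid at the minimum and one at the maximum, so cluster 0
# becomes the "low" group and cluster 1 the "high" group.
init_centroids = np.array([[ground_truth.min()], [ground_truth.max()]])
km = KMeans(n_clusters=2, init=init_centroids, n_init=1, random_state=0)
labels = km.fit_predict(ground_truth)
print(labels)  # [0 0 0 1 1 1]
```

Because the centroids are seeded explicitly, the cluster numbering is stable: you know up front which cluster is "low" and which is "high", so the labels can be used directly as classes.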
Capture the dynamics instead
You could try changing the problem you’re investigating so that you predict the dynamics instead. Say, for example, you are trying to classify whether someone has a high or low IQ. Instead of taking the average over the whole sample and using that as a cut-off, you could try classifying whether a given person’s IQ goes down (class 0) or goes up (class 1). Think of it this way: you use, say, a year of time-series data from 100 people, with up or down as your class. Every week, you ask them to take an IQ test; if the score goes down, you assign class 0, and if it goes up, class 1 (your training set would be the features you collected, and your ground truth the up/down label). Your model will then be able to say whether a person’s IQ will go down or up based on that week’s data.
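The week-over-week labelling described above can be sketched in a few lines of pandas, following the convention of down → class 0, up → class 1 (the weekly scores here are made up for illustration):

```python
import pandas as pd

# Hypothetical weekly IQ scores for one person.
weekly_iq = pd.Series([100, 103, 101, 105, 104])

# Week-over-week change: up -> class 1, down -> class 0.
# The first week has no previous score, so it gets no label.
change = weekly_iq.diff()
labels = (change > 0).astype(int)[1:]
print(labels.tolist())  # [1, 0, 1, 0]
```

In the full setup you would pair each label with the features collected that week, stacked across all 100 people, and train the classifier on that.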