Data Science For Dummies. Lillian Pierson
it’s represented by the number 0, for “no.”
The best way to gain a solid grasp on MCDM is to see how it’s used to solve a real-world problem. MCDM is commonly used in investment portfolio theory. Pricing of individual financial instruments typically reflects the level of risk you incur, but an entire portfolio can be a mixture of virtually riskless investments (US government bonds, for example) and minimum-, moderate-, and high-risk investments. Your level of risk aversion dictates the general character of your investment portfolio. Highly risk-averse investors seek safer and less lucrative investments, and less risk-averse investors choose riskier, more lucrative investments. In the process of evaluating the risk of a potential investment, you’d likely consider the following criteria:
Earnings growth potential: If you use a binary variable to score earnings growth potential, an investment that falls under a specific earnings growth potential threshold gets scored as 0 (as in “no, the potential is not enough”); anything higher than that threshold gets a 1 (for “yes, the potential is adequate”).
Earnings quality rating: If you use a binary variable to score earnings quality ratings, an investment falling within a particular ratings class for earnings quality gets scored as 1 (for “yes, the rating is adequate”); otherwise, it gets scored as a 0 (as in “no, its earnings quality rating is not good enough”). For you non-Wall Street types out there, earnings quality refers to various measures used to determine how kosher a company’s reported earnings are; such measures attempt to answer the question, “Do these reported figures pass the smell test?”
Dividend performance: If you use a binary variable to score dividend performance, an investment that fails to reach a set dividend performance threshold gets a 0 (as in “no, its dividend performance is not good enough”); if it reaches or surpasses that threshold, it gets a 1 (for “yes, the performance is adequate”).
Imagine that you’re evaluating 20 different potential investments. In this evaluation, you’d score each criterion for each of the investments. To eliminate poor investment choices, simply sum the criteria scores for each of the alternatives and then dismiss any investments that don’t earn a total score of 3, leaving you with the investments that fall within a certain threshold of earnings growth potential, that have good earnings quality, and whose dividends perform at a level that’s acceptable to you.
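To make the scoring concrete, here is a minimal Python sketch of the binary screen described above. The four candidate investments, their figures, and the three cutoffs (MIN_GROWTH, ACCEPTABLE_RATINGS, MIN_DIVIDEND) are all made up for illustration; they are assumptions, not values from any real portfolio.

```python
# Hypothetical candidates: (name, earnings growth %, earnings quality rating, dividend yield %)
investments = [
    ("Fund A", 8.0, "A", 3.1),
    ("Fund B", 2.5, "B", 4.0),
    ("Fund C", 6.2, "C", 1.2),
    ("Fund D", 7.1, "B", 2.6),
]

# Assumed cutoffs, chosen only for this example
MIN_GROWTH = 5.0
ACCEPTABLE_RATINGS = {"A", "B"}
MIN_DIVIDEND = 2.5


def score(growth, rating, dividend):
    """Return one binary score (0 or 1) per criterion."""
    return (
        int(growth >= MIN_GROWTH),           # earnings growth potential
        int(rating in ACCEPTABLE_RATINGS),   # earnings quality rating
        int(dividend >= MIN_DIVIDEND),       # dividend performance
    )


# Keep only the investments that earn a 1 on all three criteria (total score of 3)
shortlist = [
    name
    for name, growth, rating, dividend in investments
    if sum(score(growth, rating, dividend)) == 3
]

print(shortlist)  # ['Fund A', 'Fund D']
```

Fund B drops out on growth, and Fund C drops out on both quality and dividends, so only two of the four candidates survive the screen.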
For some hands-on practice doing multiple criteria decision-making, go to the companion website to this book (www.businessgrowth.ai) and check out the MCDM practice problem I’ve left for you there.
Focusing on fuzzy MCDM
If you prefer to evaluate suitability within a range, rather than use binary membership terms of 0 or 1, you can use fuzzy multiple criteria decision-making (FMCDM) to do that. With FMCDM you can evaluate all the same types of problems as you would with MCDM. The term fuzzy refers to the fact that the criteria being used to evaluate alternatives offer a range of acceptability — instead of the binary, crisp set criteria associated with traditional MCDM. Evaluations based on fuzzy criteria lead to a range of potential outcomes, each with its own level of suitability as a solution.
One important feature of FMCDM: You’re likely to have a list of several fuzzy criteria, but these criteria might not all hold the same importance in your evaluation. To correct for this, simply assign weights to criteria to quantify their relative importance.
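As a rough illustration of weighted fuzzy criteria, the following Python sketch scores one candidate with membership degrees between 0 and 1 and then combines them with weights. The ramp membership function, the weights, and the candidate's figures are assumptions picked for this example, and a weighted average is only one simple way to aggregate fuzzy scores.

```python
def ramp(value, low, high):
    """Linear membership: 0 at or below `low`, 1 at or above `high`."""
    if value <= low:
        return 0.0
    if value >= high:
        return 1.0
    return (value - low) / (high - low)


def suitability(memberships, weights):
    """Weighted average of membership degrees (weights are normalized)."""
    total = sum(weights.values())
    return sum(weights[c] * memberships[c] for c in weights) / total


# Assumed relative importance of each criterion
weights = {"growth": 0.5, "quality": 0.3, "dividend": 0.2}

# Hypothetical candidate: each criterion is scored within a range, not as 0 or 1
candidate = {
    "growth": ramp(6.0, low=2.0, high=10.0),   # 0.5
    "quality": 0.8,                            # assumed analyst judgment
    "dividend": ramp(3.0, low=1.0, high=5.0),  # 0.5
}

overall = suitability(candidate, weights)
print(round(overall, 2))  # 0.59
```

Because growth carries half the total weight here, a change in the growth membership moves the overall suitability more than the same change in dividends would.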
Introducing Regression Methods
Machine learning algorithms of the regression variety were adopted from the statistics field in order to provide data scientists with a set of methods for describing and quantifying the relationships between variables in a dataset. Use regression techniques if you want to determine the strength of correlation between variables in your data. As for using regression to predict future values from historical values, feel free to do it, but be careful: Regression methods assume that the relationships found in historical data will continue to hold, but present circumstances are always subject to flux. Predicting future values from historical ones can generate incorrect results when circumstances change. In this section, I tell you all about linear regression, logistic regression, and the ordinary least squares method.
Linear regression
Linear regression is a machine learning method you can use to describe and quantify the relationship between your target variable, y (the predictand, in statistics lingo), and the dataset features you’ve chosen to use as predictor variables (commonly designated as dataset X in machine learning). When you use just one variable as your predictor, linear regression is as simple as the middle school algebra formula y = mx + b. A classic example of linear regression is its usage in predicting home prices, as shown in Figure 4-6. You can also use linear regression to quantify correlations between several variables in a dataset, which is called multiple linear regression. Before getting too excited about using linear regression, though, make sure you’ve considered its limitations:
Linear regression works with only numerical variables, not categorical ones.
If your dataset has missing values, it will cause problems. Be sure to address your missing values before attempting to build a linear regression model.
If your data has outliers present, your model will produce inaccurate results. Check for outliers before proceeding.
The linear regression model assumes that a linear relationship exists between dataset features and the target variable.
The linear regression model assumes that all features are independent of each other.
Prediction errors, or residuals, should be normally distributed.
Credit: Python for Data Science Essential Training Part 2, LinkedIn.com
FIGURE 4-6: Linear regression used to predict home prices based on the number of rooms in a house.
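For a feel of what fitting y = mx + b involves with a single predictor, here is a minimal Python sketch that fits the line by ordinary least squares. The room counts and prices are invented for illustration and are not the data behind Figure 4-6.

```python
# Made-up data: number of rooms and sale price (in thousands of dollars)
rooms = [3, 4, 5, 6, 7]
prices = [150, 200, 240, 310, 350]


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns slope m and intercept b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    m = s_xy / s_xx
    b = mean_y - m * mean_x
    return m, b


m, b = fit_line(rooms, prices)
print(m, b)  # 51.0 -5.0

# Predict the price of a hypothetical 8-room house
predicted = m * 8 + b
print(predicted)  # 403.0

# Residuals are the prediction errors on the observed data
residuals = [y - (m * x + b) for x, y in zip(rooms, prices)]
```

In this toy dataset, each additional room adds about 51 (thousand dollars) to the predicted price. With real data, you would inspect the residuals to check the assumptions listed earlier before trusting the fit.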
Don’t forget dataset size! A good rule of thumb is that you should have at least 20 observations per predictive feature if you expect to generate reliable results using linear regression.
Logistic regression
Logistic regression is a machine learning method you can use to estimate values for a categorical target variable based on your selected features. Your target variable should be numeric and should contain values that describe the target’s class — or category. One cool aspect of logistic regression is that, in addition to predicting the class of observations in your target variable, it indicates the probability for each of its estimates. Though logistic regression is like linear regression, its requirements are simpler, in that:
There doesn't need to be a linear relationship between the features and target variable.
Residuals don’t have to be normally distributed.
Predictive features aren’t required to have a normal distribution.
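To show how logistic regression yields both a predicted class and a probability, here is a minimal from-scratch Python sketch with a single feature. The study-hours data, the learning rate, and the number of training passes are assumptions made for this example; in practice you would more likely reach for a library implementation than write the training loop yourself.

```python
import math

# Made-up data: hours studied and whether the student passed (1) or not (0)
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 0, 1, 0, 1, 1, 1]


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit weight w and bias b by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Difference between predicted probability and actual class, per observation
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict_proba(x, w, b):
    """Estimated probability that the observation belongs to class 1."""
    return sigmoid(w * x + b)


def predict_class(x, w, b, threshold=0.5):
    """Predicted class: 1 if the probability reaches the threshold, else 0."""
    return int(predict_proba(x, w, b) >= threshold)


w, b = fit_logistic(hours, passed)

for h in (1, 8):
    print(h, predict_class(h, w, b), round(predict_proba(h, w, b), 3))
```

The class tells you which side of the threshold an observation falls on, and the probability tells you how confident the model is in that call.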
When deciding whether logistic regression is a good choice for you, consider the following limitations:
Missing values should be treated or removed.