Sales forecasting is the process of estimating future sales revenue for a given time period. It is an essential business function that helps organizations make informed decisions about investments, production planning, and resource allocation, and it is vital for businesses that want to plan, budget, invest, evaluate performance, and gain a competitive advantage. It is important to note that some degree of uncertainty is always present and that no algorithm works perfectly for sales forecasting. However, by using a combination of methods and regularly reviewing and adjusting forecasts, organizations can produce more accurate sales predictions and make better-informed business decisions.

Our goal in this project is to compare the performance of various models for sales forecasting and to try an ensemble approach as well, to see whether efficiency and accuracy can be improved over the individual models. For a dataset the size of the Walmart dataset, it is very important to choose an algorithm that offers not only accuracy but also efficiency.

There are various machine learning models that can be used for sales prediction, including linear regression, decision trees, random forests, gradient boosting, neural networks, and support vector machines. We need to consider the strengths and weaknesses of each model and select the one that best fits the characteristics of the data and the problem we want to solve. The models we will implement for this study are XGBoost, LGBM, and CatBoost. All three are gradient boosting models that are extremely well suited to regression on large datasets. CatBoost uses symmetric trees, while XGBoost and LGBM use asymmetric trees: XGBoost grows its trees level-wise, whereas LGBM grows them leaf-wise. Each of the three models has its own advantages, which we need to utilize to see how much the accuracy can be improved. Among the three, XGBoost has the advantage of flexibility and scalability, and it can be customized to fit the best hyperparameters. LGBM is the best when it comes to memory efficiency as well as overall model efficiency. However, we expect CatBoost, a specialized variant of gradient-boosted decision trees (GBDT), to have the best overall accuracy. A sketch of how the three regressors might be set up is shown below.

The performance of a machine learning model must be evaluated using appropriate metrics such as mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), R-squared, and others; a sketch of an evaluation helper also follows below.

To utilize all the models and to try to reach the full potential of each, we will also implement a hybrid ensemble model in the form of a voting regressor. This helps offset any biases the individual models may have and enhances the overall performance. For the voting regressor, we will use the MAE values of each model to assign scores, as sketched in the final example below.
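As a rough illustration of the three base models described above, the snippet below instantiates one regressor from each library. This is a minimal sketch: the hyperparameter values are placeholders for illustration, not tuned settings for the Walmart data.

```python
# Minimal sketch: one regressor from each of the three libraries compared in
# this study. Hyperparameter values are illustrative placeholders, not tuned.
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

models = {
    # XGBoost grows trees level-wise; highly flexible and scalable.
    "xgb": XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=8),
    # LightGBM grows trees leaf-wise; strong memory and runtime efficiency.
    "lgbm": LGBMRegressor(n_estimators=500, learning_rate=0.05, num_leaves=63),
    # CatBoost builds symmetric (oblivious) trees.
    "cat": CatBoostRegressor(iterations=500, learning_rate=0.05, depth=8,
                             verbose=0),
}
```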
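The metrics named above are all available in scikit-learn. A small helper along the following lines could fit a model and report each metric; the function name and the train/test variables (`X_train`, `y_train`, `X_test`, `y_test`) are our own assumptions for the sketch.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit a regressor and return the error metrics used in this study."""
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    mse = mean_squared_error(y_test, preds)
    return {
        "MAE": mean_absolute_error(y_test, preds),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),  # root of the mean squared error
        "R2": r2_score(y_test, preds),
    }
```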
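Finally, a sketch of the MAE-weighted voting regressor, reusing the `models` dictionary and `evaluate()` helper from the sketches above. The exact mapping from MAE to voting score is not specified in this plan; inverse MAE is assumed here as one plausible scheme, so that lower-error models receive larger weights.

```python
from sklearn.ensemble import VotingRegressor

# Validation MAE of each base model, computed with the evaluate() helper above.
maes = {name: evaluate(model, X_train, y_train, X_test, y_test)["MAE"]
        for name, model in models.items()}

# Assumed scoring scheme: inverse MAE, so models with lower error get larger
# voting weights. (Only the use of MAE values is fixed; the mapping is ours.)
weights = [1.0 / maes[name] for name in models]

# VotingRegressor averages the base models' predictions using these weights.
ensemble = VotingRegressor(estimators=list(models.items()), weights=weights)
ensemble.fit(X_train, y_train)
ensemble_preds = ensemble.predict(X_test)
```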