“When life gives you lemons, make lemonade” has now become “when you have data, make decisions with it.”
As many other blog posts have noted, data is accumulating at an almost unimaginable rate. A recent study from Northeastern University states that 2.5 exabytes of data are produced every single day, which is equivalent to:
- 530,000,000 Million Songs
- 150,000,000 iPhones
- 5,000,000 Laptops
- 250,000 Libraries of Congress
- 90 Years of HD Videos
Now that is definitely a lot. If this is what the entire world produces, wouldn’t your organization contribute significantly to it too? It would, without a doubt. So what should this data do? A small or medium-sized business that starts generating a lot of money (and data) at its financial plateau typically puts all that money back into the business to make more money. In the same way, use all the collected data to create more data, but data that is more meaningful, that talks about the business and its scalability.
Data analytics is no longer a buzzword. It is becoming prevalent in every imaginable vertical. People and organizations do not pick it up because it is, or was, a buzzword; they pick it up because establishments see its potential. Thanks to the contributions this discipline receives from open source, it is becoming available to everyone. All that is required to communicate with the data is the willingness to do so, along with a somewhat niche skillset. Look at the following Venn diagram.
The simplest definition of data analytics is that it is the intersection of mathematics, computer science, and domain expertise.
For the majority of businesses, the ability to make data-driven decisions comes from the quality of the data available and how wisely it is used. For instance, to predict the marketing ROI of a digital campaign your firm decides to run on a selective product line, it is more important to consider past campaign results and the attributes you believe have improved in the targeted campaign than to consider overall sales data and deficient products per batch, which would make no sense and get us nowhere. It is like comparing apples with oranges. This is where domain expertise plays a pivotal role. A wind engineer, not a financial analyst who is good with balance sheets, would know what factors to consider when setting up windmills at a given geo-spatial location, even though both know the math and have the computer skills.
Statistics forms the basis of analytics, and at its core it is about testing hypotheses, or arguments. A few essential statistical concepts go into helping the organization make decisions, and there is an order:
- Data collection
- Formulation of Argument
- Selection of Test Statistic
- Computation and Testing
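As a minimal sketch of the four steps above, here is what they might look like in code. Python and SciPy are my choice of illustration here (the post has not yet introduced a tool at this point), and the “before/after campaign” figures are invented for the example:

```python
import numpy as np
from scipy import stats

# 1. Data collection (illustrative figures, not real campaign data)
rng = np.random.default_rng(42)
before = rng.normal(loc=100, scale=10, size=30)  # e.g. daily sales before a campaign
after = rng.normal(loc=108, scale=10, size=30)   # e.g. daily sales after a campaign

# 2. Formulation of the argument (hypotheses):
#    H0: the campaign made no difference; H1: mean sales changed.

# 3. Selection of the test statistic: a two-sample t-test
t_stat, p_value = stats.ttest_ind(after, before)

# 4. Computation and testing at a 5% significance level
decision = "reject H0" if p_value < 0.05 else "fail to reject H0"
print(t_stat, p_value, decision)
```

The same workflow applies whatever the test statistic is; only step 3 changes.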
Begin with hypothesis testing, the chi-squared test, the t-test, and analysis of variance (ANOVA). It is also important to know linear, univariate, multivariate, and logistic regression techniques, for each has unique use cases depending on the type of business problem and the predictor variables under study. As a rule of thumb, a minimum of 30 observations with their related attributes is considered a large enough dataset for all these approaches to hold good.
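To illustrate just one of these techniques, here is a simple linear regression fitted by ordinary least squares with NumPy. The spend and sales figures are made up for the example:

```python
import numpy as np

# Hypothetical data: marketing spend (in $1000s) vs. resulting sales (units)
spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sales = np.array([12.1, 14.2, 15.9, 18.1, 20.0])

# Fit sales = intercept + slope * spend via least squares
A = np.column_stack([np.ones_like(spend), spend])
(intercept, slope), *_ = np.linalg.lstsq(A, sales, rcond=None)

# Use the fitted line to predict sales for a planned spend of $6,000
predicted = intercept + slope * 6.0
print(round(slope, 2), round(predicted, 1))
```

The fitted slope answers the business question directly: roughly how many extra units each additional $1,000 of spend is associated with.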
As the size of the data grows, the above-mentioned steps cannot be adopted for decision making as they are, since they might lead to erroneous results. So a few tweaks such as factor analysis, clustering, and CART, which constitute the essentials of data reduction, are employed, followed by the statistical effort.
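To give a flavour of data reduction, here is a minimal k-means clustering sketch in NumPy. The two well-separated toy groups are invented for the example, and a real pipeline would use a library implementation rather than this hand-rolled loop:

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Distance from every point to every centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Two toy customer groups, e.g. low-spend vs. high-spend behaviour
rng = np.random.default_rng(1)
group_a = rng.normal([1.0, 1.0], 0.2, size=(20, 2))
group_b = rng.normal([5.0, 5.0], 0.2, size=(20, 2))
points = np.vstack([group_a, group_b])

labels, centroids = kmeans(points, k=2)
print(labels)
```

After clustering, each segment can be analysed with the statistical tests above instead of treating the whole dataset as one homogeneous mass.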
Programming / Computer Skills:
Software and statistical packages have saved real time and effort over the years. Imagine doing a manual calculation for each and every observation until the result is obtained. The margin of error would also be high, and that is the last thing a business wants when making a data-driven decision.
Solutions have been available to enterprises since the late 2000s to help with decision making, but the thanks go to R, which currently dominates the open-source fraternity. Look at the following graph, which shows the percentage of questions asked on a major discussion forum (probably the world’s leading one) versus the years of observation.
The steep incline for R should ring a bell: it is the most sought-after statistical package for decision making. More importantly, it is open source, and the world knows its potential owing to the enormous support it receives from communities and contributors. Next to it, TensorFlow, Google’s open-source machine learning framework, was introduced only in late 2015, and that platform is also on the rise.
Some advantages of R:
- Free and open source – anyone can use it and, more importantly, adapt it.
- Large repositories – 4,800+ packages available from multiple repositories, specializing in topics like econometrics, data mining, spatial analysis, and bioinformatics.
- Great, voluminous cookbooks and support from the open-source community.
- Strong graphical capabilities – fully programmable graphics that can propel the decision-making process through reporting and interpreting data (descriptive, predictive, and prescriptive).
- Supports all major data types and files – data can be imported from CSV, SAS, and SPSS formats, or directly from Microsoft Excel, Microsoft Access, Oracle, and MySQL; graphical output can be saved in PDF, JPG, PNG, and SVG formats, and table output in LaTeX and HTML.
Having said all that, the early bird gets the worm. Making data-driven decisions matters more than the image it projects right now, and it has a direct, proven impact on the edge you hold over your direct competitors.
Needless to say, “The world’s most precious resource right now is not oil. It is DATA.” Make use of it before it becomes too volatile to use.