Introduction to Data Mining

Wednesday, November 1, 2017

Statistical Analysis and Data Mining; From Statistical Analysis To Artificial Intelligence, To Hell With Big Data

One shitty step at a time, through the quagmire of messy data
One fine Monday morning, a friend asked me to meet for coffee. I told him I avoid coffee because of gastric issues, so we went somewhere he could drink his black coffee with no sugar and I could drink beer before noon. There, we got to talking about Big Data, the hype and the grit, and all the startups that tried to harness it and failed miserably for various reasons. This discussion got me wondering how your regular Joe can develop the skill set necessary to work on a sufficiently large amount of data without burning anybody alive.
For a little context, Indonesians generally shy away from anything mathematical, preferring to work on the narrative-interpretive side of everything from economics to philosophy. I believe it has something to do with how awful mathematics education was during their formative years, with far more rote memorisation than building mathematical familiarity and intuition for problem solving.
In such a context, the statisticians are mostly not well equipped to develop their own statistical models and the like, relying mostly on cross-tabulations and descriptives. Most of them are familiar with regressions, but ask for anything harder than an OLS regression and you’ll lose some 60% of your potential employees. Only a pitiful few can work on Structural Equation Modelling, and don’t talk to them about Gradient Descent. Also, most people familiar with statistics are more likely to be found around the mosque than munging their data in cafés.
Also, with most of the ‘researchers’ in Indonesia being far more familiar with null hypothesis testing, ANOVA, and p-values than the alternatives, it comes as no surprise that there’s a big knowledge gap between them and the rest of the civilised world. Most of them do the procedural sort of statistical analysis rather than model building, not even k-Nearest Neighbour regression.
With such a cultural background, my friend asked me how to make it possible for Indonesian ‘statisticians’ to work on current trends such as machine learning, building data products, and even AI. This got me thinking, until I landed on the idea that I’m sharing here with everyone reading this.
Think about it, intuitively, like a set of incremental automations.
First, you start off like a moron, doing all your data exploration and analysis on a spreadsheet, making all the pivots, cross-tabulations, charts and descriptives, like a machine. This is what K-12 students used to do, if memory serves.
Then, you start using some fancy built-in functions in the data exploration and analysis app that your university provided you with: SPSS, Minitab, even Stata, MATLAB, Mathematica, and JMP. These are useful crutches to help you with your projects, but it’s going to be quite hard to automate them. Thankfully, staying awake is easy when you’re in your undergrad years.
Then, as you grow tired of staying awake doing all this button mashing with nobody getting punched, you start looking for a way to automate your analysis process. Maybe you can regex the text, automate the descriptives, and create the distribution functions from the command line. Who knows, maybe you can even turn your whole analysis process into a program of its own.
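To make that step a bit less hand-wavy, here is a minimal sketch in Python, using only the standard library. The function names (extract_numbers, describe) and the sample text are made up for illustration; the point is just that a regex plus a few lines of code can replace a lot of clicking.

```python
import re
import statistics

def extract_numbers(text):
    """Pull every integer or decimal out of a blob of messy text.

    Note: a hyphen directly before a digit is read as a minus sign,
    so dates like 2017-11 would need their own handling.
    """
    return [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?", text)]

def describe(values):
    """Return the usual descriptives in one dict, no button mashing required."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Sample standard deviation needs at least two observations.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }

# Hypothetical free-text input, the kind you would otherwise retype by hand.
raw = "Respondent ages: 23, 31, 27, 45 and 38."
summary = describe(extract_numbers(raw))
print(summary)
```

Wrap this in a loop over your files and the descriptives write themselves while you sleep.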
When you start automating the process, you start to realise that you need to tell your machine to treat a certain data frame differently from other data frames. Maybe some matrices require different treatment once certain conditions are met. You start to use a general-purpose programming language to tell your machine how to operate while you’re off drinking somewhere with a potential date.
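A rough idea of what "treat this one differently" can look like, again in plain Python. The dict-of-lists stand-in for a data frame, the function names, and the three-way rule (empty, numeric, categorical) are all assumptions for the sake of the example, not a recommendation.

```python
import statistics

def summarise(column):
    """Pick a treatment based on what the column looks like."""
    if not column:
        return {"kind": "empty"}
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in column):
        return {"kind": "numeric", "mean": statistics.mean(column)}
    # Anything else is treated as categorical: count the labels.
    counts = {}
    for v in column:
        counts[v] = counts.get(v, 0) + 1
    return {"kind": "categorical", "counts": counts}

def summarise_frame(frame):
    """Apply the right treatment to every column of a dict-of-lists 'data frame'."""
    return {name: summarise(col) for name, col in frame.items()}

# Hypothetical data: one numeric column, one categorical, one empty.
frame = {
    "age": [23, 31, 27],
    "city": ["Jakarta", "Bandung", "Jakarta"],
    "notes": [],
}
report = summarise_frame(frame)
print(report)
```

The conditions can get as fancy as you like; the machine does not care, and it does not need you in the room.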
With the analysis process automated, you start finding problems in your dataset, and you need a whole different set of algorithms to make sense of your messy data. One to cleanse the data, one to normalise the distributions, one to make sure everything is sorted the right way, and many special functions for your unique set of data. You start to grow familiar with the diversity of algorithms people fork from many sources, and see how well they work. Then you start to automate even this process, and set up a machine that could learn about the data and make consistent predictions.
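Strung together, that chain might look something like the sketch below. It is only one possible reading: "normalise" is taken to mean z-score scaling, the "machine that could learn" is a bare-bones k-Nearest Neighbour regression, and the data and function names are invented for the example.

```python
import statistics

def cleanse(rows):
    """Drop rows whose x or y cannot be read as a number."""
    clean = []
    for x, y in rows:
        try:
            clean.append((float(x), float(y)))
        except (TypeError, ValueError):
            continue  # missing or garbage value: skip the row
    return clean

def normalise(rows):
    """Rescale x to z-scores; also return mean and stdev so new points can be scaled.

    Assumes x is not constant, otherwise sigma is zero and this divides by zero.
    """
    xs = [x for x, _ in rows]
    mu, sigma = statistics.mean(xs), statistics.pstdev(xs)
    return [((x - mu) / sigma, y) for x, y in rows], mu, sigma

def knn_predict(train, x_new, k=3):
    """Predict y as the average y of the k training points nearest to x_new."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x_new))[:k]
    return statistics.mean(y for _, y in nearest)

# Hypothetical messy input: strings, a None, and an "n/a" mixed in.
raw = [(1, 2.0), ("2", 4.1), (None, 5.0), (3, 6.2), ("n/a", 1.0), (4, 7.9), (5, 10.0)]

rows = sorted(cleanse(raw))           # cleanse, then sort by x
train, mu, sigma = normalise(rows)    # normalise x
prediction = knn_predict(train, (3.5 - mu) / sigma, k=2)
print(prediction)
```

Swap any of the three steps for something smarter and the rest of the chain still runs, which is the whole appeal.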
As your collection of learning machines grows large, you start to connect each one of those machines into a network of learning agents that can exchange information and update their collective knowledge about whatever things you’re feeding them. Here’s where you have the start of an artificial intelligence for your analysis while you go out chasing shadows.
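For the curious, a toy version of that last step, and a very loose one. Each "agent" here only learns a running mean, and "exchanging information" just means pooling those summaries. The Agent class and exchange function are hypothetical names, and nothing about this is how a real multi-agent system has to work.

```python
class Agent:
    """A tiny 'learning machine': it learns a running mean from the data it sees."""

    def __init__(self, name):
        self.name = name
        self.n = 0
        self.mean = 0.0

    def learn(self, values):
        """Update the running mean one observation at a time."""
        for v in values:
            self.n += 1
            self.mean += (v - self.mean) / self.n

    def absorb(self, other_n, other_mean):
        """Merge a peer's summary (count and mean) into this agent's knowledge."""
        total = self.n + other_n
        if total:
            self.mean = (self.n * self.mean + other_n * other_mean) / total
            self.n = total

def exchange(agents):
    """One round of sharing: every agent ends up with the pooled knowledge.

    Snapshots are taken first so each agent merges what its peers knew
    before the round. Running this twice would double count the data.
    """
    snapshots = [(a.n, a.mean) for a in agents]
    for i, agent in enumerate(agents):
        for j, (n, mean) in enumerate(snapshots):
            if i != j:
                agent.absorb(n, mean)

# Three agents, each fed a different slice of hypothetical data.
a, b, c = Agent("a"), Agent("b"), Agent("c")
a.learn([1, 2, 3])
b.learn([10, 20])
c.learn([7])
exchange([a, b, c])
print(a.mean, b.mean, c.mean)
```

After one round every agent agrees on the same estimate, even though none of them saw all the data.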
But don’t mind the ramblings of a drunken dude. I didn’t write this for Nature or Science, so take this idea with some salt, lime, and a shot of tequila. Cheers.
By Kamen
Source: https://medium.com/@lurino/from-statistical-analysis-to-artificial-intelligence-to-hell-with-big-data-34f769cadb89
