Big Data

Introduction to Data Mining

Wednesday, November 8, 2017

This is not an easy question because there is no common agreement on what “Data Mining” means. But I am going to say that I disagree with the answer from Wikipedia that Yuvraj Singla points to. I don’t think saying that machine learning focuses on prediction is accurate at all, although I mostly agree with the definition of Data Mining as focusing on the discovery of properties of the data.
So, let’s start with that: Data Mining is a cross-disciplinary field that focuses on discovering properties of data sets. (Forget about it being the analysis step of “knowledge discovery in databases” (KDD); that was maybe true years ago, but it is not anymore.)
There are different approaches to discovering properties of data sets. Machine Learning is one of them. Another is simply looking at the data sets using visualization techniques or Topological Data Analysis.
On the other hand, Machine Learning is a sub-field of data science that focuses on designing algorithms that can learn from and make predictions on data. Machine learning includes Supervised Learning and Unsupervised Learning methods. Unsupervised methods actually start off from unlabeled data sets, so, in a way, they are directly related to finding out unknown properties in them (e.g. clusters or rules).
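To make that “discovering unknown properties” point concrete, here is a minimal sketch (my own illustration, not part of the original answer) using Python and scikit-learn: a clustering algorithm is handed unlabeled points and finds the grouping on its own.

```python
# Minimal sketch: unsupervised learning "discovering" structure in unlabeled data.
# Requires scikit-learn; the dataset is synthetic, purely for illustration.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# 300 unlabeled points that happen to form three groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# K-means has no labels to learn from; it discovers the grouping on its own.
kmeans = KMeans(n_clusters=3, random_state=42).fit(X)

print(kmeans.cluster_centers_)   # the three cluster centers it found
print(kmeans.labels_[:10])       # cluster assignment for the first ten points
```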
It is clear then that machine learning can be used for data mining. However, data mining can use other techniques besides or on top of machine learning.
Btw, to make things even more complicated, we now have a new term, Data Science, that is competing for attention, especially with Data Mining and KDD. Even the SIGKDD group at ACM is slowly moving towards using Data Science. On their website, they now describe themselves as “The community for data mining, data science and analytics”[1]. My bet is that KDD will disappear as a term pretty soon and data mining will simply merge into data science.

Originally published at www.quora.com.

Sunday, November 5, 2017


Data mining can be technically defined as the automated extraction of hidden information from large databases for predictive analysis. In other words, it is the retrieval of useful information from large masses of data, presented in an analyzed form to support specific decision-making.

Data mining requires the use of mathematical algorithms and statistical techniques integrated with software tools. The final product is an easy-to-use software package that can be used even by non-mathematicians to effectively analyze the data they have. Data Mining is used in several applications like market research, consumer behavior, direct marketing, bioinformatics, genetics, text analysis, fraud detection, web site personalization, e-commerce, healthcare, customer relationship management, financial services and telecommunications.

Business intelligence data mining is used in market research, industry research, and for competitor analysis. It has applications in major industries like direct marketing, e-commerce, customer relationship management, healthcare, the oil and gas industry, scientific tests, genetics, telecommunications, financial services and utilities. BI uses various technologies like data mining, scorecarding, data warehouses, text mining, decision support systems, executive information systems, management information systems and geographic information systems for analyzing useful information for business decision making.
Business intelligence is a broader arena of decision-making that uses data mining as one of its tools. In fact, the use of data mining in BI makes the data more relevant in application. There are several kinds of data mining: text mining, web mining, social network data mining, relational database mining, pictorial data mining, audio data mining and video data mining, all of which are used in business intelligence applications.
Some data mining tools used in BI are: decision trees, information gain, probability, probability density functions, Gaussians, maximum likelihood estimation, Gaussian Bayes classification, cross-validation, neural networks, instance-based (case-based, memory-based, non-parametric) learning, regression algorithms, Bayesian networks, Gaussian mixture models, K-means and hierarchical clustering, Markov models, and so on.
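As a rough illustration of two of the tools named above (decision trees and cross-validation), here is a minimal, hypothetical sketch in Python with scikit-learn; a real BI deployment would of course run on its own data and tooling.

```python
# Minimal sketch: a decision tree classifier evaluated with cross-validation.
# Uses scikit-learn and its bundled iris dataset, purely for illustration.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)

# 5-fold cross-validation: train on 4/5 of the data, test on the rest, repeat.
scores = cross_val_score(tree, X, y, cv=5)
print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
```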

source : http://www.articlesfactory.com/articles/computers/business-intelligence-data-mining.html

Thursday, November 2, 2017

When setting up an analytics system for a company or project, there is often the question of where the data should live. There is no one-size-fits-all solution here, as your budget, the amount of data you have, and the performance you want will determine the feasible candidates. This post will go over the categories of databases available, and hopefully give you a rough map of the choices.

Your application database

The simplest option by far is to just use whatever is currently storing your application data.
Pros:
  • It’s already up
  • You don’t need to manage another database server
  • You don’t need to deal with moving data around
  • You don’t need to transform the data
Cons:
  • Analytics workloads might slow down your application
  • Scaling will become more difficult with two very different modes of use
  • The data schema is often difficult to use for analytics
This is usually an intermediate stage for “real” applications, but for small internal applications, pre-launch products, or prototypes it is a viable choice. Once you get ready to launch (for consumer applications), you’ll typically want to migrate off this to one of the more scalable choices below.

A read-replica of your application database

If your main database supports read replicas, the next-laziest thing you can do is create a read replica of your main database, perhaps set up another namespace with your 3rd-party data or events, and call it a win.
Pros:
  • You don’t need to manage another kind of database
  • You don’t need to explicitly move data around
  • You can scale analytics and transactional load independently
  • You don’t need to transform the data
Cons:
  • Database software that is optimized for transactional loads is usually suboptimal for analytics purposes
  • The data schema is often difficult to use for analytics
  • You need to manage another database server
Typically, once you start getting serious about analytics and your scale (both in data volume and complexity of analytical queries) increases, there are significant wins in performance that drive the move to a dedicated analytics database or “data warehouse” as it is often called.

A separate data warehouse running your “normal database”

If you don’t have the scale that requires you to run a database on many machines, you can get away with using the same database software you use for your application as a dedicated analytics data warehouse. This will often have different settings, be tuned differently, and will involve actively reshaping the way your data is laid out in tables to make analytics queries faster and easier to write (a small sketch of this reshaping follows the pros and cons below).
Pros:
  • You don’t need to manage another kind of database
  • You can scale analytics and transactional load independently
  • You can optimize the data model/schema to the analytical work you wish to perform
Cons:
  • You need to manage another database server
  • Database software that is optimized for transactional loads is usually suboptimal for analytics purposes
  • You need to move data around
  • You will typically need to transform data into a more useful format
  • These databases are typically limited to a single node, which limits scalability
Typically, you can get away with this until around 10–100M rows depending on your data and desired analytics workloads. Once you reach a point where common queries are taking minutes or longer, you should evaluate options with more horsepower.
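As a toy illustration of that reshaping step, the sketch below flattens a normalized users/orders pair into one wide, analytics-friendly table. It uses sqlite3 purely so the example is self-contained; the table and column names are made up.

```python
# Illustrative sketch of the "reshape for analytics" step: flatten a normalized
# users/orders pair into one wide table that is easy to aggregate.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO users  VALUES (1, 'US'), (2, 'ID');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 12.5);

    -- The "warehouse" table: denormalized, one row per order with the user
    -- attributes copied in, so analysts never have to join.
    CREATE TABLE fact_orders AS
    SELECT o.id AS order_id, o.amount, u.country
    FROM orders o JOIN users u ON u.id = o.user_id;
""")

for row in conn.execute("SELECT country, SUM(amount) FROM fact_orders GROUP BY country"):
    print(row)
```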

SQL based analytics databases

The main distinctions between “normal” database software and databases intended for heavy analytics workloads are parallelization and data format.
Transactional workloads typically have many small reads, writes and updates. They typically can live on a single machine for much longer than analytics workloads for a given company. Analytics workloads on the other hand, have less frequent read operations that touch much larger amounts of data. For example, a common transactional query is to check a given user’s last login time to display it for them. A common analytical query is counting all user logins over the last 3 months to create a graph.
The other main difference is that transactional database software typically stores data in row format. For example, let’s say we have a table with user records. A user record includes their name, address, last login time and date of birth. A typical transactional database will store all four of those fields in one unit, which lets it retrieve (or update) that record very quickly. Conversely, a typical analytical database will store all of the names together, all of the last login times together, and so on. This lets operations like “what is the average age of our userbase” ignore all of the data in the database that is not the user’s date of birth. By reducing the amount of data the database needs to scan, it dramatically improves performance. However, this tradeoff means that analytics databases are typically much worse at transactional-style queries than databases specialized for that purpose.
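Here is a toy sketch of the two layouts in plain Python. It is only meant to show why a columnar engine can answer “average age of our userbase” while touching a single column; it is not how any particular database lays out bytes on disk.

```python
# Row layout: each record keeps all of its fields together.
rows = [
    {"name": "Ada",  "address": "12 Elm St", "last_login": "2017-11-01", "birth_year": 1990},
    {"name": "Bert", "address": "9 Oak Ave",  "last_login": "2017-10-30", "birth_year": 1982},
]

# Column layout: each field is stored contiguously across all records.
columns = {
    "name":       ["Ada", "Bert"],
    "address":    ["12 Elm St", "9 Oak Ave"],
    "last_login": ["2017-11-01", "2017-10-30"],
    "birth_year": [1990, 1982],
}

# Row store: every full record has to be walked just to read one field.
avg_row = sum(r["birth_year"] for r in rows) / len(rows)

# Column store: only the birth_year column is ever touched.
avg_col = sum(columns["birth_year"]) / len(columns["birth_year"])

assert avg_row == avg_col  # same answer, very different amounts of data scanned
```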

Open source SQL based analytics databases

There are a number of open source options for analytics databases. These are typically based on PostgreSQL, a popular general-purpose database, and they typically started off closed source, built by a company and later open sourced.
Pros:
  • Free + Open Source
  • Uses a PostgreSQL dialect, which is widely understood
  • Scalable
Cons:
  • You need to manage another database server(s)
  • You need to host it yourself
  • More complicated to tune than a single server
  • Need to think about data partitioning

Hosted SQL based analytics database options

Two of the main Infrastructure-as-a-Service providers (Amazon and Microsoft) offer fully managed Data Warehouses. Google offers BigQuery, which is technically not SQL but has a similar query language. These are often a great deal if you don’t have much database administration expertise in-house. Typically, companies will use the database offering from their main IaaS vendor, though there are increasingly cases of companies using the data warehouse offering without moving any of their other computing to that cloud provider.

Redshift

Given Amazon’s dominant position in the cloud/IaaS space, it’s not surprising that many people are using its hosted data warehouse. It’s not as performant as some of the other options, and it is generally more of a pain to get data into (a minimal loading sketch follows the list below). However, if you’re already on AWS, it’s generally the cheapest and easiest option overall.
Pros:
  • Fully managed
  • Pay as you go
  • Uses a PostgreSQL dialect, which is widely understood
  • Scalable
Cons:
  • Getting data in is fairly complicated
  • Need to think about data partitioning
  • You need to manually scale up
  • Network I/O cost if you don’t use AWS
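As a hedged sketch of the usual loading pattern: data is staged as files in S3 and pulled in with a COPY statement issued over Redshift’s PostgreSQL-compatible connection. The cluster endpoint, credentials, bucket, and IAM role below are placeholders, not working values.

```python
# Hedged sketch of a common Redshift loading pattern: stage a file in S3, then
# issue a COPY from the cluster. Requires psycopg2 (Redshift speaks the
# PostgreSQL wire protocol). All connection details below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="loader", password="...",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY events
        FROM 's3://example-bucket/events/2017-11-01.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        CSV IGNOREHEADER 1;
    """)
```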

Azure

Pros:
  • Fully Managed
  • Pay as you go
  • Uses a SQL Server dialect of SQL
  • Scalable
Cons:
  • You need to manually scale up
  • Network I/O cost if you don’t use Azure

Proprietary analytics databases

There are a variety of sophisticated (and expensive) database servers optimized for analytical workloads. If you’re reading this guide, chances are you’re not in the market for a 6–7 figure engagement with a database vendor.
Pros:
  • Strong services component if you need help (and can pay)
  • Long operational histories
  • Experience with complicated deployments
Cons:
  • You need to manage another database server(s)
  • Expensive
  • Typically very complicated to set up and administer
  • Did we mention EXPENSIVE?

BigQuery

For a while, BigQuery (known internally and in the research literature as Dremel) was one of Google’s semi-secret weapons. Externally, it powered Google Analytics. These days, if you pay up for Google Analytics Premium, you can see all the raw data via BigQuery. It uses its own SQL-inspired dialect, and is absurdly fast. Rather than tying you to paying for hosts running software, it uses a fleet of machines that you don’t need to care about and charges you by your data size and how much cpu/io your queries use (a small query sketch follows the pros and cons below).
Pros:
  • FAST
  • Scales transparently
  • Pay by compute and storage you use vs hardware-hour
Cons:
  • Getting data in is fairly complicated
  • Less predictable pricing
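For a sense of what querying it looks like, here is a hedged sketch using the google-cloud-bigquery Python client (my own example, not from the original post). The public Shakespeare sample dataset is real; the project ID is a placeholder, and queries are billed by the data they scan.

```python
# Hedged sketch: running a query against BigQuery with the official Python
# client. "your-project-id" is a placeholder.
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")

query = """
    SELECT corpus, SUM(word_count) AS total_words
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus
    ORDER BY total_words DESC
    LIMIT 5
"""
for row in client.query(query).result():
    print(row.corpus, row.total_words)
```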

Hadoop

In many ways, Hadoop is responsible for the buzz around Big Data. Originally built to power an open source web crawler, it was heavily adopted by Yahoo after they hired its author, Doug Cutting, in 2006. While many big database companies can claim to have run large database clusters before Hadoop, it nevertheless caused a sea change in how companies thought about large amounts of data. Being free, relatively low level and flexible, it allowed many companies that were not ready to slap down tens of millions of dollars to experiment with their growing datasets. Consumer internet companies, home to both huge data volumes and an affinity for open source solutions, dove in head first.
These days Hadoop has spawned an entire ecosystem around itself. While originally billed as NoSQL, the majority of applications now center around the various SQL-on-Hadoop projects.

Low level options

While initially used as a data warehouse language, MapReduce (and its probable successor, Spark) has for the most part been phased out of end-user analytics use. These tools still play large parts in data transformation, machine learning, and other data infrastructure and engineering work (a minimal example follows the list below).
Pros:
  • Can scale to massive data sets
  • Very flexible
Cons:
  • Typically slow
  • Can be complicated to operate
  • Languages are very low level
  • Generally not much tool support
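For flavor, here is a minimal PySpark sketch of the kind of low-level job described above: a word count, the “hello world” of MapReduce-style processing. It assumes a working Spark installation; the HDFS path is hypothetical.

```python
# Minimal PySpark sketch: the classic word count. Assumes Spark is installed;
# the input path is a made-up HDFS location.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///data/crawl/*.txt")
counts = (lines.flatMap(lambda line: line.split())     # map: emit each word
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))        # reduce: sum per word

for word, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, n)

spark.stop()
```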

SQL-on-Hadoop

Starting with the use of Hive at Facebook, SQL has become the primary way to analyze data on Hadoop clusters for analytics or business intelligence workloads. It allows companies to use standard tools and a lingua franca of analytics on massive amounts of data.
Pros:
  • Can scale to massive data sets
  • Use common SQL dialects
  • Decent tool support
  • Can be fast
Cons:
  • Languages are very low level
  • Requires running a Hadoop cluster
While Hive was the original SQL-on-Hadoop project, it has since been eclipsed by newer engines. In 2016, there’s little to no reason to start using Hive, but it is still in widespread use in some organizations.
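As a hedged sketch of what querying one of these engines from Python can look like, the example below uses PyHive against a HiveServer2 endpoint; the host, table, and column names are placeholders, and other engines (Presto, Impala, and so on) have their own clients.

```python
# Hedged sketch: issuing a SQL-on-Hadoop query against a Hive server from
# Python via PyHive. All names below are placeholders.
from pyhive import hive

conn = hive.Connection(host="hive.example.internal", port=10000, username="analyst")
cursor = conn.cursor()
cursor.execute("""
    SELECT event_date, COUNT(*) AS logins
    FROM user_logins
    WHERE event_date >= '2017-08-01'
    GROUP BY event_date
""")
for event_date, logins in cursor.fetchall():
    print(event_date, logins)
```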

Elasticsearch

Elasticsearch is more commonly used to power site search or log archival. However, its query language has also been used to power analytics applications. It shines at scale, and offers exceptional performance at the cost of a very significant management and administration footprint.
Pros:
  • FAST
  • Strong ability to search your data
  • If you use Elasticsearch for your application, you can also use it to power analytics
Cons:
  • Slow ingestion
  • Not very efficient in terms of disk storage
  • Difficult query language that is optimized for search, not analytics
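A hedged sketch of the analytics use mentioned above: an aggregation (“logins per month”) expressed in Elasticsearch’s query DSL via the official Python client, roughly as the client worked around the time of this post. The index and field names are made up.

```python
# Hedged sketch: an analytics-style aggregation ("logins per month") using
# Elasticsearch's aggregation DSL. Index and field names are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

resp = es.search(index="logins", body={
    "size": 0,                      # we only want the aggregation, not hits
    "aggs": {
        "logins_per_month": {
            "date_histogram": {"field": "timestamp", "interval": "month"}
        }
    }
})

for bucket in resp["aggregations"]["logins_per_month"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])
```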

Crate.io

Crate offers a SQL layer on top of Elasticsearch. This combines the performance of Elasticsearch with the more widespread understanding of SQL. The actual SQL dialect is similar to MySQL or PostgreSQL but will require a bit of study by your analysts.

In-memory databases

There’s a long-running saying that “you don’t have a big data problem if your dataset fits into RAM.” If you can fit your database into RAM, and are willing to live with a less flexible set of queries than a full SQL system offers, performance with these database servers is lightning fast. RAM is getting cheaper every year, and more and more companies are taking this route. This approach is often used in combination with a big SQL or Hadoop cluster for batch or more complicated computation (a small illustration follows the cons below).
Pros:
  • FAST
  • FAST
  • FAST
Cons:
  • You need enough RAM to fit all your data
  • Have their own query languages you’ll need to learn
  • Can be tricky to administer
  • Typically don’t offer good ways to query across multiple tables
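As a small, hedged illustration of the trade-off (Redis is not necessarily what this section has in mind, but it is a widely used in-memory store): the data lives in RAM, reads are very fast, and you work in the store’s own command vocabulary rather than SQL. It assumes a local Redis server and the redis-py package; the key names are made up.

```python
# Hedged illustration of the in-memory trade-off with Redis: fast, RAM-resident
# counters queried through Redis commands rather than SQL. Key names are made up.
import redis

r = redis.Redis(host="localhost", port=6379)

# Keep a per-country login counter entirely in memory.
r.hincrby("logins_by_country", "US", 1)
r.hincrby("logins_by_country", "ID", 1)
r.hincrby("logins_by_country", "US", 1)

# The "query" is just a hash read; anything fancier (joins, ad hoc grouping)
# is on you to build.
print(r.hgetall("logins_by_country"))
```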

SAP Hana

SAP offers an in-memory database, HANA, which has a fair bit of traction in larger enterprises. Like everything else from SAP (or really any other big Enterprise Database company), it’s expensive and probably going to require a small army of consultants to get going.
Originally published on Metabase
Source : https://medium.com/@metabase/which-data-warehouse-should-you-use-a49f1f126ed3

Wednesday, November 1, 2017

From Statistical Analysis To Artificial Intelligence, To Hell With Big Data

One shitty step at a time, through the quagmire of messy data
One fine Monday morning, a friend asked me to meet for coffee. I told him I avoid coffee because of gastric issues, so we went somewhere where he could drink his black coffee with no sugar and I could drink beer before noon. There, we got to talking about Big Data, the hype and the grit, and all the startups that tried to harness it and failed miserably for various reasons. The discussion got me wondering how your regular Joe can develop the skill set necessary to work on a sufficiently large amount of data without burning anybody alive.
For a little context, Indonesians are generally avoidant of anything mathematical, preferring to work on the narrative-interpretive side of everything from economics to philosophy. I believe it has something to do with how awful mathematics education was during their formative years, with lots of rote memorisation rather than the building of mathematical familiarity and intuition for problem solving.
In such a context, the statisticians are mostly not well equipped to develop their own statistical models and the like, relying mostly on cross-tabulations and descriptives. Most of them are familiar with regression, but ask for anything harder than an OLS regression and you’ll lose some 60% of your potential employees. Only a pitiful few can work on Structural Equation Modelling, and don’t even talk to them about Gradient Descent. Also, most people familiar with statistics are more likely to be found around the mosque than munging their data in cafés.
Also, with most of the ‘researchers’ in Indonesia being far more familiar with null hypothesis testing, ANOVA, and p-values than with the alternatives, it comes as no surprise that there’s a big knowledge gap between them and the rest of the civilised world. Most of them work on the procedural sort of statistical analysis instead of model building, or even k-Nearest Neighbour regression.
With such a cultural background, my friend asked me how to make it possible for Indonesian ‘statisticians’ to work on current trends such as machine learning, building data products, and even AI. This got me thinking, until I arrived at the idea I’m sharing with everyone reading this.
Think about it, intuitively, like a set of incremental automations.
First, you start off like a moron, doing all your data exploration and analysis on a spreadsheet, making all the pivots, cross-tabulations, charts and descriptives, like a machine. This is what K-12 students used to do, if memory serves.
Then, you start using some fancy built-in functions in the data exploration and analysis app that your university provided you with. From SPSS, Minitab, even STATA, Matlab, Mathematica, and JMP. These are useful crutches to help you with your projects, but it’s going to be quite hard to automate them. Thankfully, staying awake is easy when you’re in your undergrad years.
Then, as you grow tired of staying awake doing all this button mashing with nobody getting punched, you start to look for a way to automate your analysis process. Maybe you can regex the texts, automate the descriptives, create the distribution functions from the command line. Who knows, maybe you can even turn your whole analysis process into a program of its own.
When you start automating the process, you begin to realise that you need to tell your machine to treat a certain data frame differently from other data frames. Maybe some matrices require different treatments, given that sufficient conditions are met. You start to use a general-purpose programming language to tell your machine how to operate while you’re off drinking somewhere with a potential date.
With the analysis process automated, you can start finding problems in your dataset, and you need a whole different set of algorithms to make sense of your messy data. One to cleanse the data, one to normalise the distributions, one to make sure everything is sorted the right way, and many special functions for your unique set of data. You start to grow familiar with the diversity of algorithms people fork from many sources, and see how well they work. Then you start to automate even this process, and set up a machine that can learn about the data and make consistent predictions.
As your collection of learning machines grows large, you start to connect each of those machines into a network of learning agents that can exchange information and update their collective knowledge about whatever you’re feeding them. Here’s where you have the start of an artificial intelligence for your analysis while you go out chasing shadows.
But don’t mind the ramblings of a drunken dude. Didn’t write this for Nature or Science magazine, take this idea with some salt, lime, and a shot of tequila. Cheers.
By Kamen
Source : https://medium.com/@lurino/from-statistical-analysis-to-artificial-intelligence-to-hell-with-big-data-34f769cadb89