Data Analysis Part 1: An Introduction to Data Analytics

Words like “Artificial Intelligence”, “Big Data”, “Data Analytics”, and “Machine Learning” are thrown around constantly these days, but what do they actually mean? Perhaps more importantly, why should you care?

This series is going to dive into the world of data analytics and provide information on these topics to help you understand the importance of data.

Let’s get started.

According to a 2018 Forbes article, 2.5 quintillion bytes of data were being created every single day, and 90% of the world’s data at that time had been created in the previous two years. The problem is that data does not necessarily translate to knowledge. Without context, data and information can be impossible to analyze and comprehend. Before we get too far, it’s important to define the differences between data, information, and knowledge.

    • Data – facts and statistics collected for reference 
    • Information – data that has been processed into a form that is meaningful to the recipient
    • Knowledge – what has been understood and evaluated from the information

Data Analytics

The translation from data to knowledge is vitally important. We use data to diagnose diseases, land planes, and make decisions. If data is not presented in a way that is readable and comprehensible, we cannot translate it into information or knowledge. When data is hard to explain with words or numbers alone, we need a way to present it so that it can be translated into information and knowledge accurately (we will discuss some methods to do this later).

We, as humans, do data analytics all the time. Whenever we compare prices of different kinds of spaghetti sauce at the store or when we shop at multiple stores to find the best value for our money, we are doing data analytics. 

Typically, when people think of data analytics or statistics, they think of measures like mean (or average), median, and mode. While these statistics can be useful, sometimes we need to find patterns, trends, and relationships between variables. For this, we use data visualizations. There are numerous types of graphs and charts that can be used to show how variables impact each other and we will dive into this in a future article.
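As a quick illustration of those three measures, here is a minimal Python sketch using the standard library’s statistics module. The prices are made-up spaghetti sauce prices, invented for this example and not real data.

```python
import statistics

# Hypothetical jar prices in dollars (invented for illustration)
prices = [2.49, 3.19, 2.49, 4.99, 3.79]

mean_price = statistics.mean(prices)      # arithmetic average
median_price = statistics.median(prices)  # middle value once the list is sorted
mode_price = statistics.mode(prices)      # most frequently occurring value

print(f"mean:   {mean_price:.2f}")
print(f"median: {median_price:.2f}")
print(f"mode:   {mode_price:.2f}")
```

With these made-up prices, the mean (3.39) is pulled upward by the single 4.99 jar, while the median (3.19) and mode (2.49) are not. Summary numbers like these describe one variable at a time, which is why they cannot show relationships between variables on their own.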

We can also train computers to recognize these patterns for us. When a computer learns to find patterns in data without being given explicit rules for how to find them, it is called machine learning. A machine learning model learns from existing data and uses what it learned to make an educated guess about new data. For example, I could train a machine learning model with this data on passengers of the Titanic, where a 1 in the Survived column means the passenger survived and a 0 means they did not:

Passenger ID   Sex      Age   Survived
1              Male     22    0
2              Female   38    1
3              Female   26    1
4              Female   35    1
5              Male     35    0
6              Male           0
7              Male     54    0
8              Male     2     0
9              Female   27    1
10             Female   14    1

If I created a model from this data set, it most likely would come to the conclusion that males did not survive and females did, so if I introduced a new row of data like this: 

Passenger ID   Sex      Age   Survived
11             Male     27    ?

The machine learning model would most likely tell me that this person did not survive the Titanic based on their sex. This is a very small data set, so it’s not ideal for training machine learning models, but on a larger scale, this can be extremely helpful.
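To make the idea concrete, here is a minimal Python sketch of the kind of rule a model might pick up from the ten passengers in the table. It is a simple majority-vote rule (for each value of Sex, predict the outcome seen most often in the training rows), not a real machine learning library. The names train_majority_rule and predict are made up for this example, and Age is ignored to keep the sketch short.

```python
from collections import Counter, defaultdict

# (sex, survived) pairs copied from the ten-passenger table; Age is ignored here
training_rows = [
    ("Male", 0), ("Female", 1), ("Female", 1), ("Female", 1), ("Male", 0),
    ("Male", 0), ("Male", 0), ("Male", 0), ("Female", 1), ("Female", 1),
]

def train_majority_rule(rows):
    """Learn, for each sex, the most common Survived value in the training rows."""
    outcomes = defaultdict(Counter)
    for sex, survived in rows:
        outcomes[sex][survived] += 1
    return {sex: counts.most_common(1)[0][0] for sex, counts in outcomes.items()}

def predict(model, sex):
    """Return the learned outcome for this sex (1 = survived, 0 = did not)."""
    return model[sex]

model = train_majority_rule(training_rows)
prediction = predict(model, "Male")  # passenger 11 from the example
print("Predicted Survived value for passenger 11:", prediction)
```

Because it has seen only ten passengers, this rule is far too blunt to trust. It simply shows how a prediction about a new row can come from patterns in existing data.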

There are many tools available for transforming, visualizing, cleaning, and classifying data. Some of these include R and RStudio, MATLAB, and Python. Throughout this series, I will be using R to walk through some examples, as it is one of the more common tools used among data scientists.

We will also cover the basics of data and data types, data visualizations, cleaning data, clustering and classifying data, and data regression.

Part 2: What is Data will be available next Thursday, February 13. In the meantime, follow our LinkedIn page for more great content!

 

Contact Us

If you like this content or have suggestions for other topics you’d like us to cover, please let us know. We’d love to hear from you. You can reach us at contact@query.ai. You can also follow us on LinkedIn and visit www.query.ai to subscribe to our updates.

 

Resources:

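Bernard Marr. (2018). How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read. Forbes. https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/#2fe71b3860ba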

Chaim Zins. (2007). Conceptual approaches for defining data, information, and knowledge. https://doi.org/10.1002/asi.20508

https://www.kaggle.com/c/titanic/data

https://www.youtube.com/playlist?list=PLzH6n4zXuckpfMu_4Ff8E7Z1behQks5ba

 

Posted by Alexis Vander Wilt

I am a senior Computer Science and Mathematics student with a passion for understanding data analysis and its impacts. I work as part of the team at Query.AI, where we are using Natural Language Processing to allow users to “talk to your data”, reducing the security learning curve and working to make security more accessible to all.