
How to Solve Real-World Problems Like a Pro Data Scientist

- 3 minutes read | By Harshit Singh

What is data science?

Data science is a multidisciplinary field that brings together concepts from computer science, statistics/machine learning, and data analysis to understand and extract insights from ever-increasing amounts of data.

Key tip to be great at data science:

Always be curious about the world and ask these two questions to yourself, often:
1. What can we learn from this data?
2. What actions can we take to find the trend in the data?

Types of data we get in the real world:

1) Structured Data: Data that has a predefined structure, e.g. tables, spreadsheets, or relational databases.

2) Unstructured Data: Data with no predefined structure. It comes in any size or form and cannot be easily stored in tables, e.g. blobs of text, images, audio.

3) Quantitative vs. Categorical Data: Quantitative data is numerical, e.g. height, weight. Categorical data can be labeled or divided into groups, e.g. race, sex, hair color.

4) Big Data: Massive datasets that contain greater variety, arriving in increasing volumes and with ever-higher velocity (the 3 Vs). They typically cannot fit in the memory of a single machine. Put simply, big data is larger, more complex data sets, especially from new data sources. These data sets are so voluminous that traditional data processing software just can’t manage them. But these massive volumes of data can be used to address business problems you wouldn’t have been able to tackle before. [Source: oracle.com]

5) Formats and Sources: We get data in different formats, most commonly CSV, XML, SQL, and JSON. Sources of this data include company/proprietary data, APIs, government, academia, web scraping, and crawling.
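To make the formats above concrete, here is a minimal sketch that reads the same tiny dataset from CSV and from JSON using only Python's standard library, then guesses whether a column is quantitative or categorical. The sample records and the helper names (load_csv, load_json, column_kind) are made up for illustration; on a real project you would more likely reach for a library such as pandas.

```python
import csv
import io
import json

# Hypothetical sample data, inlined so the sketch runs without any files.
CSV_TEXT = """name,height_cm,hair_color
Asha,162.5,black
Ben,180.0,brown
"""

JSON_TEXT = (
    '[{"name": "Asha", "height_cm": 162.5, "hair_color": "black"},'
    ' {"name": "Ben", "height_cm": 180.0, "hair_color": "brown"}]'
)


def load_csv(text):
    """Parse CSV text into a list of dicts; every value arrives as a string."""
    return list(csv.DictReader(io.StringIO(text)))


def load_json(text):
    """Parse JSON text; numbers keep their numeric type."""
    return json.loads(text)


def column_kind(records, column):
    """Label a column 'quantitative' if every value parses as a number,
    otherwise 'categorical'. A deliberately naive rule of thumb."""
    try:
        for row in records:
            float(row[column])
        return "quantitative"
    except (TypeError, ValueError):
        return "categorical"


csv_rows = load_csv(CSV_TEXT)
json_rows = load_json(JSON_TEXT)

print(column_kind(csv_rows, "height_cm"))
print(column_kind(csv_rows, "hair_color"))
```

Notice that CSV hands you everything as text, while JSON keeps numbers as numbers. That difference is one reason the cleaning step later in this post matters.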

7 WONDERS (STEPS) to solve a real-world problem:

STEP 1 - Problem Statement - A well-defined problem can save you a lot of time and trouble. A problem statement should state the problem faced by the firm, and it has to be SMART:
[ Specific, Measurable, Action to solve your problem, Relevant method, Time-bound ]

STEP 2 - Data Collection - Data collection is the process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes.
Key tip: Data collection can take time so don’t rush this step!

STEP 3 - Data Cleaning - Data cleaning is the process of turning raw data into a clean, analyzable data set. “Garbage in, garbage out.” Make sure garbage doesn’t get put in.
Key tip: A commonly cited rule of thumb is that around 80% of your time will be spent cleaning data. You cannot overlook this step!
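Here is a small sketch of what cleaning can look like in practice. It trims stray whitespace, normalizes casing, converts numeric text to numbers, marks missing values explicitly, and drops exact duplicates. The records and the clean helper are hypothetical, and real projects usually need many more rules than this.

```python
# Hypothetical raw records with typical problems: stray whitespace,
# inconsistent casing, a missing value, and an exact duplicate.
RAW = [
    {"name": " Asha ", "height_cm": "162.5", "hair_color": "Black"},
    {"name": "Ben", "height_cm": "", "hair_color": "brown"},
    {"name": " Asha ", "height_cm": "162.5", "hair_color": "Black"},
    {"name": "Chen", "height_cm": "175", "hair_color": " BROWN"},
]


def clean(records):
    """Return cleaned copies of the records; the input list is left untouched."""
    cleaned, seen = [], set()
    for row in records:
        name = row["name"].strip()
        hair = row["hair_color"].strip().lower()
        raw_height = row["height_cm"].strip()
        # Represent a missing measurement as None instead of guessing a value.
        height = float(raw_height) if raw_height else None
        key = (name, height, hair)
        if key in seen:  # skip rows that duplicate one already kept
            continue
        seen.add(key)
        cleaned.append({"name": name, "height_cm": height, "hair_color": hair})
    return cleaned


rows = clean(RAW)
for row in rows:
    print(row)
```

The sketch keeps the row with the missing height rather than deleting it. Whether to drop, keep, or fill in such rows depends on the problem you defined in Step 1.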

STEP 4 - Exploratory Data Analysis (EDA) - The idea of EDA is to understand the data better. To analyze the data, we compute summary statistics, create visualizations, test hypotheses, and detect outliers.
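As a rough illustration of two of these EDA tasks, the sketch below computes summary statistics and flags outliers with the common 1.5 × IQR rule, using Python's built-in statistics module. The height values are invented, and the IQR rule is only one of several ways to detect outliers.

```python
import statistics

# Hypothetical height measurements in centimeters; 250.0 is a planted outlier.
heights = [158.0, 162.5, 165.0, 168.0, 170.0, 172.5, 175.0, 180.0, 250.0]


def summarize(values):
    """Basic summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


print(summarize(heights))
print(iqr_outliers(heights))  # [250.0]
```

A flagged value is a prompt to investigate, not an automatic deletion: it could be a typo, a unit mix-up, or a genuine extreme.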

STEP 5 - Feature Engineering - Feature engineering is the process of using domain knowledge to create features, or input variables, that help machine learning algorithms perform better.
“Coming up with features is difficult, time-consuming, and requires expert knowledge. ‘Applied machine learning’ is feature engineering.” - Andrew Ng [Stanford]
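Two classic feature-engineering moves can be sketched in a few lines: deriving a new numeric feature from existing ones (here body mass index from height and weight) and one-hot encoding a categorical column (here hair color). The records and the add_bmi and one_hot helpers are hypothetical examples, not a recipe that fits every dataset.

```python
# Hypothetical records mixing quantitative and categorical columns.
people = [
    {"height_cm": 160.0, "weight_kg": 64.0, "hair_color": "black"},
    {"height_cm": 180.0, "weight_kg": 81.0, "hair_color": "brown"},
]


def add_bmi(row):
    """Domain knowledge: BMI = weight in kg / (height in m) squared."""
    height_m = row["height_cm"] / 100.0
    return {**row, "bmi": row["weight_kg"] / height_m ** 2}


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded


features = one_hot([add_bmi(p) for p in people], "hair_color")
for row in features:
    print(row)
```

The BMI column is the "domain knowledge" part: a model could in principle learn the same relationship from height and weight, but handing it over directly often makes the job easier.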

STEP 6 - Modeling - In this workflow, modeling means training machine learning models on your prepared data. (This is different from database data modeling, which is the process of producing a descriptive diagram of relationships between the types of information to be stored in a database.) Data preprocessing first enhances your data quality by organizing raw data in a suitable format. You then apply machine learning models to this data, tune their hyperparameters, and choose the best-performing model.
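To show what "tuning a hyperparameter and choosing the best model" can mean, here is a toy sketch: a k-nearest-neighbors classifier written from scratch, with k chosen by accuracy on a held-out validation set. The data points (including one deliberately mislabeled training point), the candidate values of k, and all function names are assumptions made for illustration. In practice you would typically use a library such as scikit-learn and also keep a separate test set for the final check.

```python
import math
from collections import Counter

# Hypothetical 2-D points in two classes. The last training point is
# deliberately mislabeled to imitate noise in real data.
TRAIN = [
    ((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.5), "a"), ((2.5, 2.5), "a"),
    ((6.0, 6.0), "b"), ((6.5, 7.0), "b"), ((7.0, 6.5), "b"), ((7.5, 7.5), "b"),
    ((2.0, 2.0), "b"),
]
VALID = [((1.2, 1.6), "a"), ((2.2, 2.0), "a"), ((6.8, 6.2), "b"), ((7.2, 7.0), "b")]


def predict(train, point, k):
    """Classify `point` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, data, k):
    """Fraction of points in `data` whose predicted label matches the true one."""
    hits = sum(predict(train, x, k) == y for x, y in data)
    return hits / len(data)


def tune_k(train, valid, candidates):
    """Score each candidate k on the validation set; on ties, the earliest
    candidate in the list wins."""
    scores = {k: accuracy(train, valid, k) for k in candidates}
    best_k = max(candidates, key=lambda k: scores[k])
    return best_k, scores


best_k, scores = tune_k(TRAIN, VALID, [1, 3, 5])
print(scores)  # {1: 0.75, 3: 1.0, 5: 1.0}
print(best_k)  # 3
```

With k = 1 the model trusts the single mislabeled neighbor and gets one validation point wrong. Letting more neighbors vote smooths that noise out, which is why a larger k wins here.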

STEP 7 - Communication - Finally, the last and most important step is to present your conclusions. Use a storytelling approach, a formal report, or even a blog post to showcase them.

Next step

A project tutorial... Coming soon


Knowledge is power. Knowledge shared is power multiplied.


Work done by Harshanz for iamdata