Learn data science from scratch
What's the best roadmap to learn data science with no prior experience?
Project Plan
{{whyLabel}}: Linear algebra is the language of data; matrices and vectors are how computers store and process datasets.
{{howLabel}}:
- Learn vector operations (addition, dot product) and matrix multiplication.
- Understand Transpose, Inverse, and Determinants.
- Study Eigenvalues and Eigenvectors as they are critical for Principal Component Analysis (PCA).
{{doneWhenLabel}}: You can manually multiply two 2x2 matrices and explain what a vector represents in a coordinate system.
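A quick way to check this milestone is to verify your hand calculation against NumPy (which ships with Anaconda); the values below are illustrative:

```python
import numpy as np

# Two 2x2 matrices
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# Matrix product: each entry is the dot product of a row of A with a column of B
C = A @ B
print(C)  # [[19 22]
          #  [43 50]]
```

If your hand-worked result matches `C`, you have the mechanics down.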
{{whyLabel}}: Calculus, specifically derivatives, is the engine behind 'Gradient Descent'—the method models use to learn from errors.
{{howLabel}}:
- Focus on Derivatives and the Power Rule.
- Understand the Chain Rule for nested functions.
- Learn Partial Derivatives to handle functions with multiple variables (features).
{{doneWhenLabel}}: You can explain how a derivative helps find the minimum of a cost function.
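The idea behind Gradient Descent can be seen in a few lines: repeatedly step opposite the derivative until you reach the minimum. This is a minimal sketch with a made-up cost function f(x) = (x - 3)^2, whose derivative is 2(x - 3):

```python
# Minimize f(x) = (x - 3)**2 using its derivative f'(x) = 2 * (x - 3)
x = 0.0          # starting guess
lr = 0.1         # learning rate (step size)
for _ in range(100):
    x -= lr * 2 * (x - 3)   # step downhill, against the gradient
print(round(x, 4))  # 3.0 — the minimum of the cost function
```

Real models do exactly this, just with partial derivatives over many parameters at once.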
{{whyLabel}}: You must be able to summarize large datasets into meaningful numbers like averages and spreads.
{{howLabel}}:
- Practice calculating Mean, Median, and Mode.
- Understand Variance and Standard Deviation to measure data 'noise'.
- Learn about Percentiles and Quartiles for identifying outliers.
{{doneWhenLabel}}: You can describe a dataset's distribution using its five-number summary (Min, Q1, Median, Q3, Max).
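NumPy can produce the five-number summary directly; the dataset below is illustrative:

```python
import numpy as np

data = np.array([2, 4, 4, 5, 7, 9, 12, 15, 21])

# Five-number summary: Min, Q1, Median, Q3, Max
summary = {
    "min": float(np.min(data)),
    "Q1": float(np.percentile(data, 25)),
    "median": float(np.median(data)),
    "Q3": float(np.percentile(data, 75)),
    "max": float(np.max(data)),
}
print(summary)  # {'min': 2.0, 'Q1': 4.0, 'median': 7.0, 'Q3': 12.0, 'max': 21.0}
```

The interquartile range (Q3 - Q1) is what box plots use to flag outliers.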
{{whyLabel}}: Data science is about predicting the likelihood of events; Bayes' Theorem is the foundation of many classification algorithms.
{{howLabel}}:
- Study Conditional Probability (P(A|B)).
- Solve problems using Bayes' Theorem (e.g., medical test accuracy).
- Learn about Probability Distributions (Normal, Binomial, Poisson).
{{doneWhenLabel}}: You can calculate the probability of an event given new evidence using the Bayesian formula.
{{whyLabel}}: A standardized environment ensures your code runs reliably and manages all your data science libraries.
{{howLabel}}:
- Download and install the 'Anaconda Distribution' (Individual Edition).
- Install 'Visual Studio Code' and the 'Python' extension.
- Create your first virtual environment using the command: conda create -n ds_env python=3.11
{{doneWhenLabel}}: You have a working Python environment and can launch a Jupyter Notebook from VS Code.
{{whyLabel}}: You need to automate tasks and handle data structures before using advanced libraries.
{{howLabel}}:
- Practice Lists, Dictionaries, and Sets.
- Write 'For' loops and 'If-Else' logic.
- Learn List Comprehensions for concise data processing.
- Define Functions with arguments and return values.
{{doneWhenLabel}}: You can write a script that processes a list of numbers to return only the even ones squared.
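The milestone script fits in one list comprehension; here is one possible sketch:

```python
def even_squares(numbers):
    """Return the square of every even number in the list."""
    return [n ** 2 for n in numbers if n % 2 == 0]

print(even_squares([1, 2, 3, 4, 5, 6]))  # [4, 16, 36]
```

The comprehension combines the filter (`if n % 2 == 0`) and the transformation (`n ** 2`) in a single readable line, which is exactly the "concise data processing" the bullet above refers to.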
{{whyLabel}}: Most company data lives in databases; SQL is the standard way to retrieve it.
{{howLabel}}:
- Learn SELECT, FROM, and WHERE clauses.
- Master JOINs (Inner, Left, Right) to combine tables.
- Use GROUP BY and Aggregate functions (SUM, AVG, COUNT).
- Practice on 'SQLZoo' or 'Mode Analytics' free tutorials.
{{doneWhenLabel}}: You can write a query to find the top 5 customers by total spend from two related tables.
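You can practice this milestone query without installing a database server by using Python's built-in sqlite3 module; the tables and values below are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Alan');
INSERT INTO orders VALUES (1, 1, 120.0), (2, 2, 80.0), (3, 1, 50.0), (4, 3, 200.0);
""")

# Top 5 customers by total spend, joining the two related tables
rows = conn.execute("""
SELECT c.name, SUM(o.amount) AS total_spend
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.name
ORDER BY total_spend DESC
LIMIT 5;
""").fetchall()
print(rows)  # [('Alan', 200.0), ('Ada', 170.0), ('Grace', 80.0)]
```

The query exercises all three bullets at once: JOIN, GROUP BY with an aggregate, and filtering the result down to the top 5.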
{{whyLabel}}: Pandas is the industry standard for tabular data; NumPy handles the underlying numerical math.
{{howLabel}}:
- Learn to load CSV files into a 'DataFrame'.
- Practice filtering, sorting, and grouping data.
- Handle missing values using .fillna() or .dropna().
- Use NumPy arrays for fast mathematical operations on columns.
{{doneWhenLabel}}: You can load a dataset and calculate the average value of a column grouped by a specific category.
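The groupby-average milestone looks like this in pandas; the tiny in-memory DataFrame stands in for a loaded CSV:

```python
import pandas as pd

# Stand-in for pd.read_csv("your_file.csv")
df = pd.DataFrame({
    "category": ["A", "A", "B", "B"],
    "value": [10, 20, 30, 50],
})

# Average value of a column, grouped by a category
means = df.groupby("category")["value"].mean()
print(means)  # A: 15.0, B: 40.0
```

The same one-liner pattern (`df.groupby(...)[col].agg(...)`) covers most day-to-day aggregation work.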
{{whyLabel}}: Visuals reveal patterns (like correlations) that raw numbers cannot show.
{{howLabel}}:
- Build Line charts for trends and Bar charts for comparisons.
- Use Histograms to see the distribution of a single variable.
- Create Scatter plots to find relationships between two variables.
- Use Seaborn's heatmap() to visualize correlation matrices.
{{doneWhenLabel}}: You have a notebook showing a correlation between two variables in a sample dataset.
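The correlation matrix that a heatmap visualizes is just a pandas call; the columns below are made up, with `y` constructed to correlate perfectly with `x` (the Seaborn plotting step is left as a comment since it needs a display):

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],   # exactly 2*x, so correlation is 1.0
    "z": [5, 3, 8, 1, 9],
})

# Pairwise Pearson correlations between all numeric columns
corr = df.corr()
print(corr.loc["x", "y"])  # 1.0

# To visualize: import seaborn as sns; sns.heatmap(corr, annot=True)
```

Reading the matrix first, then the heatmap, trains you to connect the picture back to the numbers.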
{{whyLabel}}: EDA is the 'detective work' where you find errors, outliers, and interesting trends before modeling.
{{howLabel}}:
- Check for data types and null values using .info() and .describe().
- Identify outliers using Box plots.
- Check for class imbalance (e.g., are 99% of transactions 'not fraud'?).
- Formulate 3 hypotheses about the data and test them with visuals.
{{doneWhenLabel}}: You have a completed EDA report on a dataset (e.g., Titanic or House Prices).
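A first EDA pass on any dataset follows the same few calls; this sketch uses a tiny invented Titanic-style frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [22, 38, np.nan, 35],        # one missing value, on purpose
    "fare": [7.25, 71.28, 8.05, 53.1],
    "survived": [0, 1, 1, 1],
})

print(df.isnull().sum())     # null count per column — 'age' has 1 missing
print(df.describe())         # mean, std, quartiles per numeric column

# Class balance check (the imbalance question from the bullet above):
print(df["survived"].value_counts(normalize=True))
```

On a real dataset these three outputs usually generate your first hypotheses: which columns need imputation, which have outliers, and whether the target is imbalanced.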
{{whyLabel}}: Regression is the foundation of predictive modeling, used for forecasting continuous values like prices.
{{howLabel}}:
- Split data into 'Training' and 'Testing' sets using train_test_split.
- Train a model using Scikit-Learn's LinearRegression().
- Evaluate using Mean Squared Error (MSE) and R-squared.
{{doneWhenLabel}}: You have a model that predicts a numerical value with an R-squared score.
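The three bullets above fit into one short Scikit-Learn script; the data is synthetic (y = 3x + 5 plus noise) so the model has a known answer to recover:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + 5 + rng.normal(0, 1, size=200)   # linear signal plus noise

# Split, train, evaluate — the standard Scikit-Learn loop
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
print("R²:", r2_score(y_test, pred))
```

Because the noise is small relative to the signal, R-squared should come out close to 1; on real data it won't, and interpreting why is the skill being built here.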
{{whyLabel}}: Classification is used for binary outcomes (Yes/No, Spam/Not Spam).
{{howLabel}}:
- Understand the Sigmoid function.
- Train a LogisticRegression model.
- Create a Confusion Matrix to see True Positives vs False Positives.
{{doneWhenLabel}}: You can explain the difference between Precision and Recall for your model.
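A classification run with a confusion matrix, Precision, and Recall looks like this; the two-feature synthetic data is chosen so the classes are cleanly separable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # binary label with a linear boundary

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
pred = clf.predict(X_test)

print(confusion_matrix(y_test, pred))   # rows: actual, columns: predicted
print("Precision:", precision_score(y_test, pred))  # of predicted positives, how many were right
print("Recall:", recall_score(y_test, pred))        # of actual positives, how many were found
```

The comments on the last two lines are the milestone in miniature: Precision penalizes false positives, Recall penalizes false negatives.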
{{whyLabel}}: Tree-based models are highly powerful and handle non-linear data better than simple regression.
{{howLabel}}:
- Learn how a Decision Tree splits data based on 'Entropy' or 'Gini Impurity'.
- Use 'Random Forest' (an ensemble of trees) to improve accuracy and reduce overfitting.
- Visualize a single tree using plot_tree.
{{doneWhenLabel}}: You have trained a Random Forest and identified the 'Feature Importance' (which variables matter most).
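Training a Random Forest and reading off Feature Importance takes only a few lines; `make_classification` generates a synthetic dataset where only 2 of the 5 features actually carry signal:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 5 features, but only 2 are informative — the importances should reflect that
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances sum to 1.0 across all features
for i, imp in enumerate(forest.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
```

Seeing the two informative features dominate the printout is the "which variables matter most" insight the milestone asks for.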
{{whyLabel}}: Sometimes you have data but no labels; clustering finds hidden groups (e.g., customer segments).
{{howLabel}}:
- Use the K-Means algorithm.
- Find the optimal number of clusters using the 'Elbow Method'.
- Visualize the clusters in a scatter plot.
{{doneWhenLabel}}: You have grouped a dataset into 3-5 distinct clusters based on similarity.
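K-Means and the Elbow Method can be sketched together; the data below is three artificial, well-separated blobs, so the elbow should appear clearly at k=3:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated 2D blobs of 50 points each
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in [(0, 0), (5, 5), (0, 5)]])

# Elbow Method: fit K-Means for several k and record the inertia
# (sum of squared distances of points to their nearest cluster center)
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
print(inertias)  # drops sharply until k=3, then flattens — the 'elbow'
```

Plotting `inertias` against k (a simple line chart) makes the elbow visible; picking the k at the bend is the whole method.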
{{whyLabel}}: Real-world data is messy; finishing a project proves you can handle the full pipeline from cleaning to prediction.
{{howLabel}}:
- Choose a dataset (e.g., 'Telco Customer Churn' or 'NYC Taxi Trips').
- Perform cleaning, EDA, Feature Engineering, and Modeling.
- Document every step in a clean Jupyter Notebook.
{{doneWhenLabel}}: You have a final notebook that achieves a specific accuracy or error metric.
{{whyLabel}}: GitHub is your 'technical resume' where recruiters verify your coding ability.
{{howLabel}}:
- Create a GitHub account.
- Learn basic Git commands: git init, git add, git commit, git push.
- Upload your best 2-3 projects with a professional README.md explaining the problem, solution, and results.
{{doneWhenLabel}}: Your GitHub profile has at least two repositories with clear documentation.
{{whyLabel}}: You need to translate your technical projects into business value to get interviews.
{{howLabel}}:
- Use the 'X-Y-Z' formula: 'Accomplished X as measured by Y, by doing Z'.
- Highlight specific tools (Python, SQL, Scikit-Learn).
- Include a link to your GitHub and LinkedIn.
{{doneWhenLabel}}: You have a one-page PDF resume tailored for Junior Data Scientist roles.