Python Data Science
Python has become the leading language for data science due to its simplicity, versatility, and the rich ecosystem of libraries specifically designed for data analysis, visualization, and machine learning. This guide covers the core libraries and techniques used in the Python data science workflow.
NumPy: Numerical Computing
NumPy is the fundamental package for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays efficiently.
# Install NumPy
# pip install numpy

import numpy as np

# Creating arrays
arr1 = np.array([1, 2, 3, 4, 5])                    # 1D array
arr2 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  # 2D array

# Array creation functions
zeros = np.zeros((3, 4))             # Array of zeros
ones = np.ones((2, 3, 4))            # Array of ones
empty = np.empty((2, 3))             # Uninitialized array
arange = np.arange(10, 30, 5)        # [10, 15, 20, 25]
linspace = np.linspace(0, 1, 5)      # 5 evenly spaced values between 0 and 1
random = np.random.random((2, 2))    # Random values between 0 and 1
identity = np.eye(3)                 # 3x3 identity matrix

# Array attributes
print(f"Shape: {arr2.shape}")        # (3, 3)
print(f"Dimensions: {arr2.ndim}")    # 2
print(f"Size: {arr2.size}")          # 9
print(f"Data type: {arr2.dtype}")    # int64

# Indexing and slicing
print(arr2[0, 0])      # 1 (first element)
print(arr2[0, :])      # [1, 2, 3] (first row)
print(arr2[:, 0])      # [1, 4, 7] (first column)
print(arr2[0:2, 1:3])  # [[2, 3], [5, 6]] (sub-matrix)

# Reshaping arrays
arr3 = np.arange(12)
arr3_reshaped = arr3.reshape(3, 4)
arr3_flattened = arr3_reshaped.flatten()

# Basic operations
a = np.array([10, 20, 30, 40])
b = np.array([1, 2, 3, 4])

# Element-wise operations
print(a + b)    # [11, 22, 33, 44]
print(a - b)    # [9, 18, 27, 36]
print(a * b)    # [10, 40, 90, 160]
print(a / b)    # [10., 10., 10., 10.]
print(a ** 2)   # [100, 400, 900, 1600]

# Matrix operations
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A.dot(B))      # Matrix multiplication
print(np.dot(A, B))  # Alternative syntax
print(A @ B)         # Python 3.5+ syntax

# Statistical functions
data = np.array([1, 2, 3, 4, 5])
print(f"Sum: {np.sum(data)}")
print(f"Mean: {np.mean(data)}")
print(f"Median: {np.median(data)}")
print(f"Standard deviation: {np.std(data)}")
print(f"Minimum: {np.min(data)}")
print(f"Maximum: {np.max(data)}")

# Broadcasting
# NumPy can automatically handle operations between arrays of different shapes
grid = np.zeros((3, 3))
row = np.array([1, 2, 3])
grid = grid + row   # Row is broadcast to all rows of grid

# Conditional operations
values = np.array([1, 2, 3, 4, 5, 6])
even_mask = (values % 2 == 0)           # [False, True, False, True, False, True]
even_values = values[even_mask]         # [2, 4, 6]
values_clipped = np.clip(values, 2, 5)  # [2, 2, 3, 4, 5, 5]
Pandas: Data Analysis
Pandas is a powerful library for data manipulation and analysis. It provides data structures like DataFrames (tabular data) and Series (one-dimensional arrays), along with functions to efficiently process and analyze data.
# Install Pandas
# pip install pandas

import pandas as pd
import numpy as np

# Creating Series (1D labeled arrays)
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

# Creating DataFrames (2D labeled data structure)
# From a dictionary
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 42],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}
df1 = pd.DataFrame(data)
print(df1)

# From a NumPy array
dates = pd.date_range('20230101', periods=6)
df2 = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df2)

# Reading data
# df_csv = pd.read_csv('file.csv')
# df_excel = pd.read_excel('file.xlsx')
# df_sql = pd.read_sql('SELECT * FROM table', connection)
# df_json = pd.read_json('file.json')

# Viewing data
print(df1.head())      # First 5 rows
print(df1.tail(2))     # Last 2 rows
print(df1.describe())  # Summary statistics
df1.info()             # DataFrame info (prints a summary, returns None)

# Accessing data
# By column
print(df1['Name'])           # Single column
print(df1[['Name', 'Age']])  # Multiple columns

# By row (iloc for integer position, loc for label)
print(df1.iloc[0])               # First row
print(df1.iloc[0:2])             # First two rows
print(df1.loc[df1['Age'] > 30])  # Conditional selection

# Data manipulation
# Adding columns
df1['Country'] = ['USA', 'France', 'Germany', 'UK']
df1['Birth Year'] = 2023 - df1['Age']

# Modifying data
df1.loc[0, 'Age'] = 29
df1['Age'] = df1['Age'] + 1   # Increment all ages

# Filtering data
young_people = df1[df1['Age'] < 35]
europeans = df1[df1['Country'].isin(['France', 'Germany', 'UK'])]

# Handling missing data
df3 = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})
print(df3.isna())           # Identify missing values
print(df3.dropna())         # Drop rows with any missing values
print(df3.fillna(value=0))  # Fill missing values with 0

# Grouping and aggregation
df4 = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Value1': [10, 15, 20, 25, 30, 35],
    'Value2': [100, 150, 200, 250, 300, 350]
})

# Group by Category and calculate statistics
grouped = df4.groupby('Category')
print(grouped.mean())               # Mean of each group
print(grouped.agg(['min', 'max']))  # Min and max of each group

# Custom aggregation
print(grouped.agg({
    'Value1': ['min', 'max', 'mean'],
    'Value2': ['sum', 'mean']
}))

# Merging and joining
left = pd.DataFrame({
    'key': ['A', 'B', 'C', 'D'],
    'value': [1, 2, 3, 4]
})
right = pd.DataFrame({
    'key': ['B', 'D', 'E', 'F'],
    'value': [5, 6, 7, 8]
})

# Different types of joins
inner_join = pd.merge(left, right, on='key', how='inner')
left_join = pd.merge(left, right, on='key', how='left')
outer_join = pd.merge(left, right, on='key', how='outer')

# Reshaping data
df5 = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Variable': ['X', 'Y', 'X', 'Y'],
    'Value': [1, 3, 2, 4]
})

# Wide format (pivot)
wide = df5.pivot(index='Date', columns='Variable', values='Value')
print(wide)

# Long format (melt)
long = wide.reset_index().melt(id_vars=['Date'], value_vars=['X', 'Y'])
print(long)

# Time series
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2023', periods=1000))
print(ts.resample('M').mean())       # Monthly resampling
print(ts.shift(2))                   # Shift data by 2 periods
print(ts.rolling(window=7).mean())   # 7-day rolling average

# Visualization (requires matplotlib)
# df1.plot(kind='bar')
# df2.plot(kind='line')
# df1['Age'].plot(kind='hist')
Data Visualization
Data visualization is crucial for understanding patterns, trends, and relationships in data. Python offers several powerful libraries for creating visualizations, with Matplotlib and Seaborn being the most popular ones.
Matplotlib
# Install matplotlib
# pip install matplotlib

import matplotlib.pyplot as plt
import numpy as np

# Basic line plot
x = np.linspace(0, 10, 100)
y = np.sin(x)

plt.figure(figsize=(10, 6))
plt.plot(x, y, label='sin(x)')
plt.title('Sine Function')
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.grid(True)
plt.legend()
# plt.savefig('sine_plot.png')
# plt.show()

# Multiple plots on the same figure
plt.figure(figsize=(10, 6))
plt.plot(x, np.sin(x), label='sin(x)')
plt.plot(x, np.cos(x), label='cos(x)')
plt.title('Sine and Cosine Functions')
plt.xlabel('x')
plt.ylabel('y')
plt.grid(True)
plt.legend()
# plt.show()

# Subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes[0, 0].plot(x, np.sin(x))
axes[0, 0].set_title('Sine')
axes[0, 1].plot(x, np.cos(x))
axes[0, 1].set_title('Cosine')
axes[1, 0].plot(x, np.tan(x))
axes[1, 0].set_title('Tangent')
axes[1, 0].set_ylim(-5, 5)   # Limit y-axis for better visibility
axes[1, 1].plot(x, x**2)
axes[1, 1].set_title('Quadratic')

# Adjust layout
plt.tight_layout()
# plt.show()

# Different plot types
# Scatter plot (separate variable names so x and y above are not overwritten)
plt.figure(figsize=(8, 6))
xs = np.random.rand(50)
ys = np.random.rand(50)
colors = np.random.rand(50)
sizes = 1000 * np.random.rand(50)
plt.scatter(xs, ys, c=colors, s=sizes, alpha=0.7)
plt.title('Scatter Plot')
plt.colorbar()
# plt.show()

# Bar plot
categories = ['A', 'B', 'C', 'D', 'E']
values = [3, 7, 2, 5, 8]

plt.figure(figsize=(8, 6))
plt.bar(categories, values, color='skyblue')
plt.title('Bar Plot')
plt.xlabel('Categories')
plt.ylabel('Values')
# plt.show()

# Histogram
data = np.random.randn(1000)   # 1000 random samples from normal distribution

plt.figure(figsize=(8, 6))
plt.hist(data, bins=30, color='skyblue', edgecolor='black', alpha=0.7)
plt.title('Histogram')
plt.xlabel('Value')
plt.ylabel('Frequency')
# plt.show()

# Pie chart
labels = ['Python', 'Java', 'JavaScript', 'C++', 'Other']
sizes = [45, 15, 20, 10, 10]
explode = (0.1, 0, 0, 0, 0)   # Explode the 1st slice

plt.figure(figsize=(8, 8))
plt.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
plt.axis('equal')   # Equal aspect ratio ensures that pie is drawn as a circle
plt.title('Programming Languages')
# plt.show()

# Box plot
data = [np.random.normal(0, std, 100) for std in range(1, 4)]

plt.figure(figsize=(8, 6))
plt.boxplot(data, vert=True, patch_artist=True)
plt.title('Box Plot')
plt.xlabel('Group')
plt.ylabel('Value')
# plt.show()

# 3D plot
from mpl_toolkits.mplot3d import Axes3D   # only needed on older matplotlib versions

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

# Create the mesh in polar coordinates and compute corresponding Z
r = np.linspace(0, 1.25, 50)
p = np.linspace(0, 2*np.pi, 50)
R, P = np.meshgrid(r, p)
Z = R**2 * np.sin(P)

# Express the mesh in the cartesian system
X, Y = R*np.cos(P), R*np.sin(P)

# Plot the surface
surf = ax.plot_surface(X, Y, Z, cmap=plt.cm.YlGnBu_r)

# Adjust the viewing angle
ax.view_init(40, 45)
plt.colorbar(surf)
# plt.show()

# Customizing plots
plt.figure(figsize=(10, 6))
plt.plot(x, np.sin(x), 'r-', linewidth=2, label='sin(x)')
plt.plot(x, np.cos(x), 'b--', linewidth=2, label='cos(x)')
plt.grid(True, linestyle='--', alpha=0.7)
plt.title('Customized Plot', fontsize=16)
plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(fontsize=12)

# Add text annotation
plt.annotate('Local maximum', xy=(1.5, 1), xytext=(2, 1.4),
             arrowprops=dict(facecolor='black', shrink=0.05))

# Add a vertical line
plt.axvline(x=np.pi/2, color='green', linestyle='--', alpha=0.7)

# Save with high resolution
# plt.savefig('custom_plot.png', dpi=300, bbox_inches='tight')
# plt.show()
Seaborn
# Install seaborn
# pip install seaborn

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Set the styling
sns.set_theme(style="whitegrid")

# Sample data
tips = sns.load_dataset("tips")
flights = sns.load_dataset("flights")
iris = sns.load_dataset("iris")

# Basic plotting with Seaborn
# Distribution plots
plt.figure(figsize=(10, 6))
sns.histplot(tips['total_bill'], kde=True)
plt.title('Distribution of Total Bill')
# plt.show()

# Scatter plot with categorical data
plt.figure(figsize=(10, 6))
sns.scatterplot(x="total_bill", y="tip", hue="time", size="size", data=tips)
plt.title('Tips vs Total Bill by Time of Day')
# plt.show()

# Categorical plots
plt.figure(figsize=(10, 6))
sns.boxplot(x="day", y="total_bill", hue="sex", data=tips)
plt.title('Total Bill by Day and Gender')
# plt.show()

# Violin plot - combines box plot with kernel density estimate
plt.figure(figsize=(10, 6))
sns.violinplot(x="day", y="total_bill", hue="sex", split=True, data=tips)
plt.title('Total Bill by Day and Gender (Violin Plot)')
# plt.show()

# Count plot - shows the counts of observations
plt.figure(figsize=(10, 6))
sns.countplot(x="day", hue="sex", data=tips)
plt.title('Count of Orders by Day and Gender')
# plt.show()

# Pair plot - grid of plots for multiple variables
sns.pairplot(iris, hue="species", height=2.5)
plt.suptitle('Pair Plot of Iris Dataset', y=1.02)
# plt.show()

# Heat map - useful for correlation matrices
plt.figure(figsize=(10, 8))
correlation = iris.drop('species', axis=1).corr()
sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Iris Features')
# plt.show()

# Regression plot - linear regression with confidence interval
plt.figure(figsize=(10, 6))
sns.regplot(x="sepal_length", y="petal_length", data=iris)
plt.title('Sepal Length vs Petal Length (with Regression Line)')
# plt.show()

# Facet grid - multiple plots organized by different variables
g = sns.FacetGrid(tips, col="sex", row="smoker", height=4)
g.map(sns.scatterplot, "total_bill", "tip")
g.add_legend()
# plt.show()

# Joint plot - combines bivariate and univariate plots
sns.jointplot(x="sepal_length", y="sepal_width", data=iris, kind="hex")
# plt.show()

# Cluster map - hierarchical clustering with heat map
# (clustermap creates its own figure, so pass figsize directly)
iris_no_species = iris.drop('species', axis=1)
sns.clustermap(iris_no_species, standard_scale=1, cmap='coolwarm', figsize=(12, 10))
# plt.show()

# Time series visualization
# Reshape the flights data for plotting
flights_pivot = flights.pivot(index="month", columns="year", values="passengers")

plt.figure(figsize=(12, 8))
sns.heatmap(flights_pivot, cmap="YlGnBu", annot=True, fmt="d")
plt.title('Number of Passengers per Month (1949-1960)')
# plt.show()

# Line plot for time series
plt.figure(figsize=(12, 6))
sns.lineplot(x="year", y="passengers", hue="month", data=flights)
plt.title('Passenger Numbers by Year and Month')
# plt.show()

# Setting different themes
# The axes style is applied when an Axes is created, so create each subplot
# inside an axes_style context to show the themes side by side
themes = ["darkgrid", "whitegrid", "dark", "white", "ticks"]
fig = plt.figure(figsize=(20, 5))
for i, theme in enumerate(themes):
    with sns.axes_style(theme):
        ax = fig.add_subplot(1, len(themes), i + 1)
    sns.lineplot(x="sepal_length", y="sepal_width", hue="species", data=iris, ax=ax)
    ax.set_title(f"Theme: {theme}")
    ax.set_xlabel("")
    ax.set_ylabel("")
plt.tight_layout()
# plt.show()
Machine Learning with Scikit-learn
Scikit-learn is a powerful machine learning library in Python that provides simple and efficient tools for data analysis and modeling. It includes various classification, regression, and clustering algorithms along with tools for model selection, preprocessing, and evaluation.
# Install scikit-learn
# pip install scikit-learn

import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load a dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Data Preprocessing
# Standardize features (mean=0, variance=1)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Classification
# Logistic Regression
from sklearn.linear_model import LogisticRegression

# Create and train the model
logreg = LogisticRegression(max_iter=200, random_state=42)
logreg.fit(X_train_scaled, y_train)

# Make predictions
y_pred = logreg.predict(X_test_scaled)

# Evaluate the model
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(cm)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Other common classification algorithms
# K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
knn_pred = knn.predict(X_test_scaled)
print(f"KNN Accuracy: {accuracy_score(y_test, knn_pred):.4f}")

# Support Vector Machine
from sklearn.svm import SVC
svm = SVC(kernel='rbf', random_state=42)
svm.fit(X_train_scaled, y_train)
svm_pred = svm.predict(X_test_scaled)
print(f"SVM Accuracy: {accuracy_score(y_test, svm_pred):.4f}")

# Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train_scaled, y_train)
dt_pred = dt.predict(X_test_scaled)
print(f"Decision Tree Accuracy: {accuracy_score(y_test, dt_pred):.4f}")

# Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_scaled, y_train)
rf_pred = rf.predict(X_test_scaled)
print(f"Random Forest Accuracy: {accuracy_score(y_test, rf_pred):.4f}")

# Regression
# The Boston Housing dataset was removed from scikit-learn (1.2+),
# so the bundled diabetes dataset is used here instead
diabetes = datasets.load_diabetes()
X_diab = diabetes.data
y_diab = diabetes.target

X_train_diab, X_test_diab, y_train_diab, y_test_diab = train_test_split(
    X_diab, y_diab, test_size=0.2, random_state=42)

# Standardize features
scaler_diab = StandardScaler()
X_train_diab_scaled = scaler_diab.fit_transform(X_train_diab)
X_test_diab_scaled = scaler_diab.transform(X_test_diab)

# Linear Regression
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train_diab_scaled, y_train_diab)
lr_pred = lr.predict(X_test_diab_scaled)

# Evaluate the model
mse = mean_squared_error(y_test_diab, lr_pred)
r2 = r2_score(y_test_diab, lr_pred)
print(f"Linear Regression MSE: {mse:.4f}")
print(f"Linear Regression R²: {r2:.4f}")

# Other regression algorithms
# Ridge Regression (L2 regularization)
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_diab_scaled, y_train_diab)
ridge_pred = ridge.predict(X_test_diab_scaled)
print(f"Ridge Regression R²: {r2_score(y_test_diab, ridge_pred):.4f}")

# Lasso Regression (L1 regularization)
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_diab_scaled, y_train_diab)
lasso_pred = lasso.predict(X_test_diab_scaled)
print(f"Lasso Regression R²: {r2_score(y_test_diab, lasso_pred):.4f}")

# Clustering
# K-Means
from sklearn.cluster import KMeans

# Use only sepal length and width for visualization
X_cluster = X[:, :2]
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X_cluster)

# Visualize the clusters
plt.figure(figsize=(10, 6))
plt.scatter(X_cluster[:, 0], X_cluster[:, 1], c=clusters, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='red', marker='X', label='Centroids')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title('K-Means Clustering of Iris Dataset')
plt.legend()
# plt.show()

# Hierarchical Clustering
from sklearn.cluster import AgglomerativeClustering
# (the old 'affinity' argument has been removed; ward linkage always uses Euclidean distance)
hc = AgglomerativeClustering(n_clusters=3, linkage='ward')
hc_clusters = hc.fit_predict(X_cluster)

# Feature Importance
# Using Random Forest
feature_importance = pd.DataFrame({
    'Feature': iris.feature_names,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)
print("\nFeature Importance:")
print(feature_importance)

# Cross-validation
# K-fold cross-validation
cv_scores = cross_val_score(logreg, X, y, cv=5)
print(f"\nLogistic Regression CV Scores: {cv_scores}")
print(f"Average CV Score: {np.mean(cv_scores):.4f}")

# Hyperparameter Tuning
# Grid Search
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['lbfgs', 'newton-cg']   # solvers that handle multiclass problems directly
}
grid_search = GridSearchCV(LogisticRegression(max_iter=200, random_state=42),
                           param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y)
print("\nBest Hyperparameters:", grid_search.best_params_)
print(f"Best Score: {grid_search.best_score_:.4f}")

# Pipelines
# Creating a preprocessing and modeling pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42))
])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)
print(f"\nPipeline Accuracy: {accuracy_score(y_test, pipe_pred):.4f}")

# Working with mixed data types (categorical and numerical)
# For demonstration, creating a mixed dataset
# categorical_features = ['cat1', 'cat2']
# numerical_features = ['num1', 'num2', 'num3']
# preprocessor = ColumnTransformer(
#     transformers=[
#         ('num', Pipeline([
#             ('imputer', SimpleImputer(strategy='median')),
#             ('scaler', StandardScaler())
#         ]), numerical_features),
#         ('cat', Pipeline([
#             ('imputer', SimpleImputer(strategy='most_frequent')),
#             ('onehot', OneHotEncoder(handle_unknown='ignore'))
#         ]), categorical_features)
#     ]
# )

# Processing with a pipeline
# full_pipeline = Pipeline([
#     ('preprocessor', preprocessor),
#     ('classifier', RandomForestClassifier(random_state=42))
# ])

# Model Persistence
import joblib

# Save the model
joblib.dump(logreg, 'logreg_model.pkl')

# Load the model
loaded_model = joblib.load('logreg_model.pkl')
loaded_pred = loaded_model.predict(X_test_scaled)
print(f"\nLoaded Model Accuracy: {accuracy_score(y_test, loaded_pred):.4f}")
Best Practices in Data Science
Following best practices in data science projects ensures reproducibility, maintainability, and efficiency.
Project Structure
- Organize your code in modular scripts or packages
- Separate data acquisition, processing, modeling, and evaluation (see the sketch after this list)
- Use version control (Git) for your code
- Create clear documentation
- Use virtual environments to manage dependencies
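As a rough illustration of the modular layout described above, the sketch below splits a small project into separate functions for data loading, preprocessing, modeling, and evaluation. The file name, function names, and the data/raw.csv path are hypothetical placeholders, not references to any specific project.

# project_pipeline.py - a minimal, hypothetical sketch of a modular script layout
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def load_data(path):
    """Data acquisition: read the raw dataset from disk."""
    return pd.read_csv(path)


def preprocess(df, target_column):
    """Processing: drop rows with missing values, split features from target."""
    df = df.dropna()
    X = df.drop(columns=[target_column])
    y = df[target_column]
    return X, y


def train_model(X_train, y_train):
    """Modeling: fit a model on the training data."""
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    return model


def evaluate(model, X_test, y_test):
    """Evaluation: report a single metric on held-out data."""
    return accuracy_score(y_test, model.predict(X_test))


if __name__ == "__main__":
    df = load_data("data/raw.csv")   # hypothetical path
    X, y = preprocess(df, target_column="label")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = train_model(X_train, y_train)
    print(f"Accuracy: {evaluate(model, X_test, y_test):.4f}")

Keeping each stage behind its own function makes the steps easy to test individually and to swap out later, which is the point of the modular structure above.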
Data Preprocessing
- Always explore your data before modeling
- Handle missing values appropriately
- Check for and handle outliers
- Scale features when needed
- Split data into training, validation, and test sets (see the sketch below)
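A minimal sketch of that preprocessing checklist, using the iris data only as a stand-in for a real dataset. The 3-standard-deviation outlier rule and the 60/20/20 split sizes are illustrative choices, not fixed recommendations.

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Explore the data first
iris = load_iris(as_frame=True)
df = iris.frame
print(df.describe())
print(df.isna().sum())   # Check for missing values

# Handle missing values (median imputation as one simple option)
df = df.fillna(df.median(numeric_only=True))

# Flag outliers with a simple 3-standard-deviation rule (illustrative only)
features = df.drop(columns=["target"])
z_scores = (features - features.mean()) / features.std()
outlier_rows = (z_scores.abs() > 3).any(axis=1)
print(f"Potential outliers: {outlier_rows.sum()}")

# Split into train / validation / test (roughly 60 / 20 / 20)
X, y = features, df["target"]
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Scale features using statistics from the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)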
Modeling
- Start with simple models before trying complex ones (see the sketch after this list)
- Use cross-validation to evaluate model performance
- Tune hyperparameters systematically
- Evaluate models using appropriate metrics
- Watch out for overfitting and underfitting
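One way to apply those modeling guidelines, sketched with the iris data as a stand-in: start from a simple baseline, compare it against a more complex model using cross-validation, and compare train vs. test accuracy as a rough overfitting check. The specific models and the 5-fold setting are illustrative choices.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Simple baseline first, then a more complex model
models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=200)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    # Cross-validation gives a more stable estimate than a single split
    cv_scores = cross_val_score(model, X_train, y_train, cv=5)
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    # A large gap between train and test accuracy suggests overfitting
    print(f"{name}: CV={cv_scores.mean():.3f} +/- {cv_scores.std():.3f}, "
          f"train={train_acc:.3f}, test={test_acc:.3f}")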
Reproducibility
- Set random seeds for reproducibility (see the sketch after this list)
- Use configuration files for parameters
- Document data sources and transformations
- Consider containerization (e.g., Docker) for environment consistency
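A small sketch of the seed-and-config idea: random seeds are set explicitly, and run parameters are read from a configuration file rather than hard-coded. The config.json file name and its keys are hypothetical; any format (YAML, TOML, etc.) works the same way.

import json
import random
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical config.json:
# {"seed": 42, "test_size": 0.2, "n_estimators": 100}
with open("config.json") as f:
    config = json.load(f)

# Set seeds everywhere randomness is used
random.seed(config["seed"])
np.random.seed(config["seed"])

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=config["test_size"], random_state=config["seed"])

model = RandomForestClassifier(
    n_estimators=config["n_estimators"], random_state=config["seed"])
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.4f}")

Keeping parameters in a versioned config file alongside the code means a run can be repeated later with exactly the same settings.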
Practice Exercises
Practice with these exercises to improve your data science skills:
- Data Exploration: Download a dataset from Kaggle or UCI Machine Learning Repository and perform exploratory data analysis using Pandas and visualization libraries.
- Feature Engineering: Create new features from an existing dataset that improve the performance of a machine learning model.
- Model Comparison: Implement and compare the performance of at least three different machine learning algorithms on the same dataset.
- Time Series Analysis: Find a time series dataset and build a model to forecast future values.
- Pipeline Construction: Build a complete data science pipeline from data loading to preprocessing, modeling, and evaluation.