Index
Data Mining and Machine Learning
Modules:
- Data Mining: Prof. Golfarelli - 36h
- introduction to data mining
- Knowledge discovery process
- Understanding and preparing data
- Data mining techniques
- Data understanding and validation
- Weka software
- Case studies analysis
- Machine Learning: Prof. Guido Borghi - 18h
- Introduction to AI
- Machine Learning and Deep Learning
- Data acquisition and Processing
- Model Training
- Metrics
- LIBRARIES:
- Scikit-learn (ML)
- TensorFlow (DL)
Assessment Method - EXAM
The exam consists of an oral exam on all the subjects of the course (80%) and a project (20%, agreed with the teacher).
The project must be carried out in the Machine Learning module, choosing between:
- Study of an algorithm among those in the literature
- Analysis of a data set with mining techniques
There are no fixed dates for the exam; dates can be arranged with the teachers throughout the academic year. The two modules must be discussed within 15 days of each other.
Data Mining ↵
Data Mining
The amount of data stored on computers is constantly increasing, coming from:
- IoT data
- Social data
- Data on purchases
- Banking and credit card transactions
The first step is to collect data into a data set. This step can be automated through artificial intelligence, increasing the analytical power.
On one side, the amount of data keeps growing; on the other side, hardware becomes more powerful and cheaper every day.
At the same time, managers are more and more willing to rely on data analysis for their business decisions. The information resource is a precious asset for overcoming competitors.
Artificial Intelligence, Machine Learning and Data Mining
Although strongly interrelated, the term Machine Learning is formally distinct from the term Data Mining, which indicates the computational process of pattern discovery in large datasets using machine learning methods, artificial intelligence, statistics and databases.
Data Mining - definition
Complex extraction of implicit, previously unknown and potentially useful information from data. Exploration and analysis, using automated and semi-automatic systems, of large amounts of data in order to find significant patterns through statistics.
We do not just need to find results, but we need results to be USEFUL.
Analytics
Analytics refers to software used to discover, understand and share relevant patterns in data. It is based on the concurrent use of statistics, machine learning and operational research techniques, often exploiting visualization.

Prescriptive systems generate much value but are extremely complex. Companies should start simple, adopting basic descriptive analytics solutions, and then move on. It is risky to skip intermediate steps.
BI adoption path
When we decide to digitalize a company, the adoption of BI solutions is incremental and rarely allows steps to be skipped. This is because it is risky, costly and useless to adopt advanced solutions before completely exploiting simple ones.
The goal is to create a data-driven company, where managers are supported by data.
- Decisions are based on quantitative rather than qualitative knowledge.
- Processes and knowledge are assets of the company and are not lost if managers change
The gap between a data-driven decision and a good decision is a good manager
Adopting a data-driven mindset goes far beyond adopting a business intelligence solution and entails:
- Create a data culture
- Change the mindset of managers
- Change processes
- Improve the quality of all the data
Digitalization is a journey that involves three main dimensions:

Pattern
A pattern is a synthetic representation rich in semantics of a set of data. It usually expresses a recurring pattern in data, but can also express an exceptional pattern.
A pattern must be:
- Valid on the data with a certain degree of confidence
- Understandable from a syntactic and semantic point of view, so that the user can interpret it
- Previously unknown and potentially useful, so that users can take actions accordingly
What distinguishes a manual technique (DW) from an automatic one is the creation of a small subset of data that is rich in semantics.
The process begins with a huge multi-dimensional cube of data; grouping and selection techniques are then applied, producing a pattern.
Pattern types:
- Association rules (logical implications of the dataset)
- Classifiers (classify data according to a set of a priori assigned classes)
- Decision trees (identify the causes that lead to an event, in order of importance)
- Clustering (group elements depending on their characteristics)
- Time series (detection of recurring or atypical patterns in complex data sequences)
Data Mining Applications
Predictive Systems
Exploit some features to predict the unknown values of other features (classification and regression).
Descriptive Systems
Find user-readable patterns that can be understood by human users (clustering, association rules, sequential pattern).
Classification - Definition
Given a record set, where each record is composed by a set of attributes (one of them represents the class of the record), find a model for the class attribute expressing the attribute value as a function of the remaining attributes.
Given a class (defined a priori), determine whether a record belongs to that class.
The model must also work on previously unseen records: unclassified records must be assigned to a class as accurately as possible.
A test set is used to determine the model accuracy.

Classification example
Direct Marketing: The goal is to reduce the cost of email marketing by defining the set of customers that, with the highest probability, will buy a new product.
Technique:
- Exploit the data collected during the launch of similar products
- We know which customers bought and which ones did not
- {buy, not buy} = class attribute
- Collect all the available information about each customer
- Use such information as input to train the model
Churn Detection: predict customers who are likely to move to a competitor (a minimal classification sketch follows the technique list below).
Technique:
- Use the purchasing data of individual users to find the relevant attributes
- Label users as {loyal, not loyal}
- Find a pattern that defines loyalty
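A minimal sketch of how such a churn classifier could be trained with scikit-learn (the library used later in the Machine Learning module). The feature names (visits per month, average spend, months since last purchase) and the toy data are hypothetical, only for illustration:

```python
# Hypothetical churn-detection sketch: features and data are invented.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Each row describes one customer; the label is 1 = loyal, 0 = not loyal.
X = np.array([[12, 85.0, 1], [2, 10.0, 7], [9, 60.0, 2], [1, 5.0, 11],
              [15, 120.0, 0], [3, 22.0, 6], [8, 75.0, 1], [0, 0.0, 12]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0)  # simple, interpretable model
model.fit(X_train, y_train)                                  # training phase

new_customer = np.array([[4, 30.0, 5]])    # unclassified record
print(model.predict(new_customer))         # predicted class, e.g. 0 -> likely to churn
print(model.score(X_test, y_test))         # accuracy on the held-out records
```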
Clustering example
Given a set of points, each featuring a set of attributes, and having a similarity measure between points, find subsets of points such that points belonging to the same cluster are more similar to each other than to points belonging to other clusters.
Marketing Segmentation
The goal is to split customers into distinct subsets to target specific marketing activities.
Techniques:
- Gather information about customer lifestyle and geographic location
- Find clusters of similar customers
- Measure cluster quality by verifying whether the purchasing patterns of customers belonging to the same cluster are more similar to each other than to those of customers in distinct clusters
Association Rules example
Given a set of records, each consisting of multiple elements belonging to a given collection, association rule mining produces rules of dependence that predict the occurrence of one of the elements in the presence of others.
Marketing Sales Promotion: suppose you have discovered the association rule {Bagels, ...} -> {Potato chips}.
This information can be used to understand what actions to take to increase potato chip sales.
Data Mining Challenges
- Scalability
- Multidimensionality of data set
- Complexity and heterogeneity of the data
- Data quality
- Data properties
- Privacy keeping
- Processing in real-time
CRISP methodology
A data mining project requires a structured approach in order to choose the best algorithm.
CRISP-DM methodology is the most used technique. It is one of the most structured proposals to define the fundamental steps of a data mining project.

The six stages of the life cycle are not strictly sequential; indeed, it is often necessary to move back and forth between phases.
- Business understanding (understand the application domain): understanding project goals from users' point of view, translate the user's problem into a data mining problem and define a project plan.
- Get an idea about the business domain and the data mining approach to adopt.
- Data understanding: preliminary data collection aimed at identifying quality problems and conducting preliminary analysis to identify the salient characteristics.
- Data preparation: tasks needed to create the final dataset, selecting attributes and records, transforming and cleaning data.
- Prepare the data for ML tasks (clean, complete missing data, create new features)
- Model creation: data mining techniques are applied to the dataset in order to identify what makes the model more accurate.
- Evaluation of model and results: the models obtained from the previous phase are analyzed to verify that they are sufficiently precise and robust to respond adequately to the user's objectives.
- Deployment: the built model and the acquired knowledge must be made available to users.
- Change the software and processes to include new AI functionalities
Different classes of data mining use different algorithms so the evaluation changes accordingly.
Customer Retention
Customer retention, churn analysis, dropout analysis are synonyms for predictive analysis carried out by organizations and companies to avoid losing customers.
The idea is to create a different profile for customers who stay and customers who drop-out.
The Gym Case Study
They discovered that customers who did not train well, eventually drop out from the gym. Therefore, the goal was to model customers' training sessions in order to predict those who did not train well and prevent them from dropping out.
Steps:
- Customers have a list of prescribed exercises
- The system records the exercises (and repetitions) done during the workout
- The system matches the performed exercises with the prescribed ones
- Train a classifier that is able to predict that someone is leaving the gym because he is unsatisfied
- The system updates the profile each week
- Four weeks without training = dropout
- The idea of dropout needs to be defined properly (a customer who stops going to the gym in summer and comes back afterwards is different from a customer who drops out and does not come back)
A practitioner who is about to leave the gym is training poorly. How can we characterize this user behavior? How long does it last?
Many KPIs can be adopted to assess the training session: in this case, two indicators were identified:
- Compliance (adherence of the performed workout to the prescribed one)
- Regularity (regularity of the training sessions with reference to the prescribed ones)
We still have a problem of granularity: we can assess regularity by checking steps, repetitions, physical activity, a muscle or a body part.
Ended: Data Mining
Data Understanding ↵
Data Understanding & Preparation
In data mining, data are composed of collections of objects described by a set of attributes (we refer to data that can be stored in a database).
Attribute: property characteristic of an object
Attribute types
In order to perform meaningful analysis, the characteristics of the attributes must be known. The attribute type tells us what properties are reflected in the value we use as a measure.
We can identify 4 types of attributes:
- Nominal-qualitative: values are just different names (gender, zip code, ID)
- Ordinal-qualitative: values enable us to sort objects based on the value of the attribute (grade)
- Interval-quantitative: the difference between values has a meaning, with a unit of measurement (dates, temperature)
- Ratio-quantitative: the ratio of values has meaning (age, length, amount of money)
Further classifications
- Binary, discrete and continuous
- Discrete: finite or countably infinite set of values
- Continuous: real values
Nominal and ordinal attributes are typically discrete or binary, while interval and ratio attributes are continuous.
- Asymmetric attributes: only instances that take non-zero values are relevant
- Documents and texts: objects of the analysis are described by a vector of terms
- Transactions:
- Each record involves multiple items
- Items come from a finite set
- The number of items may vary from transaction to transaction
- Ordered data
Explorative Analysis
First step in business and data understanding. It refers to the preliminary analysis of the data, aimed at identifying its main characteristics.
- It helps you choose the best tool for processing and analysis
STATISTICS OVERVIEW
Frequency
The frequency of an attribute value is the percentage of times that value appears in the data set.
Mode
The mode of an attribute is the value that appears most frequently in the data set.
Percentile
Given an ordinal or continuous attribute x and a number p between 0 and 100, the p-th percentile is the value xp of x such that p% of the observed values of x are lower than xp.

Percentile visualization through boxplot enables the representation of a distribution of data. It can be used to compare multiple distributions when they have homogeneous magnitude.
Mean
The mean is the most common measure for locating a set of points.
- Sensitive to outliers
- It is preferred to use the median or a 'trimmed' mean
Median
The median is the term occupying the central place if the terms are odd; if the terms are even, the median is the arithmetic mean of the two central terms.
Range
Range is the difference between the minimum and maximum values taken by the attribute.
Variance and Standard Deviation
Variance and SD are the most common measures of dispersion of a data set.
- Sensitive to outliers since they are quadratically related to the concept of mean
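A small sketch of how these summary statistics can be computed in Python (pandas/numpy assumed; the values are invented):

```python
# Computing the summary statistics above on a toy attribute (values invented).
import numpy as np
import pandas as pd

grades = pd.Series([18, 22, 25, 25, 27, 28, 28, 28, 30, 30])

print(grades.value_counts(normalize=True))  # frequency of each value
print(grades.mode()[0])                     # mode: most frequent value
print(np.percentile(grades, 75))            # 75th percentile
print(grades.mean(), grades.median())       # mean vs median (median is robust to outliers)
print(grades.max() - grades.min())          # range
print(grades.var(), grades.std())           # sample variance and standard deviation
```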

Data Quality
The quality of the datasets profoundly affects the chances of finding meaningful patterns. The most frequent problems that deteriorate data quality are:
- Noise and outliers (objects with characteristics very different from all other objects in the data set)
- Missing values (not collecting the data is different from the attribute not being applicable); how to handle them (see the sketch after this list):
- Delete the objects that contain them
- Ignore missing values during the analysis
- Manually/automatically fill in the missing values
- ML can be applied to fill in a missing value by inferring it from the other values of that attribute and computing the most appropriate value
- Duplicated values (it may be necessary to introduce a data cleaning step in order to identify and eliminate redundancy)
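A short sketch of the handling options listed above, using pandas and scikit-learn; the small table is invented:

```python
# Handling missing values and duplicates on an invented toy table.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age":    [25, np.nan, 40, 40, 31],
                   "income": [1800, 2100, np.nan, np.nan, 2500]})

dropped = df.dropna()                            # option 1: delete objects with missing values
filled = df.fillna(df.mean(numeric_only=True))   # option 2: fill with a simple statistic

# Option 3: impute automatically (here with the per-attribute mean; other
# strategies, or an ML model, could be used instead).
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

deduplicated = df.drop_duplicates()              # data cleaning step for duplicated records
print(dropped, filled, imputed, deduplicated, sep="\n\n")
```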
Dataset Preprocessing
Rarely does the dataset have the optimal characteristics to be processed at best by machine learning algorithms. It is therefore necessary to put in place a series of actions to enable the algorithms of interest to function:
- Aggregation: combine two or more attributes into one attribute
- Sampling: main technique to select data
- Collecting and processing the entire dataset is too expensive and time consuming
- Simple Random Sampling (same probability of selecting each element)
- Stratified sampling (divides the data into multiple partitions and use simple random sampling on each partition)
- Before sampling a partitioning rule is applied (we inject knowledge about the domain)
- Allow the population to be balanced
- However, we are applying a distortion
- Sampling Cardinality: after choosing the sampling mode, it is necessary to fix the sample size in order to limit the loss of information
- Dimensionality reduction: the goal is to avoid the 'curse of dimensionality', reduce the amount of time and memory used by ML algorithms, simplify data visualization, and eliminate irrelevant attributes and noise in the data. Curse of dimensionality: as dimensionality increases, the data become progressively more sparse. Many clustering and classification algorithms rely on distances, and in high dimensions all elements tend to become equi-distant from one another; selecting the right dimensions to carry out the analysis is therefore crucial.

The curve indicates that the more we increase the number of dimensions, the smaller the ratio becomes. In the modeling phase, it is important to reduce dimensionality.
The goal is to reduce dimensionality and carry out analysis with the highest information amount.
- Principal Component Analysis (PCA): a projection method that transforms objects from a p-dimensional space into a k-dimensional space (k < p) in such a way as to preserve as much of the original information as possible.
- Attribute selection: another way to reduce the dimensionality of data; the selection usually aims to eliminate redundant or irrelevant attributes.
We can use different attribute selection techniques:
- Exhaustive approaches
- Non-exhaustive approaches
- Feature engineering (create new features): we have raw data and we can extract useful KPIs by designing new attributes.
- Discretization and binarization: transformation of continuous-valued attributes into discrete-valued attributes. Discretization techniques can be unsupervised (they do not exploit knowledge about the class to which elements belong) or supervised (they exploit it); a minimal sketch follows this list.
- Unsupervised: equi-width, equi-frequency, K-means
- Supervised: discretization intervals are positioned so as to maximize the 'purity' of the intervals
- Entropy and Information Gain: entropy is a measure of the uncertainty about the outcome of an experiment that can be modeled by a random variable x. The entropy of a certain event is zero. The entropy of a discretization into n intervals depends on how pure each group is.
- Binarization: we start with a discrete attribute but we need it to be binary.
- Attribute transformation: function that maps the entire set of values of an attribute to a new set such that each value in the starting set corresponds to a unique value in the ending set.
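A minimal sketch of unsupervised discretization and of the entropy-based purity measure mentioned above (scikit-learn's KBinsDiscretizer is assumed; the data and class labels are invented):

```python
# Unsupervised discretization (equi-width) and entropy of the resulting intervals.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[18], [22], [25], [31], [35], [42], [47], [55], [61], [70]])
labels = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 0])   # invented class labels

# 'uniform' = equi-width, 'quantile' = equi-frequency, 'kmeans' = K-means based bins.
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
bins = disc.fit_transform(ages).ravel().astype(int)

def entropy(y):
    """Entropy of a class distribution; 0 means a perfectly pure interval."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

# Weighted entropy of the discretization: the purer each interval, the lower it is.
total = sum((bins == b).sum() / len(bins) * entropy(labels[bins == b])
            for b in np.unique(bins))
print(bins, total)
```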
Similarity and Dissimilarity
These two concepts are central in Machine Learning, as it is important to group clusters based on similarity and dissimilarity.
Some techniques work better with certain distance measures; choosing the wrong distance can cause problems.
- Similarity: a numerical measure expressing the degree of similarity between two objects
- Takes values in the range [0, 1]
- Dissimilarity (distance): a numerical measure expressing the degree of difference between two objects
- Takes values in the range [0, 1] or [0, ∞)

Distance

Distance Properties
Given two objects p and q and a dissimilarity measure d():
- d(p,q) ≥ 0, and d(p,q) = 0 only if p = q -> Positivity
- d(p,q) = d(q,p) -> Symmetry
- d(p,r) ≤ d(p,q) + d(q,r) -> Triangle inequality

Similarity Properties
Given two objects p and q and a similarity measure s():
- s(p,q) = 1 only if p=q
- s(p,q) = s(q,p) -> Symmetry
Binary Vector Similarities
It is common for attributes describing an object to contain only binary values.
- M01 = the number of attributes where p=0 and q=1
- M10 = the number of attributes where p=1 and q=0
- M00 = the number of attributes where p=0 and q=0
- M11 = the number of attributes where p=1 and q=1
Cosine Similarity
Like Jaccard's index, it does not consider 0-0 matches, but it also allows non-binary vectors to be operated on.
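A numpy sketch of the measures just described (Euclidean distance, simple matching and Jaccard coefficients for binary vectors, cosine similarity); the vectors are invented:

```python
# Distance and similarity measures on invented vectors.
import numpy as np

x = np.array([3.0, 0.0, 1.0, 4.0])
y = np.array([1.0, 2.0, 0.0, 4.0])
print(np.linalg.norm(x - y))                       # Euclidean distance

p = np.array([1, 0, 0, 1, 1, 0])                   # binary attribute vectors
q = np.array([1, 1, 0, 0, 1, 0])
m11 = np.sum((p == 1) & (q == 1))
m00 = np.sum((p == 0) & (q == 0))
m10 = np.sum((p == 1) & (q == 0))
m01 = np.sum((p == 0) & (q == 1))
smc = (m11 + m00) / (m11 + m00 + m10 + m01)        # simple matching coefficient
jaccard = m11 / (m11 + m10 + m01)                  # Jaccard: ignores 0-0 matches
print(smc, jaccard)

cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))  # cosine similarity
print(cosine)
```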
Similarity with Heterogeneous Attributes
In the presence of heterogeneous attributes, it is necessary to compute the similarities separately and then combine them so that their result belongs to the range [0, 1]
Correlation
The correlation between pairs of objects described by attributes (binary or continuous) is a measure of the existence of a linear relationship between its attributes.


Ended: Data Understanding
Decision Tree ↵
Decision Tree
It is one of the most widely used classification techniques. It is simple, it can be trained with a limited number of examples, it is understandable and works well with categorical attributes.
The usage of this model is characterized by a set of questions (yes/no), which build the tree. The idea is that the number of possible decision trees is exponential and we are looking for the best one (the one that creates the most accurate representation).
All classification algorithms are systems that work in a multidimensional space and try to find regions that contain objects of the same type (belonging to the same class).

Learning the Model
Many algorithms are available, but we will use C4.5.
Hunt's Algorithm: it is a recursive approach that progressively subdivides a set of records Dt into progressively purer record sets.
Procedure to follow:
- If Dt contains records belonging to the class yj only, then t is a leaf node with label yj
- If Dt is an empty set, then t is a leaf node to which the class of the parent node is assigned
- If Dt contains records belonging to several classes, choose an attribute and a split policy to partition the records into multiple subsets.
- Apply recursively the current procedure to each subset
TreeGrowth(E, F)
  if StoppingCond(E, F) = TRUE then
    leaf = CreateNode();
    leaf.label = Classify(E);
    return leaf;
  else
    root = CreateNode();
    root.test_cond = FindBestSplit(E, F);
    let V = {v | v is a possible outcome of root.test_cond};
    for each v ∈ V do
      Ev = {e | root.test_cond(e) = v and e ∈ E};
      child = TreeGrowth(Ev, F);
      add child as a descendant of root and label the edge (root -> child) as v;
    end for
  end if
  return root;
end
Characteristic Feature
Starting from the basic logic to completely define an algorithm for building decision trees, it is necessary to define:
- The split condition (depends on the type of attribute and on the number of splits)
- Nominal (N-ary split vs binary split)
- Ordinal (partitioning should not violate the order sorting)
- Continuous (the split condition can be expressed as a Boolean with N-ary split and as a binary comparison test with binary-split)
- Static (discretization takes place only once before applying the algorithm)
- Dynamic (discretization takes place at each recursion)
- The criterion defining the best split (it must allow you to determine more pure classes, using a measure of purity)
- The criterion for interrupting splitting (if one of the following conditions applies, the splitting stops)
- When all its records belong to the same class
- When all its records have similar values on all attributes
- When the number of records in the node is below a certain threshold
- When the selected criterion would not be statistically relevant
- Methods for evaluating the goodness of a decision tree
Metrics for Model Evaluation
Confusion Matrix evaluates the ability of a classifier based on the following indicators:
- TP (true positive)
- FN (false negative)
- FP (false positive)
- TN (true negative)
Accuracy is the most widely used metric to synthesize the information of a confusion matrix

- Accuracy Limitations
Accuracy is not an appropriate metric if the classes contain a very different number of records.
Precision and Recall are two metrics used in applications where the correct classification of positive class records is more important
- Precision measures the fraction of records classified as positive that are actually positive
- Recall measures the fraction of positive records that are correctly classified

F-measure is a metric that summarizes precision and recall
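The usual formulas are Accuracy = (TP+TN)/(TP+TN+FP+FN), Precision = TP/(TP+FP), Recall = TP/(TP+FN) and F-measure = 2·Precision·Recall/(Precision+Recall). A small sketch of how the same metrics can be obtained with scikit-learn (the label vectors below are invented):

```python
# Confusion-matrix metrics with scikit-learn on invented true/predicted labels.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn, fp, tn)                       # TP=3, FN=1, FP=1, TN=5
print(accuracy_score(y_true, y_pred))       # (TP+TN)/total = 0.8
print(precision_score(y_true, y_pred))      # TP/(TP+FP) = 0.75
print(recall_score(y_true, y_pred))         # TP/(TP+FN) = 0.75
print(f1_score(y_true, y_pred))             # harmonic mean of precision and recall
```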
Cost-Based Evaluation: accuracy, precision, recall and F-measure classify an instance i as positive if P(+|i) > P(-|i). They assume that FN and FP have the same weight, and are therefore cost-insensitive; in many domains this is not true.

ROC Space (Receiver Operator Characteristics)
ROC graphs are two-dimensional graphs that depict the relative tradeoff between benefits (TP rate) and costs (FP rate) induced by a classifier. We distinguish between:
- Probabilistic classifiers return a score that is not necessarily a probability sensu stricto but represents the degree to which an object is a member of one particular class rather than another one
- Discrete classifier predicts only the classes to which a test object belongs

Classification Errors
- Training error: mistakes that are made on the training set
- Generalization error: errors made on the test set
- Underfitting: the model is too simple and does not allow a good classification of either the training set or the test set
- Overfitting: the model is too complex, it allows a good classification of the training set, but a poor classification of the test set
- Due to noise (the boundaries of the areas are distorted)
- Due to the reduced size of the training set
How to handle overfitting
- Pre-pruning: stop splitting before reaching a deep tree. A node cannot be split further if:
- The node does not contain instances
- All instances belong to the same class
- All attributes have the same values
- Post-pruning: grow the complete tree, then prune its nodes to reduce the generalization error
Post-pruning is more effective but involves a higher computational cost. It is based on the evidence given by the result of a complete tree.
Estimate Generalization Error
A decision tree should minimize the error on the real data set; unfortunately, during construction only the training set is available.
The methods for estimating the generalization error are:
- Optimistic approach
- Pessimistic approach
- Minimum Description Length (choose the model that minimizes the cost to describe a classification)
- Using the test set
Building the Test Set
- Holdout: use 2/3 of the records for training and 1/3 for validation
- Random subsampling: repeated execution of the holdout method, in which the training dataset is randomly selected each time
- Cross validation: partition the records into k separate subsets, run the training on k-1 subsets and test on the remaining one, repeat k times and compute the average accuracy
- Bootstrap: records are sampled with replacement for training, and the records that are never extracted form the validation set. This method does not create a new dataset with more information, but it can stabilize the results obtained from the available dataset.
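A sketch of holdout and cross-validation with scikit-learn (the Iris dataset is used just as a stand-in):

```python
# Holdout split and k-fold cross-validation with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Holdout: roughly 2/3 of the records for training, 1/3 for validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3,
                                                    random_state=0, stratify=y)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(tree.score(X_test, y_test))

# Cross-validation: k=5 partitions, train on k-1, test on the remaining one.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores.mean())            # average accuracy over the k folds
```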
C4.5 (J48 on Weka)
This algorithm exploits the GainRatio approach. It manages continuous attributes by determining a split point that divides the range of values into two. It manages data with missing values and runs post-pruning of the created tree.
Ended: Decision Tree
Classifier Models ↵
Rule-Based Classifier
The basic idea is to classify records using rule sets of the type "if .. then". The condition used with 'if' is called the antecedent while the predicted class of each rule is called the consequent.
A rule has the form: (condition) -> y
Building a model means identifying a set of rules

Coverage and Accuracy
We can have very accurate rules but with low coverage, which is not that relevant. Given a dataset D and a classification rule A -> y, we define:
- Coverage as the portion of records satisfying the antecedent of the rule
- Coverage = |A|/|D|
- Accuracy as the fraction of records that, by satisfying the antecedent, also satisfy the consequent
- Accuracy = |A ∩ y|/|A|
A set of rules R is said to be mutually exclusive if no pair of rules can be activated by the same record.
A set of rules R has exhaustive coverage if there is one rule for each combination of attribute values.
Properties
- It is not always possible to determine an exhaustive and mutually exclusive set of rules
- Lack of mutual exclusivity
- Lack of exhaustiveness
Rule Sorting Approach
- Rule-based sorting (individual rules are sorted according to their quality)
- Class-based sorting (groups of rules that determine the same class appear consequently in the list)

Sequential Covering
set R = Ø
for each class y ∈ Y0 - {yk} do   // all classes except the default one yk
  stop = FALSE;
  while !stop do
    r = LearnOneRule(E, A, y);
    remove from E the training records that are covered by r;
    if Quality(r, E) < Threshold then
      stop = TRUE;
    else
      R = R ∪ {r};   // add r at the bottom of the rule list
  end while
end for
R = R ∪ {{} -> yk};   // add the default rule at the bottom of the rule list
PostPruning(R);
Dropping instances from Training Set
Deleting instances from the training set serves the purpose of:
- Properly classified instances: to avoid generating the same rule again and again, and to avoid overestimating the accuracy of the next rule
- Incorrectly classified instances: to avoid underestimating the accuracy of the next rule

Learn-One-Rule
We want rules that are general (even with a lower accuracy). The goal of the algorithm is to find a rule that covers as many positive examples as possible and as few negative examples as possible.
Rules are constructed by progressively considering a new possible predicate.
- In order to choose which predicate to add, a criterion is needed:
- n = number of instances covered by the rule
- nr = number of instances properly classified by the rule
- k = number of classes
Accuracy(r) = nr/n
Some metrics (like the FOIL gain) favor rules that cover a large number of positive examples.
Stop criterion: as soon as the rule is no longer relevant, stop.
Rule Pruning: it aims at simplifying rules to improve rule generalization error. It can be useful given that the construction approach is greedy.
example: remove the predicate whose removal results in the greatest improvement in error rate on the validation set
The RIPPER Method
It is an approach based on sequential covering for 2-class problems: one of the classes is chosen as the positive class and the other as the negative class.
The idea is to compute the description length (the cost of transmitting the data set from one user to another) and, if it exceeds a threshold, to stop.

Instance-Based Classifier
These classifiers do not build models but classify new records based on their similarity to the examples in the training set.
They are called lazy learners, as opposed to eager learners (rule-based classifiers, decision trees).

K-Nearest Neighbor
K-Nearest Neighbor is a simple algorithm that stores all the available cases and classifies the new data or case based on a similarity measure.
It is mostly used to classify a data point based on how its neighbors are classified.
Requirements:
- A training set
- A metric to calculate the distance between records
- The value of k (the number of neighbors to be used)
The classification process computes the distance to the records in the training set, identifies the k nearest neighbors and uses their class labels to determine the class of the unknown record.
The choice of k is important because:
- If k is too small, the approach is sensitive to noise
- If k is too large, the neighborhood may include examples belonging to other classes
Remember to normalize attributes in pre-processing, because to operate correctly, they should have the same scale of values.
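A minimal K-NN sketch with scikit-learn, including the normalization step mentioned above (the Iris dataset is used just as an example):

```python
# K-Nearest Neighbor with attribute normalization (scikit-learn pipeline).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling first, so that all attributes have the same range of variation,
# then classification by majority vote among the k nearest neighbors.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
print(knn.predict(X_test[:3]))   # class labels of the first 3 test records
```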
Pros of KNN:
- Do not require the construction of a model
- Compared with rule-based or decision tree systems, they allow the construction of nonlinear class boundaries (more flexible)
Cons of KNN:
- Require a similarity or distance measure to assess closeness
- Require a pre-processing step to normalize the range of variation of attributes
- Class is locally determined and therefore susceptible to data noise
- Very sensitive to the presence of irrelevant or correlated attributes that distort the distances between objects
- Classification cost can be high and depends linearly on the size of the training set in the absence of appropriate index structures
The R-Tree Index Structure
R-trees are extensions of B+-trees to multi-dimensional spaces:
- B+-trees organize objects into a set of non-overlapping one-dimensional intervals, applying this principle recursively from the leaves to the root
- R-trees organize objects into a set of overlapping multi-dimensional intervals, applying this principle recursively from the leaves to the root

Bayesian Classifier
It is a probabilistic approach to solving classification problems. In many applications, the relationship between the attribute values and the class is not deterministic, due to noisy data, hidden variables and the difficulty of quantifying certain aspects.
- Uncertainty about the outcome prediction
Bayesian classifiers model probabilistic relationships between the attributes and the classification attribute.

Naïve Bayes
The main advantage of probabilistic reasoning over logical reasoning lies in the possibility of arriving at rational descriptions even when there is not enough deterministic information about how the system works.
This classifier is robust toward irrelevant attributes.
It provides optimal results if:
- The conditional independence condition is met
- The probability distributions of P(X|Y) are known

Probability with Continuous Attributes
In case attribute A is continuous, it is not possible to estimate probability for each of its values.
We need to discretize the attribute into intervals, creating an ordinal attribute. If too many intervals are used, the limited number of training set events per interval makes the prediction unreliable.
We associate the attributes with a density function and estimate the parameters of the function from the training set to estimate P(A|C).
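This is what Gaussian Naive Bayes does: it models each continuous attribute with a normal density whose parameters are estimated from the training records of each class. A scikit-learn sketch (Iris again as a stand-in):

```python
# Naive Bayes with continuous attributes: each P(A|C) is modeled as a Gaussian
# whose parameters are estimated from the training records of class C.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print(nb.theta_[0])                  # estimated per-attribute means for class 0
print(nb.score(X_test, y_test))      # accuracy on the test set
print(nb.predict_proba(X_test[:1]))  # posterior probabilities P(C|x) for one record
```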
Multi-Classifier
Construct multiple base classifiers and predict the class to which a record belongs by aggregating the predictions obtained.
How to build a composite classifier
- Changing the training set by building several training sets from the given one
- Changing the attributes (random forest)
- Changing the classes considered (translating a multi-class classification into a binary one)
- Changing the parameters of the learning algorithm
Error Decomposition
Classifiers make mistakes in predictions, due to:
- Bias: the ability of the chosen classifier to model events and extend the prediction to events not in the training set
- Variance: the capability of the training set to represent the actual data set
- Noise: non-determinism of the classes to be determined
Different types of classifiers have inherently different capabilities in modeling the edges of regions. The difference between the true separation line and the average separation line represents the classifier bias.
Bagging (variance)
Bagging builds compound classifiers that assign an event to the class that receives the most votes from the base classifiers.
Each classifier is constructed by bootstrapping the same training set.
bootstrapping: any test or metric that uses random sampling with replacement

Bagging determines the behavior of a two-level decision tree.

Random Forest
It is a bagging method in which the base classifiers are decision trees: for each node in the decision tree, the split attribute is chosen from a random subset of features rather than from the entire set of features.
Random forest performs two types of bagging: one on the training set and one on the feature set.
Boosting
An iterative approach to progressively adjust the composition of the training set in order to focus on incorrectly classified records.
- Initially, all N records have the same weight (1/N)
- Unlike bagging, the weights can change at the end of the boosting round in order to increase the probability of the record being selected in the training set
The final result is obtained by combining the predictions made by the different classifiers.
One of the most widely used boosting techniques is AdaBoost.
AdaBoost concentrates on the most complex part of the dataset.
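A sketch comparing the three ensemble strategies with scikit-learn (bagging, random forest, AdaBoost); the dataset is a synthetic stand-in:

```python
# Bagging, Random Forest and AdaBoost compared on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),        # bootstrap on records
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),  # records + random feature subsets
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),      # reweights hard records each round
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```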

Ended: Classifier Models
Association Rule ↵
Association Rule
The idea behind association rules comes from basket analysis. We have a set of transactions, each being a set of elements coming from a large item collection.
A classic example of association rule coming from data mining literature is the association {Diaper} -> {Beer}
Applications
- Marketing sales promotion (understand which products could be affected in the event that the store interrupted the sale of a specific product)
- Arrangement of goods (to identify the products bought together by a sufficiently large number of customers)


Problem Formulation
Given a set of transactions T, you want to find all the rules such that:
- Support >= minsup
- Confidence >= minconf
Naive approach:
- List all possible association rules
- For each rule, calculate support and confidence
- Eliminate rules that do not meet the minsup and minconf thresholds

All rules are binary partitions of the same itemset: {Milk, Diaper, Beer} and rules based on the same itemset have the same support but they may have different confidence.
Searching for association rules follows a two-step approach (sketched below):
- Generate frequent itemsets
- Rule generation (for each itemset, generate the rules with high confidence. Each rule is a binary partitioning of the elements in the itemset)
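A small, plain-Python sketch of how support, confidence and lift are computed for a candidate rule on the classic market-basket example (the transaction list is invented):

```python
# Support and confidence of {Milk, Diaper} -> {Beer} on invented transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Cola"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Cola"},
]

def support(itemset):
    """Fraction of transactions containing all the items of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"Milk", "Diaper"}, {"Beer"}
supp = support(antecedent | consequent)   # support of the whole itemset
conf = supp / support(antecedent)         # confidence of the rule
lift = conf / support(consequent)         # > 1 means positive correlation
print(supp, conf, lift)                   # 0.4, 0.666..., 1.111...
```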
Frequent Itemsets Generation
Frequent itemsets can be identified following the apriori principle:
If an itemset is frequent, then all its sub-sets must be frequent too.
- The support of an itemset does not exceed the support of its subsets
- This is known as the anti-monotonic property of the support
A frequent itemset is maximal if none of its immediate supersets is frequent. On the other hand, an itemset is closed if none of its immediate supersets has the same support.
Closed vs Maximal itemsets
From an efficiency point of view:
- Maximal and closed itemsets provide a more compact representation than frequent itemsets, which is relevant when space is an issue.
- Only closed itemsets determine a lossless compression of frequent patterns: they contain complete information regarding the frequent itemsets while being fewer in number.
- From the semantic point of view, maximal itemsets are the most complex while closed ones can be interesting if they are supported by groups with largely different support
Rule Generation
Given a frequent itemset L, find all the non-empty subsets f ⊂ L such that the rule f -> L - f fulfills the minimum confidence constraint.
The confidence measure does not have the property of anti-monotonicity with respect to the overall associative rule. However, it is possible to take advantage of the anti-monotonicity of the confidence with respect to the left-hand side of the rule.
Interestingness
The objective measure is to prioritize rules based on statistical criteria calculated from data. The subjective measure is to prioritize rules based on user-defined criteria.
According to the latter, a pattern is interesting if:
- It contradicts the users' expectations
- The user is interested in performing some activities or making decisions regarding its elements
Dataset Support
Inhomogeneous Support
Many datasets contain itemsets with very high support along with others with very limited support. For example, a large commercial chain sells products with prices ranging from €1 to €10,000. The number of transactions that include low-price products is much higher than those including high-price ones; however, the associations among the latter are of interest to the company.
Setting the minsup threshold for these datasets can be very difficult.
Cross-support Pattern
A cross-support pattern is an itemset X = {i1, ..., in} whose support ratio r(X) is lower than a threshold thr.

Confidence Limits
The case of cross-support patterns has shown the limits of support.
The limitation of confidence is due to the fact that it does not consider the support of the itemset in the right-hand side of the rule; therefore, it does not provide a correct assessment when the item groups are not stochastically independent.
The lift value of an association rule is the ratio of the confidence of the rule and the expected confidence of the rule.
Handling Categorical and Continuous Attributes
In its basic formulation, association rule works with binary and asymmetric variables.
Binarization is needed to transform categorical attributes into asymmetric binary attributes by introducing a new item for every possible attribute value.
Association rules that include attributes with continuous values are called quantitative association rules. Continuous attributes can be handled through several approaches:
- Based on discretization
- Based on statistics
- Without discretization
The discretization poses the problem of how to fix the number and the borders of the ranges. The number of intervals is usually supplied by the user and can be expressed in terms of:
- Range of intervals (equi-distant discretization)
- Average number of transactions per interval (equi-depth discretization)
- Cluster number
The choice of the width of the intervals affects the value of support and confidence:
- Too large intervals reduce confidence
- Too narrow intervals reduce support and tend to determine replicated rules

One possible solution is to try all possible intervals (brute force).
Multi-level Association Rules
A hierarchy of concepts is composed of generalizations based on the semantics of its elements.
The higher the level, the higher the support, leading to generic rules.
Hierarchies of concepts are incorporated for the following reasons:
- Rules at the lower levels may not have sufficient support to appear in frequent itemsets
- Rules at lower levels may be too specific
Multi-level associative rules can be handled with the algorithms already studied by extending each transaction with the parent items of items in the transaction.
Adding details (skimmed milk, white bread) does not add any value and increases complexity.

Sequential Pattern
Often, temporal information is associated with transactions, allowing events concerning a specific subject to be linked together.
A sequence is an ordered list of elements, each of which contains a set of events (items). Each item is associated with a specific time instant or ordinal position.
The length of the sequence is given by the number of elements in it.
Sub-sequence
We have sequences (like the ordered purchase lists of the same customer); sub-sequences are sequences contained in a sequence, where the mapping respects the order.

The support of a sub-sequence w is defined as the fraction of sequences that contain w.
Mining Sequential Pattern
Given a database of sequences and a minimum support threshold minsup, find all subsequences whose support is >= minsup.
Apriori Principle can be applied to sequential pattern mining since any sequence s that contains a particular k-sequence must contain all (k - 1) subsequences of s.
The steps to follow in this process, include:
- Run an initial scan of the sequence DB to locate all frequent 1-sequences
- Repeat until new frequent sequences are discovered
- Candidate generation: merge pairs of frequent subsequences found in step k-1 to generate candidate sequences containing k items
- Candidate pruning: eliminate k candidate sequences that contain (k-1) subsequences that are not frequent
- Support counting: scan the DB to find the support of candidate sequences
- Candidate elimination: eliminate candidate k-sequences whose support is actually less than minsup

Searching for sequential patterns is a difficult problem given the exponential number of subsequences contained in a sequence.
Temporal Constraints
Temporal constraints increase the expressiveness of sequential pattern by better defining their structure.
- MaxSpan defines the maximum time interval between the first and the last sequence element
- MinGap defines the minimum gap between events belonging to two different elements
- MaxGap defines the maximum gap between events belonging to two consecutive elements
Sequence Mining with Temporal Constraints
Temporal constraints impact sequence supports: some patterns counted as frequent may not actually be frequent, because some of the sequences in their support may violate a time constraint.
It is necessary to modify the counting technique to account for this problem.
The Time Window Size (ws), conversely, relaxes the basic definition of support, as it specifies the interval within which two events occurring at different times should be considered simultaneous.
Outlier Detection
An anomaly is a pattern in the data that does not conform to expected behavior.
Anomalies can be caused by different aspects:
- Data from different classes (an object may be different because it belongs to different class).
- Natural variations (many phenomena can be modeled with probabilistic distributions in which there is a probability that a phenomenon with very different characteristics from others will occur).
- Measurement errors (due to human or device errors).
We can identify different types of anomalies:
- Point anomaly (an individual data instance is anomalous with respect to the rest of the data)
- Contextual anomaly (a single instance of data is anomalous within a context)
- Collective anomaly (a set of related instances is anomalous and requires a relationship between data instances)
Outliers application
Data from different classes can, for example, be used to identify the different purchase pattern followed by fraudsters who stole a credit card. Similarly, intrusion detection systems monitor events occurring in a computer system and analyze them for signs of intrusion.
In the healthcare sector, outliers can be useful to detect abnormal data, disease outbreaks or instrumentation errors.
Finally, in the industrial sector, anomaly detection can be used to identify failures and malfunctions in complex industrial systems, intrusions in security systems, suspicious events in video surveillance and abnormal energy consumption.
Anomaly detection follows different approaches:
- Supervised anomaly detection, where labels are available for both normal and anomaly data.
- Semi-supervised anomaly detection, where labels are available for normal data.
- Unsupervised anomaly detection, where labels are not available and validation is complex since the real number of anomalies is unknown.
Anomaly Detection Outputs
- Label, each test instance is assigned a label which is the outcome of classification-based approaches.
- Score, each test instance is assigned an anomaly score which allows the instance to be sorted
Anomalies are rare events, which makes it difficult to label them with high accuracy. Swamping is the error of labelling normal events as anomalies, while masking is the error of labelling anomalous events as normal.
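As an example of the unsupervised setting, here is a short sketch with scikit-learn's Isolation Forest, one possible anomaly detector among many (the data are synthetic):

```python
# Unsupervised anomaly detection with an Isolation Forest on synthetic data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # normal behavior
outliers = rng.uniform(low=-6, high=6, size=(5, 2))      # a few anomalous points
X = np.vstack([normal, outliers])

detector = IsolationForest(contamination=0.03, random_state=0)
labels = detector.fit_predict(X)        # -1 = anomaly, +1 = normal
scores = detector.score_samples(X)      # lower score = more anomalous
print(np.where(labels == -1)[0])        # indices flagged as anomalies
```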
Ended: Association Rule
Clustering ↵
Clustering
Clustering analysis aims at finding groups of objects such that objects that belong to the same group are more similar to each other than objects belonging to different groups.
Clustering is NOT supervised classification (which assumes classes to be known), segmentation (where the partition rule is given) or querying a database (where the selection and grouping criteria are given).
Types of clustering
We can distinguish between:
- Partitioning clustering: a division of objects into non-overlapping subsets (clusters), in which each object belongs to exactly one cluster.
- Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.
- Exclusive vs non-exclusive: in non-exclusive clustering, points can belong to multiple clusters.
- Fuzzy vs non-fuzzy: in a fuzzy clustering a point belongs to all clusters with a weight between 0 and 1.
- Partial vs complete: in a partial clustering, some points may not belong to any of the clusters.
- Heterogeneous vs homogeneous: in a heterogeneous clustering, clusters can have very different sizes, shapes and densities.
Similarly, we can identify different types of clusters:
- Well-separated clusters: each point in the cluster is closer (more similar) to any other point in the cluster than any other point that does not belong to the cluster.
- Center-based clusters: each point in the cluster is closer to the center of its own cluster than to the center of any other cluster.
- Cluster center = centroid
- Contiguous clusters (nearest neighbor): a point in a cluster is closer to one or more other points in the cluster than to any point not in the cluster.
- Density-based clusters: a cluster is a dense region of points, which is separated by low-density regions from other regions of high density.
- Conceptual clusters: clusters with shared properties or in which the shared property derives from the whole set of points.
K-means Clustering

Initial centroids are often chosen randomly (clusters produced vary from one run to another).

Convergence and Optimality
There is only a finite number of ways to partition n records into k groups. So, there is only a finite number of possible configurations in which all the centers are centroids of the points they possess.
If there are K real clusters, the probability of choosing a centroid from each cluster is very limited.
Some solutions to this problem, include:
- Run the algorithm several times with different centroids.
- Perform a sampling of the points and use a hierarchical clustering to identify k initial centroids.
- Select more than k initial centroids and then select the ones to use from these.
- Use post-processing techniques to eliminate the identified erroneous cluster.
- Bisecting K-means (less affected by the problem).
Handling empty clusters
The K-means algorithm can determine empty clusters if, during the assignment phase, no element is assigned to a centroid.
In this case, different strategies can be used to identify an alternative centroid:
- Choose the item that most contributes to the value of SSE (sum of squared errors).
- Choose an item of the cluster with the highest SSE (the cluster will split into two clusters that include the closest elements).
Handling Outliers
The goodness of clustering can be negatively influenced by the presence of outliers, which tend to shift the cluster centroids in order to reduce the increase in SSE that they cause.
Outliers, if identified, can be eliminated during preprocessing.
Limits of K-means
The k-means algorithm does not achieve good results when natural clusters have:
- Different sizes (SSE minimization leads to the identification of centroids that produce clusters of similar size if the clusters are not well-separated)
- Different densities (denser clusters lead to smaller intra-cluster distances, so less dense areas require more centroids to minimize the total SSE)
- Non-globular shape (SSE is based on an Euclidean distance that does not take into account the shape of objects)
- Data contains outliers
Possible Solutions:
- Use a higher k value, thus identifying portions of clusters
- The definition of natural clusters then requires a technique to bring together the identified clusters
- Elbow method: execute k-means several times with increasing values for k
With the elbow method, the error initially decreases rapidly, but at some point the curve flattens out because adding clusters only slightly reduces the intra-cluster error.
The natural number of clusters is located at the point where the curve shows the elbow (a sketch follows).
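A sketch of K-means and of the elbow method using scikit-learn (synthetic blobs stand in for real data); the SSE of each clustering is exposed as `inertia_`:

```python
# K-means and the elbow method: inspect the SSE (inertia) for growing k.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)  # n_init: several random restarts
    print(k, km.inertia_)   # SSE drops sharply until k=4, then flattens (the elbow)
```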

Hierarchical Clustering
Hierarchical clustering produces a set of nested clusters organized as a hierarchical tree (it can be visualized as a dendrogram).
There are two approaches to build a hierarchical clustering:
- Agglomerative (start with the points as individual clusters and, at each step, merge the closest pair of clusters until only one cluster is left).
- Divisive (start with one, all-inclusive cluster and, at each step, split the cluster until each cluster contains an individual point).
The key operation is the computation of the proximity of two clusters. Different approaches to defining the distance between clusters distinguish the different algorithms.
Inter-cluster Distances
- MIN: minimum distance between two cluster points
- MAX: maximum distance between all cluster points
- Group Average: average distance between all the cluster points

Using MIN links, we can have non-globular clusters.

Having the MAX distance as reference distance, we will get more globular clusters and once they are put together, they cannot be split (greedy).
Computation Complexity
With hierarchical clustering:
- O(N^2) space is required by the proximity matrix, where N is the number of points.
- O(N^3) time in many cases: N steps are needed to build the dendrogram, and at each step the proximity matrix must be updated and searched.
- Prohibitive for large datasets
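A sketch of agglomerative clustering with SciPy, using the three inter-cluster distances above (single = MIN, complete = MAX, average = group average); the data are synthetic:

```python
# Agglomerative hierarchical clustering with different inter-cluster distances.
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

for method in ("single", "complete", "average"):     # MIN, MAX, group average
    Z = linkage(X, method=method)                    # builds the dendrogram bottom-up
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut it into 3 flat clusters
    print(method, labels[:10])
```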
DBSCAN
DBSCAN is a density-based approach, where density refers to the number of points within a specified radius (Eps).

A point is a core point if it has at least a specified number of points (MinPts) within Eps (core points are in the interior of a cluster).
A border point is not a core point but is in the neighborhood of a core point.
A noise point is any point that is neither a core point nor a border point.
DBSCAN algorithm
// Input: Dataset D, MinPts, Eps
// Output: set of cluster C
Label points in D as core, border or noise
Drop all noise points
Assign to the same cluster all core points that are within a distance Eps of one another
Assign each border point to one of the clusters of its associated core points
Pros and Cons of DBSCAN:
- Pros:
- Resistant to noise
- It can generate clusters with different shapes and sizes
- Cons:
- It struggles with data of high dimensionality
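A scikit-learn sketch on a non-globular synthetic dataset (two interleaved half-moons), a case where DBSCAN succeeds while K-means would not; Eps and MinPts map to the `eps` and `min_samples` parameters:

```python
# DBSCAN on non-globular clusters; label -1 marks noise points.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5)   # Eps radius and MinPts threshold
labels = db.fit_predict(X)
print(np.unique(labels))              # e.g. [0 1] (plus -1 if any noise points)
print(len(db.core_sample_indices_))   # number of core points found
```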
Cluster Validity
For supervised classification techniques, there are several measures to evaluate the validity of the results based on the comparison between the known labels of the test set and those calculated by the algorithm.
Validity Measures:
Numerical measures that are applied to judge various aspects of cluster validity, are classified into the following three types:
- Internal index (used to measure the goodness of a clustering structure without respect to external information)
- External index (used to measure the extent to which cluster labels match externally supplied class labels)
- Relative index (used to compare two different clusterings)
Internal Measures:
- Cluster Cohesion (measures how closely related the objects in a cluster are - SSE)
- Cluster Separation (measure how distinct or well-separated a cluster is from other clusters)

Validity can be measured via correlation:
- Two matrices are used
- Proximity Matrix
- Incidence Matrix
- Compute the correlation between the two matrices
- High correlation indicates that points that belong to the same clusters are close to each other
- Not a good measure for some density based clusters
- Correlation between the incidence matrix and the proximity matrix on the results of the k-means algorithm on two data sets
Cophenetic Distance

To define whether the measures obtained are good or bad, we need to define some KPIs obtained by comparing our results with the results obtained with random data.
We are looking for non-random patterns: the more atypical the result we get, the more likely it is to represent a non-random pattern in the data.
The issue of interpreting the measure value is less pressing when comparing the results of two clustering.

External measures for clustering validation:
External information usually consists of the class labels of the objects on which clustering is performed. It allows you to measure the correspondence between the computed cluster label and the class label.
If class labels are available, we perform clustering to compare the results of different clustering techniques and evaluate the possibility of automatically obtaining an otherwise manual classification.
Two approaches are possible:
- Classification-oriented (evaluate the extent to which clusters contain objects belonging to the same class)
- Similarity-oriented (they measure how often two objects belonging to the same cluster also belong to the same class)
Ended: Clustering
Machine Learning ↵
Introduction to Machine Learning
What are the main features of intelligence?
Intelligence is a very general mental capability that, among other things, involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience.
Artificial Intelligence is a huge set of disciplines which also includes machine learning. With machine learning, we refer only to a small subset inside artificial intelligence.

We, as humans, take for granted many activities that would be very complex for machines. Simulating human intelligence is extremely complex, as our brain is an incredibly sophisticated machine of which we still know few aspects.
Impact of AI in our world

For both graphs, we can note the exponential trend and the variety of continents covered.
Information is the oil of the 21st century, and analytics is the combustion engine
The number of job openings for data scientists is constantly increasing, as companies need to extract knowledge from data to survive.
The revolution introduced by AI is reflected also in companies. At least 5 of the top 10 world companies are directly related to AI. Also, at least 2 companies are directly related to the production of chips, key elements for AI.
Nvidia
It is a software and fabless chip company that designs GPUs, which nowadays are essential to:
- Create AI models
- Perform High Performance Computing (HPC)
Nvidia is the leading company in the sector, and this is the reason why its share price has risen significantly in recent years.
The General Paradigm of Machine Learning
Machine Learning is a subset of the AI field that tries to develop systems able to automatically learn from specific examples (training data) and to generalize the knowledge on new samples (testing data) of the same domain.
From a practical point of view:
- We have some data which represent our application domain
- We implement an algorithm able to learn from the data (training phase)
- We use new data to check whether the trained model has learned something (testing phase) -> model deployment
The main steps for the development of intelligent systems based on ML:

Data Acquisition
Data is the founding element of any application related to ML. Acquiring large amounts of data is one of the main concerns for top-companies today.
Data Processing
All the techniques with which data are processed in order to best adapt them to the ML model we plan to develop.
Model
This is the main core of AI systems. A model can be seen as a set of mathematical and statistical techniques able to learn from a certain distribution of data provided as input and to generalize to new data.
Prediction
It can take many forms depending on the application developed. It is the output of the model and it is important to evaluate the effectiveness of the developed system.
Historical Evolution of AI
To understand why AI is so important today, we have to analyze the past.
In 1950 the enthusiasm for AI began:
- Turing Test: "Can machines think?"
- 1954: one of the main experiments in machine translation
- 1955: Arthur Samuel wrote a program that could play checkers very well
- 1957: Rosenblatt invented perceptrons, a type of neural network
First AI Winter - promises of AI were exaggerated
In 1980 the Boom times occurred:
- Commercialization of new AI Expert Systems capable of reproducing human decision-making, through "if-then-else" rules
- Financial planning, medical diagnosis, geological exploration, and microelectronic circuit design
Second AI Winter - many tasks were too complicated for engineers
In 2012 the Deep Learning revolution took place
- Solved mathematical problems
- New powerful Neural Networks
- Huge improvement with the computational power
- Introduction of GPUs
Problem with data
- AI models need huge amount of training data
- Currently, we are able to:
- Acquire a lot of data (IoT)
- Store huge amount of data (improved storage)
Today, the question is not if we are able to collect data, but if we are able to use them.


Ended: Machine Learning
Data Acquisition and Processing ↵
Data Acquisition and Processing
Data acquisition and processing are the first steps in the Machine Learning pipeline. This is one of the most important steps for many companies, but acquiring data is a time-, investment- and knowledge-intensive process.
Big Data: having large amounts of data available has been one of the reasons for the strong development of machine learning. We are able to collect large amounts of data thanks to new storing devices and process digitalization.
Data Acquisition
Data acquisition is the first step in developing a machine learning system. We can get data mainly in two ways:
- By using publicly available data (quality must be checked)
- By acquiring a new set of specific data (generate specific expertise for the company)
- It is not certain that public data well represent the problem we want to solve
- We are forced to acquire data that, due to their sensitive nature, would not otherwise be available (privacy issues)
- The company we work for already has a data collection process that we can use
Public Datasets
Many Universities publicly release their datasets. There are no requirements related to profit or non-disclosure agreement. It is a consolidated practice in the world of research to share data to test the reproducibility of the results obtained.
Acquisition of a new dataset
Acquiring a new dataset is usually a costly process (time and money).
- Program acquisition tool
- Handle large amounts of data
- Test to find bugs
- New hardware
It is necessary to carefully consider whether it is appropriate to acquire a new dataset.
Acquiring a new dataset does not mean acquiring only new data
Data Annotation
It is one of the most relevant aspects of the data acquisition phase. It regards the semantic content of the data, and the label depends on the problem we want to solve.
It can be numerical or categorical and associates a label to data.
Data collection without correct and timely annotation is often useless. However, it is also possible to extract knowledge from un-annotated data through clustering.
Data Annotation Process
The data annotation process can take place in several ways:
- Manual (long and expensive but the quality of the annotations is usually controllable and high)
- Automatic (each data is automatically annotated using specific tools)
- Third parties (all data is annotated by a third party)
- Free of charge (free use of a platform in exchange for annotated data)
- Paid (purchase annotation time from third parties, usually from developing countries)
Closed Set: the pattern to be classified belongs to one of the known classes
Open Set: you do not know all the possible annotations, so the pattern to be classified can belong to one of the known classes or to none of them. You can define a threshold above which a pattern is assigned to a known class.
Problems in Data Acquisition
Companies usually face common problems:
- The business process produces huge amounts of data (it is impossible to acquire all the data due to physical limitation)
- Sometimes companies have a lot of old data in their databases
- In many business processes it is unclear which data can be collected or which data is really useful for the business
Data Types
In general, there are 4 types of data:
- Numerical:
- Values associated with measurable characteristics
- Continuous (subject to ordering)
- Representable as numerical vectors
- Categorical:
- Qualitative characteristics
- Presence or absence of a characteristic
- Sometimes subject to sorting
- Widely used in Data Mining
- Sequences:
- Sequential patterns with spatial or temporal relationships
- With a variable length
- Position in the sequence and relationship with predecessors and successors are important
- Structured data:
- Outputs organized in complex structures such as trees and graphs
Images
An image is a matrix of values in which each cell is referred to as a pixel and each pixel contains a brightness value.
In color images, each pixel contains 3 values that represent the color components, referred to as channels. The content of the channels is related to the color space (the convention used to define colors).
- RGB color space: the 3 values indicate the intensity of the three components (Red, Green and Blue); see the sketch below.
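A minimal sketch of this representation with NumPy; the image sizes and random values are purely illustrative:

```python
import numpy as np

# Grayscale image: a 4x4 matrix where each cell (pixel) holds a brightness value
gray = np.random.randint(0, 256, size=(4, 4), dtype=np.uint8)

# RGB color image: each pixel holds 3 values, one per channel (Red, Green, Blue)
rgb = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)

print(gray.shape)  # (4, 4)
print(rgb.shape)   # (4, 4, 3)
print(rgb[0, 0])   # the three channel values of the top-left pixel
```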
Image formats

Data Preparation
Once obtained data for our Machine Learning system, it is necessary to prepare them.
They are organized as follows:
- Training set (data on which the model automatically learns during the learning phase)
- Validation set (part of the training set in which hyper-parameters are tuned)
- Test set (data on which the model is tested during the testing phase in order to measure its effectiveness through qualitative and quantitative numerical measures); see the split sketch below
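A hedged sketch of how such a split could be obtained with scikit-learn's `train_test_split`; the Iris dataset and the 60/20/20 proportions are illustrative choices, not course requirements:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split off the test set (20% of the data, illustrative choice)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then carve the validation set out of the remaining data (25% of 80% = 20%)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```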

Deployment
Once the previous phases have been completed, the ML system can be released for its effective use.
Normally, once the model is released, it no longer goes through training and testing phases.
Different ways to train-val-test
We can identify alternative approaches, adopted by choice or imposed by the context:
- Batch: the training is carried out only once on a given training set.
- Incremental: following the initial training, further training sessions are possible
- Natural: this is the closest case to the human learning process
Different Ways of Learning
Not all data is always annotated. Depending on whether they are annotated, we can define different types of learning:
- Annotated data -> Supervised Learning
- Not annotated data -> Unsupervised Learning
- Partially annotated data -> Semi-supervised Learning
Specific algorithms correspond to each of these areas. Usually, the presence of annotations helps and simplifies the development of ML algorithms. The best performances are usually obtained with supervised trained algorithms.
Ended: Data Acquisition and Processing
Model ↵
Model
The model is the heart of the AI in our system. It is one of the most delicate and decisive elements of the entire process:
- Model -> mechanism with which input data are transformed in outputs
Machine Learning Tasks
There are different tasks in ML depending on the output we want:
- Classification
- Regression
- Clustering
Classification
- Given a specific input (pattern), the model (classifier) outputs a class
- If there are only 2 classes, we call the problem *binary* classification, while with multiple classes, we have *multi-class* classification
class = a set of data having common properties
The concept of label and semantic is related to the concept of class, since it strictly depends on the working context.
Examples of classification:
- Spam detection
- Input: email texts
- Output: yes/no (spam)
- Credit card fraud detection
- Input: list of bank operations
- Output: yes/no (fraud)
- Face recognition
- Input: images
- Output: identity
- Medical diagnosis
- Input: x-ray images
- Output: benign/malignant (tumor)

Regression
Given a specific input, the model (regressor) outputs a continuous value (data -> value). You can see a regression task as a classification task with a very high number of classes; a sketch follows the examples below.
Examples of regression
- Estimation of a person's height based on weight
- Estimated sale prices of apartments in the real estate market
- Risk estimation for insurance companies
- Energy prediction produced by a photovoltaic system
- Health costs prediction models
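A hedged sketch of a regression task with scikit-learn's `LinearRegression`; the weight/height pairs are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical (weight in kg, height in cm) pairs: the regressor outputs a continuous value
weights = np.array([[55], [62], [70], [80], [90]])
heights = np.array([160, 168, 172, 178, 185])

regressor = LinearRegression().fit(weights, heights)

# Predict the height of a new person from their weight
print(regressor.predict([[75]]))
```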

Clustering
Identify groups (clusters) of data with similar characteristics, usually applied in an unsupervised learning setting (patterns are not labeled and classes are not known in advance).
Usually, the unsupervised nature of the problem makes it more complex than classification.
Examples of clustering
- Marketing (user groups)
- Genetics (group by DNA)
- Bioinformatics (partitioning of genes)
- Vision (unsupervised segmentation)

Artificial Vision
For the artificial vision domain, we can identify even more specific problems.

Pattern Recognition
Pattern recognition is the discipline that studies the recognition of patterns (data) even with pre-programmed algorithms (not able to learn automatically)
The model is a set of hand-crafted instructions.
This technique is similar to the Expert Systems developed in the '80s, which were a first form of artificial intelligence. The ability of a computer to perform calculations on large amounts of data is exploited.
The programmer develops a series of instructions to solve a specific problem:
- These instructions are typically based on if-then-else statements
- Strong a priori knowledge of the problem is required
Problems that can be faced with explicitly programmed instructions:
- The conditions are stable and known a priori (constrained industrial environment)
- There are mathematical formulas to model the problem
- The problem must be limited in dimensionality and not too complex

Explicitly Programmed Instructions
General and technical considerations about pattern recognition:
- It can achieve a high degree of success if the a priori knowledge is adequate, dimensionality of the problem is limited and the test domain is similar to what was assumed when defining the instructions
- The developed solution will inevitably be specific
- There is not a real learning phase
- It is possible to understand why the developed system fails in classification
- If the problem becomes complicated, the programming time increases
- The code risks becoming unmanageable due to:
- High complexity and number of nested statements
- Length of code
- Too specific functions
Limits of programmed instructions

Solving these problems with instructions is very complex. The level of generalization of the proposed solution would be very limited.
It is necessary, when needed, to address the problems with other paradigms.
Ended: Model
Classification ↵
Classification with Machine Learning
How can machine learn?
The key element is data: machine can learn from data. This is the reason why data is so relevant today.
Currently, we can program machines that imitate the way humans learn. Humans learn in many different ways, and with machines we imitate just one of them.
Learning from data is similar to a human learning to play a musical instrument:
- Observe how a chord is created (annotated data)
- Repeat the chord (iterative learning process)
- Feedback (loss function)
From a practical point of view, these are required steps:
- We get the annotated data
- We pre-process data (make them suitable for the algorithm)
- We iteratively train a classifier
- We measure the performance of the implemented solution
Machine Learning refers to the discipline that aims to develop systems able to automatically learn from (training) data and to generalize the knowledge on new (testing) data.
A machine learning model makes predictions without being explicitly programmed to do so.
Thanks to machine learning we can avoid complex operations of writing predefined instructions to solve a specific problem.
Support Vector Machines (SVMs)
It is a supervised learning method used for classification, regression and outlier detection.
It is effective with high-dimensional inputs and remains effective in cases where the number of dimensions is greater than the number of samples.

We have points with two dimensions (x, y) belonging to two patterns (orange and blue).
SVM identifies a hyperplane that divides the points into two groups.
Hyperplanes are decision surfaces: there can be infinitely many possible solutions, but SVM finds the optimal one.
Support vectors: data points that lie closest to the decision surface
- Data points most difficult to classify
- Directly influence the optimum location of the decision surface
- They are the element of the training set that would change the position of the dividing hyperplane if removed
- SVMs maximize the margin between support vectors
- The decision function is fully specified by a subset of training samples, the support vectors
- This becomes a quadratic programming problem that is easy to solve by standard methods
What if patterns are not linearly separable?
The idea is to still obtain a linear separation by mapping the data to a higher dimensional space. The mapping procedure is realized through a kernel function.
If we have more than two classes, we can adopt two solutions:
- One-Against-One: classifiers trained on all possible class couples
- One-Against-All: one SVM trained for each class (the SVM that has the better margin decides the final class)
Linear and Non-linear Kernel
- If the dimensionality of the space is very high, linear SVM is generally used
- For low dimensionality, the primary choice is non-linear SVM with RBF kernel
- For medium dimensionality both types are generally tried
Remember: the hyperparameters are calibrated on a separate validation set or through cross-validation. A sketch of an SVM classifier follows.
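A minimal, hedged sketch with scikit-learn's `SVC`, comparing a linear and an RBF kernel on a toy non-linearly separable dataset; the dataset and the values of `C` and `gamma` are illustrative and would normally be tuned on a validation set or via cross-validation:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two non-linearly separable classes (illustrative toy data)
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

print("linear:", linear_svm.score(X_test, y_test))
print("rbf:   ", rbf_svm.score(X_test, y_test))
print("support vectors (rbf):", rbf_svm.support_vectors_.shape)
```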
Decision Trees
Tree-like model to perform the classification. They are commonly used in operational research, specifically in decision analysis.

If we add a class, we need to add another decision node.
Decision Tree Training
The root node:
- We want a decision that makes a good split (separating classes as much as possible)
- Quantify a good split by using a measure (Gini index, entropy ..)
- Different possible algorithms that recursively evaluate different features and use at each node the feature that best splits the data
The second node:
- Let's go to the left branch
- We use only data that belong to the left branch
- We do the same thing we did in the root node
- We apply this procedure to all the other nodes
We stop the training when the selected measure no longer improves after some iterations. A sketch follows.
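A hedged sketch of training a decision tree with the Gini index in scikit-learn; the Iris dataset and the depth limit are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Each node splits on the feature that best separates the classes (Gini index here)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)

# Inspect which features end up near the root
print(export_text(tree, feature_names=load_iris().feature_names))
```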
Ensemble Methods
A multi-classifier is an approach where several classifiers are used together, either in parallel or in cascade.
It has been shown that the use of combinations of classifiers can strongly improve performance. The combination is effective only when the individual classifiers are independent. Unfortunately, it is very difficult to have real independence between classifiers.
Two approaches:
- Bagging: I train different classification algorithms on different portions of the training set
- Boosting: I train different algorithms on incorrectly classified patterns
How to merge the decisions of the individual classifiers:
- Decision level
    - Majority vote rule (each classifier votes for a class and the pattern is assigned to the most voted class)
    - Borda count (each classifier produces a ranking of the classes, the rankings are converted into scores and the class with the highest final score is the one chosen)
- Confidence level
    - Each classifier outputs a confidence value, and these values are merged
    - Weighted sum (the confidence values are summed, weighting the different classifiers according to their degree of skill)
    - The sum is often preferable to the product as it is more robust (in the product it is sufficient that a single classifier indicates zero confidence to bring the confidence of the multi-classifier to zero)
Random Forest - based on Bagging
The single classifier on which random forest is based is the decision tree (hundreds or thousands of DT).
In random forests, we have two types of bagging:
- Data Bagging (RF repeatedly selects a random sample with replacement of the training set and fits trees to these samples)
- Feature bagging (in each decision node, the choice of the best feature on which to partition is not made on the entire set of d features)
The final decision is taken upon the majority vote rule.
Adaboost - based on boosting
Several weak classifiers are combined to obtain a strong classifier. Differently from bagging, there is an incremental learning phase: at each step a weak classifier is added. A sketch of both ensemble methods follows.
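A hedged sketch comparing a bagging-based Random Forest and a boosting-based AdaBoost in scikit-learn; the dataset and the number of estimators are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: many decision trees on bootstrap samples, with feature bagging at each node
rf = RandomForestClassifier(n_estimators=200, random_state=0)

# Boosting: weak classifiers added incrementally, focusing on misclassified patterns
ada = AdaBoostClassifier(n_estimators=200, random_state=0)

print("Random Forest:", cross_val_score(rf, X, y, cv=5).mean())
print("AdaBoost:     ", cross_val_score(ada, X, y, cv=5).mean())
```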

Feature Description
The learning phase is complex with high-dimensionality data as images. For instance, what if in input we have RGB images?
Feature extraction refers to the process of extracting features from data. A feature vector is an n-dimensional vector of numerical values that represents (in a discriminative way) the object used as input data.
Example of features:
Object: geometric shape
- Data: array of values
- Features: subset of coordinates or a new value that we can compute from coordinates
Object: image
- Data: matrix of values
- Features: subset of pixels or a new value that we can compute from pixels
Histogram of Oriented Gradients (HOG)
A visual feature descriptor that can describe the shape of an object. HOG provides the edge directions (see the sketch after this list):
- The whole image is divided into smaller regions
- For each region, the edge directions are calculated
- Edge: curves at which the brightness changes sharply
- Direction: angle and magnitude of edges
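A minimal sketch of HOG feature extraction, assuming scikit-image is available; the sample image and parameter values are illustrative:

```python
from skimage import color, data
from skimage.feature import hog

# Grayscale version of an example image bundled with scikit-image
image = color.rgb2gray(data.astronaut())

# Divide the image into cells, compute edge directions per cell, build histograms
features = hog(image,
               orientations=9,
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

print(features.shape)  # a 1D feature vector describing the object's shape
```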

Local Binary Pattern (LBP)
A visual feature descriptor that can describe the texture of the surface (visual surface appearance).
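Similarly, a hedged sketch of an LBP texture descriptor with scikit-image; the sample image, radius and histogram binning are illustrative choices:

```python
import numpy as np
from skimage import data
from skimage.feature import local_binary_pattern

image = data.camera()  # a grayscale example image bundled with scikit-image

# Compare each pixel with its neighbours to encode the local texture
radius, n_points = 1, 8
lbp = local_binary_pattern(image, n_points, radius, method="uniform")

# The texture descriptor is usually the histogram of the LBP codes
hist, _ = np.histogram(lbp, bins=n_points + 2, range=(0, n_points + 2), density=True)
print(hist)
```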

Metrics
The prediction of the system is an extremely important step. It allows you to calculate the performance of the system through metrics, to understand if the system responds correctly with respect to what was designed and desired.
The computation of metrics is also linked to the achievement of certain contractual obligations.
System Performance with Classification
Generally, it is preferred to use a measure linked directly to the semantics of the problem. The metric examines the prediction of the model and the label provided in input (GroundTruth and prediction).
Accuracy = # patterns correctly classified / # total patterns
Confusion Matrix
The confusion matrix is very useful in multi-class classification problems to understand how errors are distributed.
- Rows: classes of GT
- Columns: predicted classes
A cell (r, c) shows the percentage of cases in which the system predicts class c for ground-truth class r.
Ideally, the matrix should be diagonal; high off-diagonal values indicate concentrations of errors.

In the example, the class "0" is often confused with "6".
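A hedged sketch of computing accuracy and a confusion matrix with scikit-learn; the ground-truth and predicted labels are toy values:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical ground truth and model predictions for a 3-class problem
y_true = [0, 0, 6, 6, 6, 3, 3, 0, 6, 3]
y_pred = [0, 6, 6, 6, 0, 3, 3, 0, 6, 3]

print(accuracy_score(y_true, y_pred))    # correctly classified / total patterns
print(confusion_matrix(y_true, y_pred))  # rows: ground truth, columns: predictions
```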
Classification with Deep Learning
Deep learning is a discipline, similar to ML, that allows you to avoid the problematic phase of feature extraction with high-dimensional input.
It is based on neural network (NN) classifiers. The key idea is to imitate, as far as possible, nature (the neurons in the human brain).

The first artificial neuron was introduced in 1943 by McCulloch and Pitts. In an artificial neuron there are (see the sketch after this list):
- Inputs (digital numbers)
- Inputs are weighted (not all inputs are equally important)
- Inputs are merged with a sum function (plus a bias)
- An activation function is used to generate the final output
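A minimal NumPy sketch of a single artificial neuron; the sigmoid activation and the specific weights are illustrative choices:

```python
import numpy as np

def sigmoid(z):
    # Activation function: small output for small inputs, close to 1 above the threshold
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # inputs (digital numbers)
w = np.array([0.8, 0.1, -0.4])   # weights (not all inputs are equally important)
b = 0.2                          # bias

# Weighted sum of the inputs plus the bias, passed through the activation
output = sigmoid(np.dot(w, x) + b)
print(output)
```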
Activation Function
The activation function defines the output of a node given an input or set of inputs. It is a sort of switch for the artificial neuron: it outputs a small value for small inputs and a larger value if the inputs exceed a threshold.
Linear vs Non-linear problems
A single artificial neuron can solve only linear problems.
The solution is to use more artificial neurons organized in different layers (Multi Layer Perceptron). This is not easy, as it introduces several mathematical problems and greatly increases the computational load.
Artificial Neural Networks
Groups of artificial neurons are organized in different layers: neural networks.
Typically, they present:
- An input layer (input of the network)
- An output layer (output of the network)
- One or more hidden layers
Each neuron is fully connected with those of the next level. Again, we try to imitate the hierarchical nature of our neurons:
- We have only ten levels between the retina and the actuator muscles.
- Otherwise, we would be too slow to react to stimuli.
ANN Typologies
- Feedforward (FF): the connections link the neurons of one level with the neurons of the next level.
- Connections to the same level and backward connections are not allowed.
- It is by far the most used type of network.
- Recurrent: feedback connections are expected (towards neurons of the same level but also backward).
- More suitable for the management of sequences because they have a short-term memory effect
Neural Networks Training
General considerations about NN layers:
- Greater number of hidden layers (and neurons) -> better performance
- Greater number of hidden layers (and neurons) -> need for more training data
- Greater number of hidden layers (and neurons) -> greater computational load
How is it possible to train a neural network?
Training a neural network is extremely complicated, but we can use specific frameworks:
- PyTorch
- TensorFlow
- Mxnet

Inside the neural network, we can change the weights applied to each input.
Training a NN means minimizing the loss function. The cost function:
- It is a mathematical formulation of the learning goal
- It measures the error between the prediction and the ground truth
- Presents the performance in the form of a single real number

How to minimize the loss? -> adjusting the weights and the bias of every neuron
We can change weights and bias following gradients (the derivative of a function measures the sensitivity to change of the function value with respect to a change in its argument).
We can minimize the loss function with a gradient descent approach that adjusts the weights in the following manner:
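The update rule itself is shown in the course figure; as a hedged illustration of the idea, a NumPy sketch of gradient descent on a single linear neuron with a mean-squared-error loss, where the data, the learning rate and the number of iterations are illustrative assumptions:

```python
import numpy as np

# Toy data for a single linear neuron y = w*x + b (illustrative example)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])   # true relation: w=2, b=1

w, b, lr = 0.0, 0.0, 0.1             # initial parameters and learning rate (eta)

for _ in range(200):
    y_pred = w * x + b
    error = y_pred - y
    loss = np.mean(error ** 2)       # error between prediction and ground truth
    # Gradient descent update: parameter <- parameter - eta * dLoss/dParameter
    w -= lr * np.mean(2 * error * x)
    b -= lr * np.mean(2 * error)

print(w, b, loss)  # w and b approach 2 and 1 as the loss shrinks
```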


Convolutional Neural Networks
CNNs are particular Neural Networks specifically designed to process images. Instead of using a flat structure, they arrange neurons in three dimensions.

How is it possible to connect a 3D input with a kernel?
To do so, we need a mathematical operation called convolution. The convolution is the core building block of convolutional neural networks.
Each kernel is convolved with the input volume thus producing a 2D feature map. With this process, one feature map for each kernel is produced.
We usually have more than one kernel, as the desired output depth is usually larger than 1. The output volume is then built by stacking all the activation maps one on top of the other (see the sketch below).
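A hedged sketch with TensorFlow/Keras (one of the course libraries) showing how a convolutional layer turns a 3D input volume into a stack of feature maps, followed by a pooling layer that subsamples them; the shapes and number of kernels are illustrative:

```python
import numpy as np
import tensorflow as tf

# A batch of one RGB image, 32x32 pixels, 3 channels (illustrative shape)
images = np.random.rand(1, 32, 32, 3).astype("float32")

# 8 kernels of size 3x3: each kernel is convolved with the whole input volume
conv = tf.keras.layers.Conv2D(filters=8, kernel_size=3, padding="same", activation="relu")
pool = tf.keras.layers.MaxPooling2D(pool_size=2)   # pooling spatially subsamples the maps

feature_maps = conv(images)
subsampled = pool(feature_maps)

print(feature_maps.shape)  # (1, 32, 32, 8): one 2D feature map per kernel
print(subsampled.shape)    # (1, 16, 16, 8)
```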

CNN Architecture
In multi-layer architectures we have a flat structure with different layers.
With CNNs, we have a stack of layers one on top of the other:

Differently from HOG, CNNs automatically learn how to extract features, so there is no need to manually specify the descriptor parameters.
Convolutional layers learn to extract various types of visual information in a hierarchical manner:
- In the layers close to the input, CNNs learn filters that extract 'simple' visual information.
- In the deeper layers, the filters extract semantically complex visual information.
The interesting thing is that this mechanism seems similar to what happens in our brain, where the visual cortex processes information by different layers.
Pooling Layer
Pooling layers spatially subsample the input volume (reduce the input size). They are widely used for a number of reasons:
- Gain robustness to exact location of the features
- Reduce computational cost
- Help preventing over-fitting
- Increase receptive field of following layers
Other Layers
Activation layer: activation function used with neural networks.
Flatten layer: usually exploited to connect the 3D feature extractor to the 1D classifier (like an MLP with an unrolled image).
It is possible to build a custom architecture, but it is very complex, so we will use already implemented ones (see the sketch after this list):
- AlexNet
- ResNet
- VGG
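A hedged sketch of loading one of these ready-made architectures through Keras applications (ResNet50 here); it assumes the pretrained ImageNet weights can be downloaded, and the random input image is only there to show the expected shapes:

```python
import numpy as np
import tensorflow as tf

# Load a ResNet architecture with weights pretrained on ImageNet
model = tf.keras.applications.ResNet50(weights="imagenet")

# Classify a random (meaningless) image just to show the expected input/output shapes
img = np.random.rand(1, 224, 224, 3) * 255
img = tf.keras.applications.resnet50.preprocess_input(img)

preds = model.predict(img)
print(preds.shape)  # (1, 1000): probabilities over the ImageNet classes
```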
Train CNNs
CNNs can be trained in different ways:

Machine Learning vs Deep Learning

The superiority of DL approaches compared to other ML algorithms manifests itself when large (huge) quantities of training data are available.
The training of neural networks requires specialized hardware:
- Before starting a project with DL, you need to ask if the company / lab has the necessary hardware
- Having one or more GPUs available is a fundamental factor today (GPUs are essential for parallelizing calculations)
- The deeper a network is, the more computational load is introduced
Hardware purchase for DL
With in-house solutions, the company buys the necessary hardware and is the direct owner:
- Pros:
- Extreme freedom of use of hardware
- In the long run, it tends to have lower costs
- Cons:
- Hardware maintenance is required (specialized technicians)
- Hardware ages quickly
- For large number of GPUs -> specific server rooms (with high energy consumption)
- The GPU market is quite expensive and volatile
With external solutions, the hardware is rented through the PaaS paradigm (Cloud).
- Pros:
- Hardware maintenance is not required
- No investment over time is required for hardware upgrades
- Dedicated server rooms are not required, energy consumption is not borne by the company
- Cons:
- In the long run, it tends to have higher costs
- Vendor lock-in
- We do not really know who the owner of the data is
- Privacy issues

Ended: Classification
DM - LAB ↵
Weka
Weka is an open-source software for Data Mining and Machine Learning written in Java, distributed under the GNU public license.
It includes four applications:
- Explorer - we will use explorer
- Experimenter
- Knowledge Flow
- SimpleCLI
The main file format used in Weka is ARFF (attribute-relation file format): a comma-separated data format preceded by a header that declares the relation and its attributes.
Weka files store relations, attributes and values.
% (the relation name below is illustrative; an ARFF file must declare one)
@relation heart
@attribute age numeric
@attribute sex {female, male}
@attribute cholesterol numeric
@data
63,male,233
67,male,286
Bank Data
Attributes:

PEP class (Personal Equity Plan)
Pre-Processing Bank Data
- Load the file and save it in ARFF format
- Carry out a visual analysis of the dataset
- Drop the ID attribute
Which attribute is more relevant for our analysis?
- Sex (no relevant difference)
- Age (not as relevant as income, but there is a trend)
- Married (relevant)
- Children (linear correlation, the first column does not respect the trend)
- Income (normal distribution)
- The higher the income, the higher the probability to buy PEP

Visualize the Plot Matrix

The higher the number of children, the lower the income as children cost a lot of money.
People without children are not interested in PEP as they do not need to think about the future.
Classify
Use the following algorithms and evaluate the result:
- J48
- J48 (without post-pruning)
- Jrip
- IBk (with k=1 and k=5)

The main variables are children and income (closest to the root).
KNN with k = 1 evaluated on the training set has an accuracy of 100%, because the closest point to each sample is the sample itself, so we drop this evaluation.
With IBk we are using a distance function so numbers should be discretized.
Also, irrelevant and replicated attributes can distort the result.
- In this case, the most likely irrelevant attributes are sex and car

We can drop the irrelevant attributes to increase accuracy.
Census Data
Identify relevant attributes:
- Capital gain/loss are not relevant because most of their values are zero
- Age (we expect it to be linear but the majority of >50k is in the range of 40 years)
- Work class is unbalanced (the majority of people work in the private sector), coverage is very low
- Education number
- Marital status (interesting, married is different)
- Occupation (interesting)
Education and education-num are perfectly correlated (it is duplicated).

We can enhance accuracy by replacing missing values:
Preprocess -> Filter -> unsupervised -> attribute -> ReplaceMissingValues
Apriori Algorithm
We apply discretization and perform a manual analysis of the data in order to identify any correlation between pairs of attributes.

Clustering with Weka
During the pre-processing part, the first thing to do is to normalize the dataset (unsupervised filter), because clustering algorithms need a distance measure.
Then, we need to select the K-Means parameters:
- DisplayStDev: shows the standard deviation of the distances of individual points from the cluster center. The measure is reported separately for each attribute.
- Distance Function: distance function used in the calculation
- Maxiteration: maximum number of iterations to achieve convergence
- NumCluster: value of k
- Seed: random seed used for choosing the initial centroids

We set the number of clusters to 3:

If we try with two clusters, the squared error is high (12.34), while with 4 clusters, the squared error is about 5.

By running the analysis with class to cluster evaluation, clusters are created based on their size.
We can improve the model by running the system in a two-dimensional space, selecting the two attributes with the lowest standard deviation.
Food Nutrient Dataset
The dataset is composed of 25 foods with their nutritional information based on the following KPIs:
- Energy
- Protein
- Fat
- Calcium
- Iron
The goal is to distinguish, based on the data, which ones are meat and which ones are fish.
We run the cluster analysis using k-means:

With two clusters, we can easily distinguish which cluster regards meat and which one is for fish.
With three clusters, C0 remains unchanged while characterization between C1 and C2 is relevant only for proteins.
Ended: DM - LAB
ML - LAB ↵
Laboratory
- Programming is done in Python
- We will use Colab as a programming tool
- We can use any IDE (Visual Studio, PyCharm)
Guide
- Do not execute code without understanding it
- It is necessary to play a lot with the code to become familiar
- Bugs can be very informative
Colab
- Free Google service
- Easy to use, the environment provided already has most of the resources installed
- The code runs in the browser (VM)
- Each assigned virtual machine has variable hardware equipment
- CPU and RAM available
- GPU resources
Ended: ML - LAB
ML - Seminar ↵
AI Solutions for Real-World Business
Presented by: Cosimo Fiorino - Head of Data Science
c.fiorini@ammagamma.com
The team of Ammagamma is diverse, composed not only of engineers but also of philosophers and designers.
Their vision is to develop a society aware of the potentialities, implications and impacts of technology.
As a mission, they offer the best choice instruments through the development of artificial intelligence innovative solutions.
They master cutting-edge technology to make company processes easier, curating the whole life cycle.
AI solutions implemented:
- Scheduling and optimized planning of production and organizational processes
- API for the forecast of future trends and historic series
- Data enrichment for the optimization of advanced analytics solutions
- Warehouse forecasting and replenishment for the predictive management of the warehouse
- Georouting API engine for the optimized planning of complex and bound scenarios
- Intelligent document processing platform enabling classification and extraction of the information
Shared Knowledge in the AI World
They develop educational projects in the intersection between the culture of data and humanistic thought, to lead people to the discovery of new growth frontiers.
Also, they devise new innovation paths that create a long-term value for the community and the territory.
We use numbers to transform chaotic systems into organized, intuitive and verifiable models
- Descriptive Algorithms: description of the variables that characterize a phenomenon to create its mathematical model (clustering, market basket analysis, BI systems)
- Predictive Algorithms: Information analysis to predict and control future behaviors of the phenomena (forecasting, regression, classification)
- Prescriptive Algorithms: development of prescriptive formulas to support and guide process management (optimization, optimal control, next best action)
AI Projects Development
- Define the objectives and the context
- Feasibility analysis and solution design
- Proof of concept
- Pilot
- Deploy
- Scale up
- Post go-live and monitoring
What can go wrong?
DATA:
- Quality
- Null values
- Legal problems
- Security
- Dimension
COMPLEXITY AND BUSINESS CONSTRAINTS:
Constraints related to third parties in the business.
ADOPTION:
In the post-deployment phase, the innovation department develops innovative solutions, but the target department does not use them because it is skeptical.
Design Thinking
An approach to innovation that studies the adoption of a creative view to solve complex problems.

Make sure that when you build a model you will be able to reproduce it in the future and maintain it properly.
Data Ethic

Demand Forecasting & Inventory Optimization
In the food processing sector for the supply process of canteens, restaurants and indirect channels, it is extremely important to estimate the food products that will be sold to ease the order of raw materials necessary for production.
The target is to improve demand forecasting accuracy, reduce total stock level in the warehouse and reduce the total amount of wasted food.
Solutions:
- Demand forecasting algorithms adaptable to market movements and new products quickly
- Optimization models which, according to the predicted demand, suggest the best list of products
Marketing Management Optimization
Difficulty in managing and synchronizing the multiple campaigns aimed at customer up-/cross-selling, retention and/or acquisition.
The approach typical of product-centric strategies, dictated by objectives imposed on individual campaigns without considering their synergistic interaction, ends up generating excessive push communication towards the customer.
Recommender Systems
The goal is to find the perfect match between customers and products in order to suggest the right product at the right moment.
Item-based filtering
I suggest you product x because it is similar to the product you bought. The implementation needs a vectorial representation of each product in order to evaluate their similarity.
Collaborative filtering
I suggest you product x because people similar to you have chosen it. The implementation needs a vectorial representation of each customer in order to evaluate their similarity.
Popularity filtering
I suggest you product x because it is popular right now.
Serendipity effect
Serendipity is an unplanned fortunate discovery.
Hybrid filtering
When a lot of information is available, a combination of the preceding approaches can be used. A sketch of item-based filtering follows.
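A hedged sketch of item-based filtering, where each product is represented by its column of a hypothetical user-item rating matrix and cosine similarity is used to find products similar to the one bought:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical user-item matrix: rows are customers, columns are products
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
])

# Item-based filtering: represent each product by its column and compare them
item_similarity = cosine_similarity(ratings.T)

bought = 0                             # the customer bought product 0
scores = item_similarity[bought].copy()
scores[bought] = -1                    # do not recommend the same product again
print("suggest product", int(np.argmax(scores)))  # product 1, the most similar one
```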
Dynamic Recommender Systems
Recommendation is not only about the right product, but also the right moment to suggest it.
Natural Language Processing
The main concept behind NLP is embedding: representing natural language as vectors in a multidimensional space while maintaining the semantics.
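A toy, hedged sketch of the embedding idea: words mapped to vectors so that semantic similarity becomes geometric closeness; the 3-dimensional vectors are invented for illustration (real embeddings are learned from text and have hundreds of dimensions):

```python
import numpy as np

# Hypothetical 3-dimensional word embeddings (real ones are learned, not hand-written)
embeddings = {
    "loan":     np.array([0.9, 0.1, 0.0]),
    "mortgage": np.array([0.8, 0.2, 0.1]),
    "banana":   np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    # Semantic similarity becomes geometric closeness in the embedding space
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings["loan"], embeddings["mortgage"]))  # high similarity
print(cosine(embeddings["loan"], embeddings["banana"]))    # low similarity
```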
The management of appraisal documents as part of the due diligence processes for non-performing loans involves various resources in the back-office department who are responsible for visually checking the appraisal documents of judicial auctions.
AI Ethics
Gabriele Graffieti - CV Algorithm Engineer at Ambarella
Case Study:
You work in a large company which receives thousands of CVs daily. The openings are many and different from each other. Of course, skimming through CVs requires a lot of time and effort.
Good candidates can be erroneously discarded in this preliminary phase.
The idea is to develop an AI system that analyzes the CVs continuously.
Solution
Use the CVs of the current employees as ground truth data, in order to select candidates similar to the people that already work in the company.
Results
The selected people are very good candidates and the system performs better than our HRs in selecting good candidates.
Are we happy about this system?
This system was used by Amazon and it showed bias against women.
Issues
This problem is not easily detectable in the first place. On the other hand, even if we remove all the gender information from the CVs, the model is still able to infer it.
Are you sure about your data?
- Have you checked the labels?
- Do you know how the data is labeled?
- Do you know who labeled the data?
Tools like Amazon Mechanical Turk can be used to label data, but they raise some issues: for example, emotions can be labelled differently due to cultural differences.
This aspect is particularly relevant with high risk AI applications, like:
- Diagnosis applications
- Control of critical infrastructure
- Law enforcement
- Scoring
- Hiring
Many people say that critical decisions should be taken by humans, however, humans are not perfect.
In the US, the best day to have a trial is Monday after a victory of the local football team
Human decision making is highly affected by mood, personal concerns, stress, level of sleep, affinity with the assessed person, stereotypes and so on.
What about human-AI collaboration?
It seems the perfect solution, but humans can be biased by the belief that AI is always right, so they tend to trust it.
Also, after some time, humans unconsciously trust the AI and are no longer able to spot errors.
On the other hand, what if the AI is right but the human overrides the decision? And what if the AI is wrong but so persuasive that it convinces the human?
Are we sure that we are completely free of biases?
Right now, we have no answer to these questions, partly because this kind of software development is not considered economically valuable.
In order to design safe AI systems, we should follow these steps:
- Alignment
- Robustness
- Corrigibility
One countermeasure would be to use explainable models (deep learning models such as CNNs and other neural networks are not easily explainable).
