Call Quality at Aircall, Part 2: A Case Study on Threshold Optimization for the Quality Indicator

Artur Trofymov
Last updated on January 2, 2024

In our first article, the Aircall Product Analytics team gave a general overview of the call flow at Aircall and listed the call quality indicators along with their main limitations.

The main goal of this article is to define a single technical indicator that describes user satisfaction with call quality as accurately as possible. We correlate MOS (Mean Opinion Score), as a source of technical information, with PCM (Post-Call Modal), the user's rating of a call's quality on a scale from low (1 star) to high (5 stars).

1. First observations about the collected data

At Aircall, we collect historical data on possible call characteristics across all legs (parts of the call) and participants. For the threshold optimization study, we are interested in:

  • the technical VoIP indicator (MOS)

  • the user satisfaction indicator (PCM)

We analyze a sample of call information collected over 30 days. However, we observe several limitations in data availability:

  • reliable information on MOS is available only for the Client leg (representing a part of the call flow)

  • only 0.5% of all calls have user feedback

2. Exploratory Data Analysis: MOS & other technical indicators

For each of the technical indicators (VoIP Call Quality indicators), we use the average value over the call duration. This gives an overall estimate of the call quality indicator without going into finer granularity. For simplicity, we omit the word "average" before the indicator name.

2.1 MOS distribution

Looking closely at our MOS numbers, we are convinced that Aircall's quality is at the forefront of the industry standard. On average, calls at Aircall have a MOS value of 4.34, which corresponds to good quality on the MOS scale. The vast majority (97%) of calls have a MOS greater than or equal to 4.0. Nevertheless, the remaining 3% of calls are of great interest because they are a potential source of information for improvements.

The narrow MOS window from 4.26 to 4.42 contains 78% of all calls. From a statistical point of view, such a highly left-skewed MOS distribution indicates that the data set is imbalanced.

With 97% of calls being above 4, the 5-category MOS scale is unlikely to be meaningful.

2.2 MOS and call quality issues

Before evaluating how accurately MOS describes user satisfaction with call quality, we test how MOS correlates with quality issues flagged by other technical indicators. We define a call as having quality issues when at least one of Latency, Jitter, or Packet Loss exceeds its threshold.

To perform this test, we calculated the share of calls with quality issues per MOS value. The corresponding visualization is given in Figure 1, where the red line represents the percentage of calls with quality issues. It fluctuates between 17% and 38% when MOS is between 4.0 and 4.26 (black dotted delimiter).

Moving from 4.26 to 4.36 (green dotted delimiter), the share of calls with issues falls drastically from 37% to 3.5%. For the rest of the spectrum, where MOS is greater than 4.36, the percentage of calls with issues slowly decreases from 2.3% to 0.2%.

Figure 1: Share of calls with issues per MOS value.

The high instability of the spectrum in the MOS region below 4.26, together with the high percentage of calls with issues, indicates that the industry standard threshold of 4.0 on the MOS scale is not an efficient way to select good quality calls. MOS thresholds from 4.26 to 4.36 show much better sensitivity to quality issues, as the corresponding share of calls with issues drops drastically.

The rate of calls with issues is smallest when MOS is above 4.36, where it averages 1.1%. Therefore, 4.36 can be considered the most efficient alternative to the standard threshold of 4.0 on the MOS scale.
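The per-MOS share plotted in Figure 1 can be sketched as follows. This is a minimal illustration, not the pipeline used for the study: the field names, the toy records, and the threshold values are all hypothetical, since the article does not state the thresholds applied to Latency, Jitter, or Packet Loss.

```python
from collections import defaultdict

# Hypothetical thresholds, for illustration only; the study's values are not given.
THRESHOLDS = {"latency_ms": 150.0, "jitter_ms": 30.0, "packet_loss_pct": 1.0}


def has_quality_issue(call):
    """A call has an issue if any indicator exceeds its threshold."""
    return any(call[name] > limit for name, limit in THRESHOLDS.items())


def issue_share_per_mos(calls):
    """Return {MOS rounded to 2 decimals: share of calls with issues}."""
    totals = defaultdict(int)
    issues = defaultdict(int)
    for call in calls:
        mos = round(call["mos"], 2)
        totals[mos] += 1
        if has_quality_issue(call):
            issues[mos] += 1
    return {mos: issues[mos] / totals[mos] for mos in sorted(totals)}


# Toy records (not Aircall data).
calls = [
    {"mos": 4.10, "latency_ms": 200.0, "jitter_ms": 10.0, "packet_loss_pct": 0.2},
    {"mos": 4.10, "latency_ms": 90.0, "jitter_ms": 12.0, "packet_loss_pct": 0.1},
    {"mos": 4.40, "latency_ms": 80.0, "jitter_ms": 8.0, "packet_loss_pct": 0.0},
    {"mos": 4.40, "latency_ms": 70.0, "jitter_ms": 5.0, "packet_loss_pct": 0.1},
]
shares = issue_share_per_mos(calls)
```

Plotting `shares` against the MOS value would give a curve of the same kind as the red line in Figure 1.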

3. MOS and PCM

To test how the call quality estimate from a technical point of view correlates with the user satisfaction perspective, we compared the MOS indicator with PCM feedback. In the ideal case, all calls reported with a good PCM rating should have a MOS value that corresponds to good call quality.

3.1 PCM distribution

Similar to MOS, the PCM distribution is highly left-skewed: the vast majority (92%) of calls are rated with 4 or 5 stars. In other words, according to users, 92% of calls don't have any quality issues. The remaining 8% of calls are rated with 1, 2, or 3 stars, indicating quality issues.

3.2 Average MOS and PCM

As a first check of the relation between MOS and PCM, we looked at the average MOS per PCM category. The correspondence is summarized in Table 1. Here are a few important observations:

  • the higher the technical quality of the call, the higher the user satisfaction

  • for all five PCM categories, the average MOS value is higher than 4.26 and stays in a good quality region (with respect to industry standard threshold standing at 4.0).

The first observation implies a positive correlation between the average MOS and PCM. The corresponding Pearson correlation coefficient is 0.93, with a p-value of 0.02 at a significance level of 0.05. Nevertheless, this correlation between category averages does not automatically translate into a correlation between MOS and PCM values per call, nor does it imply causation.

The second observation challenges the industry standard MOS threshold of 4.0 for good quality calls. It demonstrates the weakness of the threshold and can lead to situations where calls have quality problems according to PCM while MOS reports a good quality.

PCM    Average MOS
1      4.28
2      4.34
3      4.34
4      4.37
5      4.38

Table 1:  Average MOS value per Post-Call Modal quality.
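The correlation figures quoted in Section 3.2 can be reproduced from the five rows of Table 1. The sketch below uses only the standard library; the function names are ours, and the p-value relies on the closed-form Student t distribution for 3 degrees of freedom (n = 5 points), which is an assumption about how the original test was run.

```python
import math

# Values from Table 1.
pcm = [1, 2, 3, 4, 5]
average_mos = [4.28, 4.34, 4.34, 4.37, 4.38]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def two_sided_p_value_df3(r):
    """Two-sided p-value for r when n = 5, i.e. 3 degrees of freedom."""
    t = abs(r) * math.sqrt(3 / (1 - r * r))
    u = t / math.sqrt(3)
    # Closed-form CDF of the Student t distribution with 3 degrees of freedom.
    cdf = 0.5 + (u / (1 + u * u) + math.atan(u)) / math.pi
    return 2 * (1 - cdf)


r = pearson_r(pcm, average_mos)
p_value = two_sided_p_value_df3(r)
```

With only five aggregated points, this measures how the category averages move together, not how well MOS predicts PCM for an individual call.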

3.3 Thresholds and coarsening of scales

Since both indicators, MOS and PCM, have highly skewed distributions, it makes sense to convert their scales from five levels to two. For PCM, it is natural to merge the 1-, 2-, and 3-star ratings into a single category of bad quality calls, and the 4- and 5-star ratings into a category of good quality calls. You can find more details in the previous article. This also avoids categories with very few observations.

For MOS, we do the same, with the only difference being that there are several candidates for the good quality threshold. The standard threshold of 4.0 showed its weakness in two cases:

  • when we studied the share of calls with quality issues per MOS value using the other technical indicators (as shown in Section 2.2)

  • when we looked at the average MOS per PCM category (as shown in Section 3.2).

Based on the first case, the most robust threshold for separating calls with quality issues from good quality calls is 4.36. This threshold divides the MOS distribution into two parts: 12% of calls with bad quality and 88% of calls with good quality (versus 3% and 97% with 4.0 threshold). To understand the effect of the threshold shift we will test both of them. Each MOS threshold corresponds to a model:

  • Model 4.0 (use of industry standard 4.0 value)

  • Model 4.36 (use of 4.36 threshold).
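The coarsening of both scales can be sketched as below. The function names are hypothetical, and we assume that a call is "good" when its MOS is greater than or equal to the threshold, following the "greater than or equal to 4.0" wording used for the standard threshold; the article does not state how ties are handled at 4.36.

```python
# The two candidate thresholds compared in this article.
MOS_THRESHOLDS = {"Model 4.0": 4.0, "Model 4.36": 4.36}


def pcm_is_bad(stars):
    """Bad PCM quality: 1, 2, or 3 stars. Good: 4 or 5 stars."""
    return stars <= 3


def mos_is_bad(mos, threshold):
    """Bad technical quality: MOS strictly below the threshold (assumed)."""
    return mos < threshold


# The same call can be classified differently by the two models.
example_mos = 4.30
labels = {
    name: mos_is_bad(example_mos, threshold)
    for name, threshold in MOS_THRESHOLDS.items()
}
```

A call with a MOS of 4.30 is good under Model 4.0 but bad under Model 4.36, which is exactly the region where the two models disagree.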

4. Models performance evaluation

After converting MOS and PCM to a categorical scale, describing PCM with MOS becomes a binary classification problem. We therefore treat PCM quality as the "actual" class and the measured technical quality as the "predicted" class.

4.1 Confusion matrix

To summarize how the two models classify calls, we used a confusion matrix. A confusion matrix is a two-dimensional table in which one dimension represents instances of the actual class and the other represents instances of the predicted class. Its aim is to show whether the model is confusing the two classes.

As we saw, both our actual and predicted classes are imbalanced. For imbalanced classification problems, the majority class is usually referred to as the negative outcome, while the minority class is referred to as the positive outcome. Therefore, good quality calls will be referred to as the negative outcome (0), and bad quality calls as the positive outcome (1):

Condition Positive (P / 1): Number of real positive cases (when PCM quality is 1/2/3)
Condition Negative (N / 0): Number of real negative cases (when PCM quality is 4/5)

Following this convention, the confusion matrix elements are defined as:

True Positive (TP): Test result that correctly indicates the presence of a condition (when PCM is 1/2/3 and MOS indicates bad quality)
True Negative (TN): Test result that correctly indicates the absence of a condition (when PCM is 4/5 and MOS indicates good quality)
False Positive (FP): Test result that wrongly indicates that a particular condition is present (when PCM is 4/5 but MOS indicates bad quality)
False Negative (FN): Test result that wrongly indicates that a particular condition is absent (when PCM is 1/2/3 but MOS indicates good quality)

The resulting confusion matrices for models based on MOS with threshold of 4.0 and 4.36 are shown in Figure 2.

Figure 2: Confusion matrices using the MOS threshold 4.0 (left) and 4.36 (right).

We can use these matrices to quantify the predictive power of both models. For an imbalanced dataset, the most informative metrics are those that summarize how well the positive class was predicted:

  • recall, which monitors the true positive rate and is defined as a ratio of true positive cases to all actually positive cases: TP / (TP + FN)

  • precision, which shows the fraction of correctly predicted positive class from all cases predicted as positive class: TP / (TP + FP)

In terms of statistical hypothesis testing, recall is oriented towards monitoring of type II error (FN), while precision monitors type I error (FP). Both recall and precision are of equal interest for us, because FP and FN have the same importance in the correct understanding of call quality.
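The four confusion matrix elements and the two metrics defined above can be computed as in this minimal sketch. The function names and the toy labels are illustrative and not part of the study; `True` stands for the positive outcome (bad quality).

```python
def confusion_counts(actual_bad, predicted_bad):
    """Count TP, FP, FN, TN, with 'bad quality' as the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(actual_bad, predicted_bad):
        if actual and predicted:
            tp += 1  # bad PCM, MOS indicates bad quality
        elif predicted:
            fp += 1  # good PCM, MOS indicates bad quality
        elif actual:
            fn += 1  # bad PCM, MOS indicates good quality
        else:
            tn += 1  # good PCM, MOS indicates good quality
    return tp, fp, fn, tn


def recall(tp, fn):
    """TP / (TP + FN); defined as 0.0 when there are no actual positives."""
    return tp / (tp + fn) if tp + fn else 0.0


def precision(tp, fp):
    """TP / (TP + FP); defined as 0.0 when nothing is predicted positive."""
    return tp / (tp + fp) if tp + fp else 0.0


# Toy labels (not Aircall data).
actual = [True, True, False, False, False]
predicted = [True, False, True, False, False]
tp, fp, fn, tn = confusion_counts(actual, predicted)
```

Returning 0.0 for an empty denominator is a convention chosen here to keep the sketch total; other conventions exist.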

4.2 F1 score

We are equally interested in the contributions of the FP and FN elements of the matrix. Therefore, both recall and precision should be taken into consideration when evaluating model performance. The F1 score is well suited for this goal, as it gives recall and precision the same weight and is defined as their harmonic mean:

F1 = 2 × (precision × recall) / (precision + recall)

From the F1 definition, we clearly see that when precision is equal to recall (a trade-off case), the F1 becomes equal to both of them and is a pure linear function:

Figure 3: F1 function when Recall is equal to Precision.

The minimum possible value for F1 is 0, meaning the worst model performance, and the maximum is 1, the ideal case of perfect modeling.
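The harmonic-mean definition and its boundary cases can be checked with a few lines of code. This is a sketch with a hypothetical function name; returning 0 when both inputs are 0 is a convention we adopt to avoid dividing by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention: worst case when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# When precision equals recall, F1 equals both of them.
balanced = f1_score(0.5, 0.5)

# The harmonic mean is pulled towards the smaller of the two values.
unbalanced = f1_score(0.9, 0.1)
```

Because the harmonic mean stays close to the smaller input, a model cannot reach a high F1 by excelling at only one of precision or recall.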

4.3 Models performance discussion

Following the F1 score definition, we evaluated the performance of the models using the standard MOS threshold of 4.0 and the data-driven threshold of 4.36. The F1, recall, and precision values are summarized in Table 2.

              F1      Recall   Precision
Model 4.0     0.12    0.08     0.28
Model 4.36    0.23    0.34     0.18

Table 2: Models performance summary.

Model 4.36 demonstrates better overall performance, as its F1 is almost twice as high as that of Model 4.0. Model 4.0 has a large imbalance between recall and precision: its recall is 3.5 times lower than its precision.

In practice, this means that Model 4.0 heavily underestimates the share of calls with poor PCM feedback. Of the 8.2% of all calls that received a bad PCM rating, 7.6 percentage points are classified as good quality by MOS (see the left confusion matrix in Figure 2). Model 4.36 has a smaller imbalance between recall and precision, and it goes in the opposite direction compared with Model 4.0: recall is 1.9 times higher than precision. The only advantage of Model 4.0 over Model 4.36 is that its precision is 56% higher.

Summarizing the models performance, we clearly see that Model 4.36 is the better of the two models (F1 being two times higher).
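As a consistency check, the Model 4.36 metrics can be approximately re-derived from shares of all calls quoted in this article: 8.2% of calls have a bad PCM rating, 5.4% have a bad PCM rating but a good MOS (FN), and 13.2% have a good PCM rating but a bad MOS (FP). The variable names below are ours, and because the quoted shares are rounded, the results only match the reported values to about two decimals.

```python
# Shares of all calls, in percent, as quoted in the article for Model 4.36.
bad_pcm_share = 8.2          # actual positives: TP + FN
false_negative_share = 5.4   # bad PCM, MOS indicates good quality
false_positive_share = 13.2  # good PCM, MOS indicates bad quality

# The true positives are the remaining bad-PCM calls.
true_positive_share = bad_pcm_share - false_negative_share

recall = true_positive_share / bad_pcm_share
precision = true_positive_share / (true_positive_share + false_positive_share)
f1 = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.3f} precision={precision:.3f} f1={f1:.3f}")
```

The derived values agree with the reported recall of 0.34, precision of 0.18, and F1 of 0.23 within rounding.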

Summary

In this article, we looked into the example of tuning and testing the MOS thresholds using the confusion matrix technique. The study is based on a 30-day call data set with several constraints:

  • MOS is available only for the Client leg

  • only 0.5% of calls have users' PCM feedback

Consequently, we compared the technical indicator that reflects a part of the call flow, to the user feedback indicator that is affected by the full call flow.

At Aircall, 97% of calls have a MOS greater than or equal to 4.0, which corresponds to good quality on the MOS scale. To find the most effective MOS threshold, the following two models were tested:

  • Model 4.0 with the industry standard MOS threshold of 4.0

  • Model 4.36 with the data-driven threshold of 4.36

The performance of these two models is estimated using the F1 score. Model 4.36 demonstrated 92% better performance than Model 4.0. Its recall and precision are also about twice as well balanced (a ratio of 1.9 versus 3.5), which means a better balance between type I and type II errors. However, the performance of Model 4.36 is still insufficient to accurately describe the PCM quality reported by Aircall customers:

  • 13.2% of all calls have PCM of 4/5 stars but MOS indicates poor quality

  • 5.4% of all calls have PCM of 1/2/3 stars but MOS indicates good quality

Nevertheless, the alternative MOS threshold of 4.36 significantly improves our description of the call quality experienced by our customers. To further improve our understanding of surveyed quality, we will address the limitations on MOS and PCM availability in an upcoming in-depth study.

 


Published on January 2, 2024.
