Perceptro: AI for a better Customer Internet Connectivity Experience
Every interaction, qualitative or otherwise, that the customer has with a service goes towards building the customer’s perception, and when we quantitatively measure customer experience, the results we get are a reflection of this perception.
In the world of Telecommunications, interactions are typically seen as operational: customers walking into a store, calling up a call center, interfacing with our applications and e-shops, etc. However, those are not the points where customers actually use our services; they are the touchpoints where physical interactions occur around our core services. We are the backbone of the internet, and every hit to google.com or medium.com is an interaction that customers are having with our service.
Hence, to provide the best customer experience as a Telecommunications organization, we need to ensure that every hit, every click and every navigation on the internet is seamless, without a delay and without a fault.
There are 2 areas that can be influenced to achieve an always efficient connected experience:
- Intrinsic: Systems and processes inside our network — these are known and directly impactable issues — we understand the environment and the complexity of the issues herein and can directly take actions to resolve them.
- Extrinsic: Usage of our services and the environment within which our devices are installed and operating — there are constant efforts to identify gaps in this area, however, there are limitations to how much we can do, especially to deflect rather than fix.
This was the problem statement we set out to solve: can we remotely take actions to continuously provide a reliable connected experience to our Fixed Services (Broadband / Internet, TV, Voice) customers?
Connectivity Issues lead to “Detractors” and have long term impacts on business performance
Our commitment to working on this as a key area, while logical, is not devoid of quantitative foundations. CX has been an area of avid study and research in the last few years at Telekom. We have brought in platforms focused on increasing the amount of data we collect about customers’ perception of our services. Furthermore, we have our best team members working on understanding this data and finding the key factors that lower customer experience and create “NPS detractors”, i.e. customers who would not highly recommend our services to their family / friends.
In one such study, our team of data scientists made a statement…
Customers who call up our call center to raise an internet connectivity fault have a 3X higher propensity to become detractors.
Combining this with other sources of information:
- Fact: A large part of our customers tend to call us with a technical issue at least once a year.
- Knowledge: The majority of those who face regular issues might not even be calling us. Habit Formation Psychology 101: people develop countless habits as they navigate the world, whether they are aware of them or not.
This is a large problem. On one hand, customers calling us are far more likely to be detractors and are also a large cost center for us. On the other hand, customers who are not calling us are an even bigger problem: we are not even aware which of these customers are facing regular connectivity issues, so we don’t know whom to address. In fact, in most cases, this perception might itself be latent, but it would surely be impacting our business performance.
Vision for Perceptro: A platform to resolve connectivity issues across internet & TV before they happen
We set forth an ambitious vision for ourselves: we aspired to establish systems and processes which could resolve these issues before they happen, thereby ensuring that customers’ services are always up and running with the best possible quality.
To achieve this, we needed to answer 2 questions:
- Who are the customers who are facing a problem? — This was a problem we wanted to solve, however, it is an extremely hard one and I’ve explained more on these reasons later in the post. One thing, though, was clear — we would need an advanced solution to manage a landscape of millions of customers!
- What should we do to remotely resolve problems? — There’s a range of different actions we can execute remotely to resolve connectivity issues — some are free, others are not; some can be purely remote, some require customer’s intervention and knowledge; all require a decision engine to prioritize and orchestrate!
Our vision started forming keeping these concerns in mind:
We did not go ahead and implement this directly, though — we took a calculated approach to first validate and establish the need. As with any product, we had to be careful — before going too big, which would have been expensive and time consuming, we needed to validate our core assumption:
There is a need for an intelligent approach i.e. Can we take simple actions, remotely, across all our customers to enhance customer experience?
Early Approach: Remote Actions on All Customers
As is true for any initiative, we needed a KPI to monitor the performance of our action. Our end goal was to impact NPS, however, since that is a lagging KPI i.e. it takes longer to change perception, we needed something that could be monitored closely in the short term as a preceding indicator of NPS performance. Accordingly, we chose “Faults” (i.e. calls at our call center for technical issues) as the Key KPI to monitor and reduce. Now, we understand that not all customers facing an issue will call us — however, if we are able to reduce issues for those who are recognizing them, we would also be improving experience for those who are not.
Our early approach was to seek out the best possible solution that could help in reducing this problem for us. We convened our best in-house experts and gave them an extremely hard question to answer:
Which action can we take, remotely, without customer intervention or knowledge, to resolve as many problems as possible in a customer’s internet connectivity?
With years of experience in managing large networks, it was clear to us from the beginning — the simple action of rebooting one’s router is sometimes the most effective in resolving issues — it’s like your doctor telling you to go home and sleep off a minor infection!
And that’s what we did — we kicked off an A/B experiment to study the impact of routinely rebooting routers in our network. Now, here, I would like to point out that this was really not a novel idea at our end — other telco companies have done this in the past and have even shared it with us as a best practice. However, our results from this exercise (showcased below) came as a surprise!
Our results: Blindly rebooting customers’ routers was causing more problems than it was solving. We ended up increasing the number of faults from our target segment (i.e. the customers we were rebooting), as opposed to reducing them.
Our takeaway: We need to be intelligent!
While our initial results were unexpected, our perseverance to achieve the outcome has displayed exemplary commitment to our customers!
These initial results validated our initial assumption, i.e. we need to be intelligent about which customers we are targeting. Arbitrarily taking an action on a large set of users probably causes more problems than it solves; hence, we need to be targeted in our approach. We need to come up with a methodology to identify a segment of customers who have a higher propensity of facing significant issues in the near future and focus only on them.
Enter: Artificial Intelligence
Exit: Everything we knew about building and deploying a product!
Any artificial intelligence or machine learning model development lifecycle goes through the same steps, namely,
We chose to start with technical data sourced from the broadband routers deployed in customers’ homes. The scale of this data set is insane! We were collecting these metrics every 15 minutes from each router, so about 96 data points per router per day: a typical time series dataset. As an example, if we consider a user base with, say, 1 Mn routers, we had more than 90 Mn observations every day. Each of these 90 Mn observations comprised 350 different attributes / data points pertaining to technical information about the router and its current performance.
Now, any supervised ML model needs an output label, which, in this case, was “faults”. Considering again the example of 1 Mn routers, the faults recorded were very few: ballpark about 2,000 faults in a day from these 1 Mn router devices. So, our input matrix had 90 Mn observations with only 2,000 positive cases. Hence, our challenge was to train a model which could handle a roughly 1 in 40,000 probability scenario. In layman’s terms, this is better known as…
Finding a needle in a haystack!
This specific scenario, in machine learning terms, is also known as “class imbalance”. Basically, we were trying to categorize a set of technical parameters received (“observations”) from a customer’s router into one of 2 “classes”: will report a fault (class of interest: positive class) OR will not report a fault (negative class). And, in our dataset, the number of observations in the negative class was far, far greater than the number in the positive class; hence, “imbalance”.
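To make the scale concrete, here is a back-of-the-envelope calculation using only the illustrative figures quoted above (1 Mn routers, a reading every 15 minutes, 350 attributes, about 2,000 faults a day). The constant names are ours, and treating every observation of a non-faulting router as a negative is a simplifying assumption.

```python
# Back-of-the-envelope scale of the dataset, using the illustrative
# figures from the text (not real corporate data).
MINUTES_PER_DAY = 24 * 60
SAMPLE_INTERVAL_MIN = 15      # one reading every 15 minutes
ROUTERS = 1_000_000           # example user base from the text
ATTRIBUTES = 350              # attributes per observation
FAULTS_PER_DAY = 2_000        # ballpark faults per day

samples_per_router = MINUTES_PER_DAY // SAMPLE_INTERVAL_MIN
observations_per_day = samples_per_router * ROUTERS
values_per_day = observations_per_day * ATTRIBUTES
observations_per_fault = observations_per_day // FAULTS_PER_DAY

print(samples_per_router)       # 96
print(observations_per_day)     # 96000000
print(values_per_day)           # 33600000000
print(observations_per_fault)   # 48000
```

With the unrounded 96 Mn observations, the ratio comes out at about 1 in 48,000, the same order of magnitude as the rounded 1 in 40,000 used in this post.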
Our first task was to extract relevant information from the 350 different data points contained in each observation.
Our first mistake: This is not a single time series dataset — these are millions of time series datasets!
We started by applying high-level rules to these data points. For example: let’s look at routers of brand X when their free memory drops below 20%; maybe there is a correlation with faults reported? These, of course, did not work (else, I wouldn’t be writing this post 😄).
Accordingly, we started creating features on a per router basis — our critical features consisted of things like changes in the different metrics compared to the time series history of the same router itself. As an example — the moving average of free memory over last 3 days as compared to moving average of free memory over last 10 days.
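A minimal sketch of such a per-router feature, assuming the 15-minute readings have already been aggregated to one free-memory value per day. The function name, window defaults, router ids and readings are all hypothetical and only illustrate the idea of comparing a router with its own history.

```python
from statistics import mean

def moving_average_ratio(daily_values, short_window=3, long_window=10):
    """Ratio of the short-window mean to the long-window mean.

    Both windows end at the most recent day. Returns None when there
    is not enough history or the long-window mean is zero.
    """
    if len(daily_values) < long_window:
        return None
    short_avg = mean(daily_values[-short_window:])
    long_avg = mean(daily_values[-long_window:])
    if long_avg == 0:
        return None
    return short_avg / long_avg

# Hypothetical daily free-memory percentages, keyed by router id.
history = {
    "router-a": [60, 61, 59, 60, 62, 60, 61, 30, 28, 25],  # recent drop
    "router-b": [40, 41, 39, 40, 40, 41, 40, 40, 39, 40],  # stable
}

features = {rid: moving_average_ratio(vals) for rid, vals in history.items()}
for rid, ratio in features.items():
    print(rid, round(ratio, 3))
```

A ratio well below 1 means the router's recent free memory is low relative to its own baseline, regardless of what is normal for other routers or brands.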
After running a lot of analyses and experiments, we finally chose 10 different “features” to use in our model training and development.
As an additional note, to preempt some questions on readers’ minds: I talked about class imbalance (the problem of having a lot of observations with one class and very few observations from the other class). In the world of machine learning, this is an extremely hard problem to solve. Typically, it is addressed by either generating more observations of the positive class (aka oversampling) or keeping only a random subset of the negative class (aka undersampling). We, of course, tried these approaches. However, given the scale of the imbalance (1 positive observation for 40,000 negative observations), we were getting extremely low precision (percentage of our positive predictions that were correct) as well as low recall (percentage of the class of interest, i.e. faults, that we were able to predict) when we ran the models on the real scale of the data.
Hence, finally, we used all the data without applying any sampling techniques.
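For readers unfamiliar with the sampling techniques mentioned above, random undersampling can be sketched as follows. This is a generic illustration on toy data, not the code we ran; the function name, the `label` field and the 10:1 target ratio are assumptions.

```python
import random

def undersample(observations, negatives_per_positive, seed=0):
    """Keep every positive and a random subset of the negatives."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    positives = [o for o in observations if o["label"] == 1]
    negatives = [o for o in observations if o["label"] == 0]
    keep = min(len(negatives), negatives_per_positive * len(positives))
    sampled = positives + rng.sample(negatives, keep)
    rng.shuffle(sampled)
    return sampled

# Toy data: 5 positives among 10,000 observations.
data = [{"id": i, "label": 1 if i < 5 else 0} for i in range(10_000)]
balanced = undersample(data, negatives_per_positive=10)

print(len(balanced))                      # 55
print(sum(o["label"] for o in balanced))  # 5
```

A model trained on such a resampled set sees positives far more often than it will in production, which is one plausible reason metrics can degrade when it is scored on the full, unsampled data.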
Train / Validation / Test Dataset Preparation:
Now that we had our features prepared — we needed to decide on our testing strategy.
Given the extent of our knowledge of time series data, this was an easy choice to make. We had developed features on approximately 3 months of data: we used the first 2 months to create our training sets (using cross-validation while training) and the last 1 month as our sanitized test dataset.
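The split described above is chronological rather than random, so the model is always evaluated on data that comes after everything it was trained on. A minimal sketch, assuming one feature row per router per day; the dates, field names and the 60/30-day boundary are hypothetical stand-ins for "2 months" and "1 month".

```python
from datetime import date, timedelta

def chronological_split(rows, test_start):
    """Split rows by date so the test set lies strictly after training."""
    train = [r for r in rows if r["day"] < test_start]
    test = [r for r in rows if r["day"] >= test_start]
    return train, test

# Hypothetical: 90 days of daily feature rows for one router.
start = date(2020, 1, 1)
rows = [{"day": start + timedelta(days=i), "router": "router-a"}
        for i in range(90)]

train, test = chronological_split(rows, test_start=start + timedelta(days=60))
print(len(train), len(test))  # 60 30
```

Shuffling rows before splitting would leak future behavior of a router into training, which is why the boundary is a date and not a random fraction.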
Model Training & Results:
As we have a preference towards testing with the minimum viable solution to start with, which I guess one might have noticed by now, we started with using a simple Logistic Regression model. The results were not very encouraging.
So, it was time to move on to more refined approaches.
Having spent a lot of time doing exploratory analysis on the data earlier, we were quite optimistic about decision trees from the beginning. From a business perspective, the traditional approach has always been to put “thresholds” for various technical metrics and make “decisions” based on them. That’s pretty much what decision trees do, except, they can manage a much more complex matrix of thresholds. And that’s the reason behind our optimism!
To cut a long story short, after experimenting with a lot of different decision tree modeling techniques, we finally chose the XGBoost flavour as the winner, giving us the best results in terms of precision as well as recall. Below is a fabricated confusion matrix (since I cannot share corporate data on a public platform) to give you an idea of our results:
As the astute ML expert might be able to notice in a heartbeat — the results are still not very good! Our accuracy metrics are really high (which is very typical of data sets having high class imbalance), however, our precision (i.e. percentage of our predictions for Class 1 which were actually correct) was extremely low. So, we were predicting a lot of routers which might have issues, but a very small percentage of them were actually reporting faults.
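To see how accuracy can look excellent while precision stays poor, here is the arithmetic on a set of invented counts. These numbers are made up purely for illustration and are not the confusion matrix referred to above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Invented counts over 1,000,000 observations, 100 of them true faults.
metrics = classification_metrics(tp=10, fp=4_990, fn=90, tn=994_910)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
# accuracy: 0.9949
# precision: 0.0020
# recall: 0.1000
```

Because almost every observation is a true negative, accuracy is dominated by the negative class and says very little about how well faults are being caught.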
What I am about to say will come as a surprise for most data science professionals,
Even less than perfect results are usable!
We did not have a good precision! We had too many false positives! The algorithm was just not performing as expected!
But, let’s take a step back. Our objective was not to create an awesome model; our objective was to identify a targeted segment of customers on whom we could take remote actions, in a bid to improve customer experience. The cost of our remote action, i.e. a reboot, was zero! We had a limit on the number of reboots we could trigger every night, and our daily predictions were below that. So, why bother with precision? Let’s focus on recall (i.e. percentage of actual class 1 observations that we are able to capture with our predictions)! Our recall was quite healthy: all in all, we were able to identify a small segment, less than 1% of our customers, who would report 10% of our faults. Now that’s something worth experimenting with!
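The reasoning above can be sketched in two small helpers: one picks the nightly reboot list under a capacity limit, the other expresses "less than 1% of customers, 10% of faults" as a lift over random targeting. The scores, threshold, capacity and function names are hypothetical.

```python
def select_for_reboot(scores, threshold, nightly_capacity):
    """Pick routers whose score clears the threshold, highest first,
    without exceeding the nightly reboot capacity."""
    flagged = [(rid, s) for rid, s in scores.items() if s >= threshold]
    flagged.sort(key=lambda item: item[1], reverse=True)
    return [rid for rid, _ in flagged[:nightly_capacity]]

def lift(pct_faults_captured, pct_customers_targeted):
    """How many times better than random targeting the segment is."""
    return pct_faults_captured / pct_customers_targeted

# Hypothetical model scores for four routers.
scores = {"router-a": 0.91, "router-b": 0.12,
          "router-c": 0.67, "router-d": 0.55}

targets = select_for_reboot(scores, threshold=0.5, nightly_capacity=2)
print(targets)      # ['router-a', 'router-c']
print(lift(10, 1))  # 10.0
```

Since the targeted segment was below 1% of customers, a lift of 10 is a lower bound: the segment was at least ten times more fault-prone than a random customer.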
We went back to the market and ran an A/B test again. With a productionized model predicting a small set of customers every day, we started doing remote reboots only for these customers. The results were positive — we were now reducing faults from these customers. Considering that randomly selecting and rebooting routers remotely was leading to an increase in faults — we were performing much better than our naive / base model.
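One common way to check whether a difference in fault rates between a rebooted group and a control group is more than noise is a two-proportion z-test. The sketch below uses invented group sizes and fault counts; it is not a description of the analysis we actually ran.

```python
from math import erf, sqrt

def two_proportion_z_test(faults_a, n_a, faults_b, n_b):
    """z statistic and two-sided p-value for the difference between
    two fault rates, using the pooled standard error."""
    p_a, p_b = faults_a / n_a, faults_b / n_b
    pooled = (faults_a + faults_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return z, p_value

# Invented counts: 5,000 predicted-at-risk routers in each group.
treated_faults, treated_n = 400, 5_000   # rebooted
control_faults, control_n = 500, 5_000   # left alone

z, p_value = two_proportion_z_test(
    treated_faults, treated_n, control_faults, control_n
)
relative_change = (treated_faults / treated_n) / (control_faults / control_n) - 1

print(f"z = {z:.2f}, p = {p_value:.4f}, change = {relative_change:+.0%}")
```

A negative z with a small p-value would indicate that the rebooted group reported fewer faults than the control group by more than chance alone would explain.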
Eureka! It Works!
And that’s the story of how we are now constantly monitoring our customers’ routers, day and night, with our machine learning models and taking actions remotely to ensure that they enjoy an always connected experience!
But, this is not the end — this is the beginning…
Scaling Perceptro: The Future
Perceptro — Our platform to perceive a problem with internet connectivity for our customers
We’re now working on the next phase of evolution based on the foundation and learnings that our first minimum viable product has created. We see value in scaling our efforts and expanding in 2 areas:
- Model enhancement: While our model is able to drive a certain level of recall at this stage, we need to improve our existing models, or add new models altogether, with a view to improving recall further while also optimizing precision.
- Actions portfolio: Today, we are only taking extremely simple remote actions (rebooting a router). We are working on taking more advanced actions (again, mostly free from a cost perspective) to maximize the issues that we can deflect.
Our full vision of Perceptro includes increasing the amount of data
It’s been an extremely interesting journey to get here. Our next steps are even more ambitious, but we have established that we are heading in the right direction, and we’re confident we have the right competencies to achieve our lofty goals!