AI Singapore’s Journey Into the World of Federated Learning

AI Singapore recently formed a team to build a platform for Federated Learning. This article is the first in a series that serves as a journal of our journey into the world of Federated Learning. It is not meant to be a survey of Federated Learning, which is itself a huge and active research area. Rather, we seek to capture our understanding of Federated Learning, why we feel it is important and what we plan to work on.

Why is Federated Learning needed?

We would like to start with the diagram below.

Cross-industry standard process for data mining (CRISP-DM, from https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining)

CRISP-DM (Cross-Industry Standard Process for Data Mining) is an open standard process model that describes the common approaches used by data mining experts. There are alternative methodologies, e.g. SEMMA by SAS and the more recent Team Data Science Process by Microsoft. Although CRISP-DM was conceived more than 20 years ago for the data mining community, its fundamental principles remain relevant for the AI/data science community at large today: data remains the core, and projects typically move through a few key stages: business understanding, data acquisition and understanding, modelling (including data preparation/feature engineering, modelling and evaluation), and deployment.

Data is Key

From a data management perspective, data acquisition typically starts with data discovery: the process of searching for existing data, inside or outside an organisation, that is available for modelling. With the rapid pace of digitalisation across most industries, more and more data is being generated, usually by many different systems, and much of it without any specific downstream AI project in mind. As a result, the data is often not easily discoverable, and a data catalog system becomes necessary. This is one of the reasons driving the need for a common data platform in AI Singapore.

Suppose we discover some data that could be relevant to the problem statement. In many cases, the data is incomplete and needs to be augmented with additional information. For example, a model that predicts whether an insurance policy is likely to lapse is useful to insurers, yet data from the insurance domain alone may be insufficient to build an accurate model: a lapse could be driven by factors the insurance company does not capture, such as the policy holder losing their income. Banks hold personal data that could augment the insurance data, if only it could be shared.

On the other hand, cross-organisation sharing of personal data raises privacy concerns, and there is growing awareness of the need to stay compliant with regulations such as the GDPR and the PDPA. Even where compliant sharing can be ensured, there are plenty of business and commercial considerations that leave organisations with little justification or incentive to share the data they have acquired.

So the question now is how we can better utilise the data that is scattered across different systems in a privacy-preserving manner. We believe Federated Learning is a promising solution.

What is Federated Learning?

Federated Learning is a machine learning technique that trains a model across multiple decentralised parties holding local data, without that data ever being exchanged. This differs from conventional machine learning, where all the local datasets are uploaded to one centralised location.

An illustration of Federated Learning

Don’t be daunted by this seemingly complex diagram. Here are the main steps involved:

  1. A group of parties (each holding local data) comes together to form a network, with the common goal of training a model together. The number of parties varies depending on the use case. They agree on the type of model to be trained.
  2. The Trusted Third Party (TTP) acts as the coordinator; it does not contribute data. It initialises the agreed model and sends it to all participating parties, where it serves as the baseline from which each party starts training on its own local data.
  3. Periodically, all parties send their learnings (weights, gradients, losses, etc.) to the TTP. No local data is ever exposed. For stronger protection, mechanisms like homomorphic encryption or secure multi-party computation can also be applied to these updates.
  4. The TTP then aggregates the new learnings from the parties to continue improving the shared model.
  5. The updated shared model is sent back to the participating parties and the cycle repeats. With each iteration, the shared model maintained by the TTP gets better (a minimal sketch of one such round follows these steps).
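
To make these steps concrete, below is a minimal sketch of the training loop in plain Python/NumPy. It uses the classic FedAvg aggregation (a weighted average of the parties' weights); the linear model, learning rates and party data are invented for illustration, and this is not the API of any particular framework.

```python
import numpy as np

def local_update(global_w, X, y, lr=0.1, epochs=5):
    """A party's local training: a few epochs of gradient descent on a
    linear model (squared loss), starting from the shared global weights."""
    w = global_w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # gradient of the local MSE loss
        w -= lr * grad
    return w  # only the weights leave the party, never X or y

def fedavg(updates, sizes):
    """TTP-side aggregation: average the parties' weights, weighted by
    how many local samples each party holds (the FedAvg rule)."""
    total = sum(sizes)
    return sum(w * (n / total) for w, n in zip(updates, sizes))

# Three parties, each holding private local data (steps 1-2).
rng = np.random.default_rng(0)
parties = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(3)]

global_w = np.zeros(3)
for _ in range(10):  # steps 3-5, repeated round after round
    # In practice, the updates could additionally be protected with
    # secure aggregation or homomorphic encryption before reaching the TTP.
    updates = [local_update(global_w, X, y) for X, y in parties]
    global_w = fedavg(updates, [len(y) for _, y in parties])
```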

There are variations of Federated Learning, depending on how the steps above are carried out. For example, a TTP is not always needed to coordinate the model aggregation. One important variation worth highlighting stems from how the data held by the different parties “overlaps”.

In the example above, where the insurance company’s data can be augmented with data from the banks to build a better policy lapse model, the parties have a large overlap in the user space but a small overlap in the feature space. This scenario is called Vertical Federated Learning, or Feature-based Federated Learning. Typically, only one party holds the label y (the ground truth); in the policy lapse example, only the insurance company knows whether a policy has actually lapsed.

Vertical Federated Learning

In the opposite scenario, different parties have a large overlap in the feature space but a small overlap in the user space. This is called Horizontal Federated Learning, or Sample-based Federated Learning.

Horizontal Federated Learning

In this illustration, Parties A, B and C hold data for different user IDs, while sharing the same set of features x1, x2 and x3. Each party also holds the labels for the data it owns. The best-known application of Horizontal Federated Learning is Google’s Gboard, where millions of mobile devices jointly train Gboard’s query suggestion model, with every device collecting the same set of features.

Horizontal and Vertical Federated Learning would require different treatments when aggregating learnings from different participants.
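
The difference between the two settings is easiest to see in a toy example. In the pandas sketch below, the parties, column names and IDs are all invented for illustration; note how in the vertical case the user IDs line up while the features differ, and only one party holds the label.

```python
import pandas as pd

# Horizontal (sample-based) FL: the parties share the feature space
# and each holds the labels for its own, disjoint, set of users.
party_a = pd.DataFrame({"x1": [0.2, 0.5], "x2": [1.1, 0.3], "y": [0, 1]},
                       index=["user_001", "user_002"])
party_b = pd.DataFrame({"x1": [0.9, 0.1], "x2": [0.4, 2.2], "y": [1, 0]},
                       index=["user_103", "user_104"])  # different users

# Vertical (feature-based) FL: the parties share the user space but hold
# different features; only the insurer holds the label (lapsed or not).
insurer = pd.DataFrame({"premium": [1200, 800], "policy_age": [4, 9],
                        "lapsed": [0, 1]},
                       index=["user_001", "user_002"])
bank = pd.DataFrame({"income": [5500, 3100], "debt_ratio": [0.2, 0.7]},
                    index=["user_001", "user_002"])  # same users
```

In the vertical setting, a real system would first align the overlapping IDs, typically with a technique such as private set intersection, so that the parties learn which users they share without revealing the users they do not.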

What is AISG doing in this area?

Our goal is to build a platform to support Federated Learning. Key features include (but are not limited to):

  • It should support both Horizontal and Vertical Federated Learning.
  • It should support most mainstream machine learning algorithms, such as logistic regression, tree-based algorithms and deep learning.
  • It should support multiple federated aggregation mechanisms. Whether the setting is Horizontal or Vertical, one of the key problems in Federated Learning is dealing with statistical heterogeneity across the participants’ data. In the Federated Learning setting, it is quite common for the data owned by individual parties to be generated by different processes, which violates the independent and identically distributed (I.I.D.) assumption underpinning conventional machine learning techniques. How to address this effectively remains an active research topic. The most commonly used mechanism is FedAvg, which is known to have drawbacks; many improvements have been proposed, e.g. FedProx, though these have their own pros and cons too (see the sketch after this list). Our platform would give users the option to choose between different mechanisms.
  • It should provide a contribution and incentive mechanism. As shown above, one of the main benefits of Federated Learning is that it enables collaborative model training without any party exposing its training data. But this is a double-edged sword: it also opens the door to “free-riders”, i.e. participants who try to profit unilaterally by deliberately injecting dummy data into the training process. An extreme version of this scenario is a participant launching a data poisoning attack. How to eliminate data poisoning attacks in a Federated Learning setting is still an active research area. As a first step, a contribution and incentive mechanism could help identify potential free-riders, so that the collective benefit of all the participants can be optimised.
  • It should support auto-tuning of hyper-parameters.
  • It should be easy to deploy and use. The platform will support deployment via Docker Compose, and will be integrated with lifecycle management platforms like MLflow to provide one-stop management of the whole Federated Learning training cycle, from reproducibility to model registry to deployment.
  • It should be tightly integrated with our common data platform. At AI Singapore we take a holistic view of data management and machine learning, and with this integration users will be able to search for and discover data contributed by other users that could be relevant to their Federated Learning model.
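
To illustrate the point about aggregation mechanisms, the sketch below contrasts a plain local gradient step, as a FedAvg party would take, with FedProx’s version, which adds a proximal term pulling each party’s weights back towards the current global model and thereby limiting drift under non-I.I.D. data. This is a schematic rendering of the published algorithms, not code from our platform.

```python
import numpy as np

def fedavg_local_step(w, w_global, grad, lr=0.05):
    # FedAvg: each party simply runs SGD on its own local loss, which can
    # drift far from the global model when local data is non-I.I.D.
    return w - lr * grad

def fedprox_local_step(w, w_global, grad, lr=0.05, mu=0.1):
    # FedProx: the local objective gains a proximal term
    # (mu / 2) * ||w - w_global||^2, whose gradient mu * (w - w_global)
    # pulls the local weights back towards the current global model.
    return w - lr * (grad + mu * (w - w_global))
```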

Questions may be asked as to why AI Singapore is investing effort in building yet another Federated Learning platform when a handful already exist, including TensorFlow Federated and FATE (Federated AI Technology Enabler). We see potential applications of Federated Learning in our collaborations with industry partners to deliver our flagship 100E programme, and during our search for suitable solutions we found some limitations in the existing frameworks and platforms. To name a few:

  • Some of them only support Horizontal Federated Learning, whereas we see applications of both Horizontal and Vertical Federated Learning across our use cases.
  • Most of the existing frameworks/platforms assume that all participants build local models with the same configuration. This may not be ideal, as some parties may achieve a better local model with a configuration that differs from the other parties’. A related technique is Split Learning, in which different parties jointly train a neural network: each party trains the first few layers locally, up to a specific layer known as the cut layer, and the outputs at the cut layer are then sent to the TTP, which completes the rest of the training without ever seeing any party’s raw data. Split Learning can be configured so that different parties train different local network configurations, and it will also be implemented as one of our supported federated aggregation mechanisms (a minimal sketch follows this list).
  • There is no explicit treatment of the “free-rider” problem. This problem has been studied extensively for peer-to-peer systems, but much less so in Federated Learning. We feel it is important to address in order to keep Federated Learning sustainable, and we will design and implement a contribution and incentive mechanism as a first step.
  • Most of them lack end-to-end integration with other systems, such as data catalogs and lifecycle management. We feel these are all important components for making an ML system work in a production setup.
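
As an illustration of the cut-layer idea, here is a minimal Split Learning sketch in PyTorch. For simplicity it runs both sides in one process and assumes the TTP holds the labels; the layer sizes are arbitrary, and a real deployment would exchange the cut-layer activations and their gradients over the network with appropriate privacy protections.

```python
import torch
import torch.nn as nn

# Party side: the first few layers, up to the cut layer.
party_net = nn.Sequential(nn.Linear(10, 32), nn.ReLU())
# TTP side: the remaining layers; it never sees the raw features.
ttp_net = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 1))

optimiser = torch.optim.SGD(
    list(party_net.parameters()) + list(ttp_net.parameters()), lr=0.01)
loss_fn = nn.BCEWithLogitsLoss()

X = torch.randn(64, 10)                   # raw data stays with the party
y = torch.randint(0, 2, (64, 1)).float()  # labels held by the TTP here

# Forward pass: only the cut-layer activations cross the trust boundary.
cut_activations = party_net(X)
logits = ttp_net(cut_activations)

# Backward pass: gradients flow back through the cut layer to the party,
# so each side updates its own layers without sharing raw data.
loss = loss_fn(logits, y)
optimiser.zero_grad()
loss.backward()
optimiser.step()
```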

The points above should not be taken to suggest that we feel our platform is superior to the existing ones. Quite the contrary: as a latecomer to this field, we have drawn a lot of inspiration from them, and we have been actively contributing back to some of the open source projects in the field, for example the FATE community.

The area of Federated Learning is evolving rapidly. Building a platform from the ground up and applying it in projects will equip our apprentices with important skills for delivering AI projects in a real-life environment that is becoming increasingly privacy-aware. It will also give them the opportunity to learn together how to avoid hidden technical debt.

In the subsequent articles in this series, you will hear more about how the platform is architected and designed, how certain important issues in Federated Learning are addressed, and about some use cases with the platform. Stay tuned!
