Discussion: What are some effective strategies for managing data collection and preparation in ML projects? How do you ensure data quality?
For a project manager on Machine Learning (ML) projects, data needs to be managed and tracked closely to ensure successful and timely delivery. There are many situations where data can slow down or even block project progress:
- Unavailable or insufficient data (quantity)
- Suboptimal data quality, e.g. missing values, missing or incorrect annotations, etc.
- Data not representative of actual use cases
- Bad data distribution such as class imbalance
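Several of these issues can be surfaced early with a quick automated audit. The snippet below is only a minimal sketch using the Python standard library. It assumes the dataset is a list of dictionaries with a `label` field; the function name `audit_records` and the field names are hypothetical and would need adapting to your own data.

```python
from collections import Counter


def audit_records(records, label_key="label"):
    """Summarise common data issues in a list of dict records.

    Assumption: each record is a flat dict with hashable values, and
    a missing value is represented as None or an empty string.
    """
    total = len(records)

    # Records with at least one missing field
    missing = sum(
        1 for r in records
        if any(v is None or v == "" for v in r.values())
    )

    # Exact duplicates: every copy beyond the first counts as one duplicate
    seen = Counter(
        tuple(sorted(r.items(), key=lambda kv: kv[0])) for r in records
    )
    duplicates = sum(count - 1 for count in seen.values())

    # Class distribution over the records that actually have a label
    labels = Counter(
        r[label_key] for r in records
        if r.get(label_key) not in (None, "")
    )
    labelled = sum(labels.values())
    class_share = {k: v / labelled for k, v in labels.items()} if labelled else {}

    return {
        "total": total,
        "with_missing_values": missing,
        "duplicates": duplicates,
        "class_share": class_share,
    }


if __name__ == "__main__":
    sample = [
        {"text": "good product", "label": "pos"},
        {"text": "good product", "label": "pos"},  # duplicate
        {"text": "", "label": "neg"},              # missing text
        {"text": "not sure", "label": None},       # missing annotation
    ]
    print(audit_records(sample))
```

A report like this does not say whether the data is representative of actual use cases, which still needs domain review, but it gives the team concrete numbers to track from the first data delivery onwards.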
The quality and organisation of data directly affect the performance of ML models, so a project manager needs to ensure that the project team identifies potential data risks and tackles data issues early in the project. Keeping the following points in mind helps to manage these data concerns:
- Data Collection: Maintain clear, well-documented data collection protocols for consistent and reliable outcomes
- Data Understanding: Analyse and understand the data and its sources to foresee potential quality issues
- Data Cleaning: Ensure a process for data cleaning and managing missing, duplicate or inconsistent data is in place
- Data Verification and Validation: Establish procedures for data verification, combining automated checks with manual review
- Data Governance: Implement clear data governance policies, covering data access, security, privacy, etc.
- Data Split: Ensure that there is a training, validation and test data split strategy, preferably a reproducible one
- Data Documentation: Keep thorough data documentation for uniform understanding
- Tools: Equip the project team with appropriate data quality management and version tracking tools
- Team Culture: Foster a team culture that prioritises data quality and encourages shared responsibility for improving it
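To illustrate the "Data Split" point, one possible way to make a split reproducible is to derive it from a hash of a stable record ID instead of a random shuffle. This is a sketch under assumptions, not a prescribed method: the function name `assign_split`, the 80/10/10 proportions and the ID format are all hypothetical.

```python
import hashlib


def assign_split(record_id, val_fraction=0.1, test_fraction=0.1):
    """Deterministically map a record ID to 'train', 'validation' or 'test'.

    Assumption: every record has a stable, unique ID. The same ID always
    lands in the same split, regardless of dataset order or size.
    """
    digest = hashlib.sha256(str(record_id).encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash as a number in [0, 1)
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "validation"
    return "train"


if __name__ == "__main__":
    for rid in ["img_0001", "img_0002", "img_0003"]:
        print(rid, assign_split(rid))
```

Because the assignment depends only on the ID, newly collected records can be added later without moving existing records between splits, which keeps test results comparable across dataset versions. It does not stratify by class, so a heavily imbalanced dataset may need a different strategy.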
What data-related obstacles and challenges have you faced when managing your ML projects? Feel free to share the approaches you took to control and resolve them.