DomainNet Reproduction Issues: OT-Cost, Forgetting, Accuracy
Hey everyone! I'm working on reproducing the DomainNet results from the CVPR25-DUCT paper, and I've hit a few snags. I wanted to share these issues and hopefully get some insights or guidance from the authors and the community. There seem to be some discrepancies between the code, the logs, and the reported results, so let's break it down.
1. DomainNet Fails to Run
The first major hurdle I encountered is getting DomainNet to train at all using the current repository. This is a pretty fundamental issue, as it blocks any further experimentation or result verification.
Specifically, I haven't been able to train the model on DomainNet end-to-end with the provided code. I've been trying different configurations and debugging the setup, but so far, no luck.
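In the meantime, here's the kind of sanity check I've been running on my local copy, just to rule out a broken download or folder layout. It's a minimal sketch that assumes the standard extracted DomainNet layout (one folder per domain, one subfolder per class) and plain torchvision loading; the `data/domainnet` path is my own placeholder, and none of this is the repo's data pipeline:

```python
# Quick sanity check on a local DomainNet copy: count classes/images per domain.
# Assumes the usual extracted layout: <root>/<domain>/<class>/<image>.jpg
# This is NOT the repo's loader -- just a way to confirm the data itself is intact.
from pathlib import Path

from torchvision import datasets, transforms

DOMAINS = ["clipart", "infograph", "painting", "quickdraw", "real", "sketch"]
ROOT = Path("data/domainnet")  # placeholder path; adjust to your setup

tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

for domain in DOMAINS:
    domain_dir = ROOT / domain
    if not domain_dir.is_dir():
        print(f"{domain}: missing directory {domain_dir}")
        continue
    ds = datasets.ImageFolder(str(domain_dir), transform=tfm)
    print(f"{domain}: {len(ds.classes)} classes, {len(ds)} images")
```

On my machine the data itself looks fine, which is why I suspect the issue is on the configuration side.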
Request for Assistance: I'm reaching out to the authors (and anyone else who might have experience with this) to ask if they could share a working configuration file or script that can successfully train DomainNet end-to-end. Any instructions on dataset preprocessing steps would also be incredibly helpful. It would be awesome if we could get a clear, reproducible starting point for this dataset.
2. OT Cost Mismatch (mbtc.py vs. duct.py)
Digging deeper, I noticed a discrepancy related to the Optimal Transport (OT) cost calculation. Reviewing the logs provided, which seem to reference mbtc.py, I see printouts for “The OT cost…”. However, when running the provided duct.py script, there's no evidence of OT-cost computation or print statements.
This raises a pretty important question: are the provided logs from a different or older code path, possibly using mbtc.py instead of the current duct.py? Without knowing exactly which script and commit produced them, it's hard to tell what is actually being computed.
Question for the Authors: Could you clarify whether the logs originate from an older codebase or a different script (like mbtc.py)? If so, sharing the exact script or commit that produced those logs would be incredibly helpful. Alternatively, if the OT step is supposed to be enabled within duct.py, could you provide instructions on how to do so?
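For context, this is the kind of quantity I'd expect a "The OT cost…" log line to refer to: an optimal-transport cost between two sets of features or class prototypes. The sketch below is a generic entropic (Sinkhorn) version I wrote for my own debugging; the function name, the prototype shapes, and the choice of class prototypes are all my assumptions, not something taken from mbtc.py or duct.py:

```python
# A generic OT-cost sketch between two sets of class prototypes.
# This is my guess at what the "The OT cost..." log line might refer to;
# it is NOT taken from mbtc.py or duct.py.
import numpy as np

def sinkhorn_ot_cost(x, y, reg=0.1, n_iters=200):
    """Entropic OT cost between uniform distributions over the rows of x and y."""
    # Pairwise squared Euclidean cost matrix, normalised for numerical stability.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    cost = cost / cost.max()
    a = np.full(len(x), 1.0 / len(x))
    b = np.full(len(y), 1.0 / len(y))
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]
    return float((plan * cost).sum())

rng = np.random.default_rng(0)
old_protos = rng.normal(size=(10, 64))   # e.g. prototypes from a previous domain
new_protos = rng.normal(size=(10, 64))   # e.g. prototypes from the current domain
print("OT cost (my sketch):", sinkhorn_ot_cost(old_protos, new_protos))
```

If the real computation uses exact OT (e.g. via the POT library), a different cost matrix, or non-uniform marginals, the numbers would differ, which is exactly why the original script would help.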
3. CDDB “Forgetting” Discrepancy
Now, let's talk about the CDDB (Continual Deepfake Detection Benchmark) results. There's a noticeable difference in the reported “Forgetting” metric between the logs and the paper. The CDDB log shows a Forgetting value of 6.145 after the last task, but the paper reports a much lower value of 0.12. That’s a significant gap!
This kind of discrepancy can arise from various factors, such as different definitions of the forgetting metric, different scales used in the calculation, or variations in the evaluation protocols. For example, the averaging method (per-task or overall), normalization techniques, the class-increment schedule, or even the order of domains could all influence the final forgetting value. Figuring out the exact cause is key to understanding and reproducing the results.
Question and Request for Clarification: It would be super helpful if the authors could clarify the exact formula and evaluation script used to compute forgetting for the paper. Sharing the code that reproduces the 0.12 value would be fantastic. This level of detail is crucial for ensuring reproducibility and validating the research findings.
Delving Deeper into the Forgetting Metric Discrepancy
The forgetting metric is a critical aspect of evaluating continual learning algorithms: it quantifies how much performance a model loses on previously learned tasks as it learns new ones. A discrepancy as large as the one between the logs and the paper (6.145 vs. 0.12) can indicate significant differences in the evaluation methodology or the implementation of the algorithm.
To get to the bottom of this, we need to consider several factors that could contribute to this difference:
- Definition of Forgetting: There are different ways to define and calculate forgetting. Common variants differ in whether the drop is measured from the best accuracy ever attained on a task or from the accuracy right after that task was learned, and whether the drops are averaged over all previous tasks or reported only for the final task. The specific definition used can significantly change the numerical value (see the sketch after this list for the definition I've been assuming).
- Scale and Normalization: The scale at which forgetting is measured can also play a role. For instance, the metric might be normalized by the initial performance on each task or by the range of possible performance values. Normalization methods can help to make the metric more comparable across different tasks and datasets.
- Evaluation Protocol: The evaluation protocol, including the order in which tasks are presented and the number of training steps per task, can influence the amount of forgetting observed. Different task orders can lead to variations in the model's ability to retain knowledge.
- Class-Increment Schedule: In continual learning scenarios, the way new classes are introduced over time can affect forgetting. A gradual introduction of classes might result in less forgetting compared to introducing all classes at once.
- Domain Order: The order in which domains are presented to the model can also impact forgetting. Some domain sequences might be more challenging than others, leading to greater knowledge loss.
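To make the definition question concrete, here's the formula I've been assuming when reading the logs: the common "average drop from the best previously achieved accuracy" definition, computed on a 0-100 accuracy scale. This is only my assumption, not the paper's evaluation code:

```python
# Forgetting as I currently compute it: average, over all tasks except the last,
# of (best accuracy ever achieved on that task) - (accuracy after the final task).
# Accuracies are on a 0-100 scale here; a 0-1 scale would shrink the value 100x.
# This is my assumption, not necessarily the definition used in the paper.
import numpy as np

def average_forgetting(acc_matrix):
    """acc_matrix[t, k] = accuracy on task k after training on task t (t >= k)."""
    acc = np.asarray(acc_matrix, dtype=float)
    T = acc.shape[0]
    drops = []
    for k in range(T - 1):                      # the last task cannot be forgotten yet
        best_previous = acc[k:T - 1, k].max()   # best accuracy on task k before the final step
        drops.append(best_previous - acc[T - 1, k])
    return float(np.mean(drops))

# Toy example with 3 tasks (rows: after training task t; columns: eval on task k).
acc = [
    [80.0,  0.0,  0.0],
    [74.0, 78.0,  0.0],
    [72.0, 75.0, 81.0],
]
print(average_forgetting(acc))  # (80-72 + 78-75) / 2 = 5.5
```

Under this definition, a value like 6.145 means losing roughly six accuracy points on average, while 0.12 means almost no forgetting at all, so pinning down the exact formula and scale really matters.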
Why Precision in the Forgetting Metric Matters:
Understanding and accurately measuring the forgetting metric is essential for several reasons:
- Benchmarking Continual Learning Algorithms: It provides a standardized way to compare the performance of different continual learning methods.
- Diagnosing Model Limitations: A high forgetting value can indicate that a model is struggling to retain previously learned knowledge, highlighting areas for improvement.
- Developing Mitigation Strategies: By understanding the factors that contribute to forgetting, researchers can develop techniques to reduce its impact, such as regularization methods or memory replay strategies.
In the context of the DomainNet and CDDB experiments, a clear understanding of the forgetting metric is crucial for validating the results and ensuring that the proposed method effectively mitigates forgetting in continual learning scenarios. Sharing the exact code and methodology used to compute the forgetting metric would greatly enhance the transparency and reproducibility of the research.
4. Large Accuracy Gap vs. Logs
Finally, and perhaps most concerning, is the substantial gap between the accuracies I get when running the provided repository as-is and the numbers presented in the shared logs. Replicating experimental results is a cornerstone of scientific research, and this difference raises questions about the reproducibility of the findings.
The gap could be due to a number of factors, including variations in the training setup, differences in the codebase used, or simply random fluctuations. Pinpointing the exact cause requires a detailed comparison of the experimental settings and configurations.
Request for Detailed Information: To help resolve this, I'm requesting the authors to share the following information:
- Exact Commit Hash: The specific commit hash of the repository used to generate the logs.
- Config Files: The exact configuration files used for the experiments.
- Seeds: The random seeds used for training (to ensure reproducibility).
- Domain/Class Order: The order in which domains and classes were presented to the model.
- Non-Default Hyperparameters: Any hyperparameters that were set to non-default values.
- Checkpoints: If possible, the checkpoints used to generate the logs.
Sharing this level of detail would let others recreate the experimental setup exactly and track down the source of the discrepancies. On my end, the sketch below shows how I'm currently pinning the random seeds for my reruns.
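This is my own seeding setup, not anything taken from the repository; the seed value and the cuDNN flags are simply the choices I happened to make:

```python
# How I pin randomness for my reruns -- my own setup, not the authors'.
import random

import numpy as np
import torch

def set_seed(seed: int = 1993) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade speed for determinism in cuDNN-backed ops.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(1993)
```

If the authors seeded things differently (or there are additional sources of nondeterminism, e.g. in the data ordering), knowing that would already narrow down the accuracy gap.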
The Significance of Reproducible Results
In scientific research, reproducibility is the bedrock upon which credibility and progress are built. When results can be consistently reproduced by independent researchers, it strengthens the validity of the findings and fosters trust within the scientific community. Conversely, if results cannot be replicated, it raises questions about the methodology, implementation, or even the conclusions drawn from the research.
Why is Reproducibility So Important?
- Validates Scientific Claims: Reproducibility serves as a critical check on the accuracy and reliability of scientific claims. If an experiment can be replicated, it provides strong evidence that the observed effects are real and not due to chance or error.
- Builds Trust in Research: When researchers can reproduce each other's work, it builds trust and confidence in the scientific process. This trust is essential for the advancement of knowledge and the application of research findings in real-world contexts.
- Facilitates Progress: Reproducible research enables other scientists to build upon existing work, extend the findings, and apply them to new problems. This cumulative process is how scientific progress is made.
- Identifies Errors and Biases: Attempting to reproduce research can uncover errors in the original methodology, implementation, or analysis. It can also reveal potential biases that might have influenced the results.
- Ensures Transparency and Accountability: Reproducibility promotes transparency in the research process, as it requires researchers to clearly document their methods, data, and code. This accountability helps to maintain the integrity of scientific research.
Challenges to Reproducibility:
Despite its importance, achieving reproducibility in research can be challenging. Some common obstacles include:
- Lack of Detailed Documentation: Incomplete or unclear documentation of methods, data, and code can make it difficult to replicate an experiment.
- Software and Hardware Dependencies: Research that relies on specific software versions or hardware configurations may be hard to reproduce if those resources are not readily available.
- Randomness and Variability: Some experiments involve random processes or are subject to natural variability, making it challenging to obtain the exact same results.
- Publication Bias: There is often a bias towards publishing positive results, which can lead to an underrepresentation of negative or null findings. This can make it difficult to assess the true reproducibility of research.
- Complexity of Experiments: As research becomes more complex, with intricate methodologies and large datasets, the challenges of reproducibility increase.
Best Practices for Enhancing Reproducibility:
To promote reproducibility, researchers should adopt best practices such as:
- Detailed Documentation: Clearly document all aspects of the research, including methods, data, code, and experimental setup.
- Open Data and Code: Make data and code publicly available whenever possible.
- Version Control: Use version control systems to track changes in code and ensure that the exact version used in the research can be retrieved (see the sketch after this list for one way to record this alongside each run).
- Standardized Protocols: Follow standardized protocols and reporting guidelines to ensure consistency and comparability across studies.
- Replication Studies: Conduct replication studies to verify the findings of original research.
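As one small, concrete way to apply the documentation and version-control points above, this is the kind of run manifest I now write next to each experiment while debugging. It's purely my own habit; the file name and config path are placeholders, not part of the DUCT repository:

```python
# Save a small manifest next to each run so it can be matched to an exact commit later.
# This is just my own debugging habit, not part of the DUCT repository.
import json
import subprocess
import sys

import torch

def write_run_manifest(path: str, seed: int, config_file: str) -> None:
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    manifest = {
        "commit": commit,
        "python": sys.version,
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
        "seed": seed,
        "config": config_file,
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)

write_run_manifest("run_manifest.json", seed=1993, config_file="path/to/config.json")
```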
In the context of the DomainNet experiments, addressing the accuracy gap is crucial for ensuring the reproducibility and validity of the research. By sharing detailed information about the experimental setup, the authors can help others to replicate their results and build upon their work.
Conclusion
I'm confident that by addressing these issues (the DomainNet training failure, the OT-cost mismatch, the CDDB forgetting discrepancy, and the accuracy gap) we can get a clearer picture of the research findings. I'm happy to share my full console logs and configurations to aid in the debugging process. Sharing the exact scripts, configurations, and any missing components used for DomainNet and CDDB would make the reproduction process much smoother.
Thanks in advance for your help and insights! Let's work together to make this research fully reproducible and accessible. This collaborative approach is key to advancing our understanding and pushing the boundaries of what's possible in machine learning. Reproducibility isn't just a box to check; it's the foundation of trustworthy and impactful research, and I'm excited to contribute to that goal.