Here's the repo: https://github.com/NVIDIA/NeMo-Curator/
Some documentation on the fuzzy dedup scripts: https://docs.nvidia.com/nemo-framework/user-guide/latest/dat...
And a Python example: https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/fu...
Would be interested in any feedback from the folks here.