India's outdated copyright laws threaten to stall AI innovation and research
India’s copyright laws may soon come under strain as artificial intelligence development grows. Current rules could limit access to the digital materials needed to train AI models. Experts warn this might create a 'knowledge enclosure', in which only licensed or private entities control key datasets.
Text and Data Mining (TDM), the automated analysis of vast amounts of text, images, audio, and code, is the process behind training AI systems. It involves three main steps: copying full-text materials, extracting structured data, and storing embeddings for future analysis. While this method fuels innovation, India’s existing copyright framework struggles to accommodate it.
Section 52 of the Indian Copyright Act allows fair dealing for 'private or personal use, including research'. However, it was never designed for the large-scale automated copying and storage that machine learning requires. Meanwhile, Section 14 gives copyright owners the exclusive right to reproduce a work, including storing it in any medium by electronic means, which could restrict AI training carried out without explicit permission.
Other regions have adopted different approaches. The EU’s 2019 DSM Directive introduced a two-tiered TDM system: Article 3 exempts TDM for scientific research by research organisations and cultural heritage institutions, while Article 4 permits TDM for any purpose unless rightholders expressly opt out. The UK’s exception is narrower, covering TDM only for non-commercial research by those with lawful access to the work; a 2022 proposal to extend it to any purpose was later withdrawn. In the US, fair use principles often protect large-scale scanning when it is deemed transformative and beneficial to the public.
Critics argue that India’s current laws risk concentrating power in the hands of a few businesses. Those with access to high-quality datasets gain a significant market advantage, reinforcing a 'data-enclosure' effect that could stifle competition and innovation.
Without updates to its copyright framework, India may fall behind in AI development. The EU’s two-tiered TDM model, or a broader any-purpose exception of the kind the UK proposed, could offer solutions. Adjusting the law would help balance the needs of researchers, businesses, and rightholders while preventing a restricted digital knowledge landscape.