#124 From Open Web to Walled Gardens: The Decline of AI-Accessible Data
In this edition of Technopolitik, Rijesh Panicker writes about data available for training AI, and Avinash Shet follows with a piece on the state of India’s deeptech startup ecosystem.
This newsletter is curated by Adya Madhavan.
Technopolitik: Shrinking Data Commons for AI
— Rijesh Panicker
The latest AI Index Report from Stanford has an interesting chapter on recent trends in data availability for AI training. Referencing this study from mid-2024, it shows that over the course of 2024 there has been a significant increase in the number of restrictions placed on data crawling by AI models. Between 2016 and 2024, the study shows that the number of critical sites (identified based on the Common Crawl web corpora) with restrictions on AI-related crawling, expressed either through the Robots Exclusion Protocol (REP), often referred to as “robots.txt”, or through Terms of Service (ToS) agreements, had gone down from 20% and 80% respectively, to near zero.
The study investigates the issue of data consent for AI and finds a few other interesting trends.
Misalignment between data and use cases
The largest domains used in AI training are news and encyclopedia websites, which are also the domains with increasing restrictions. Close to 44% of the top domains in C4, one of the Common Crawl corpora, have imposed full restrictions on the use of their data by AI models. In addition, nearly 25% of tokens from the most critical domains (those contributing the largest number of tokens to the largest web corpora) are now under some form of restriction for AI-related training and web crawling.
At the same time, the study identified creative composition as the largest use case for ChatGPT, which is a poor match for training data dominated by news and encyclopedia content. This misalignment might change the nature of data collection in the future: we should expect AI models to spend more time hoovering up our posts and blogs. Ultimately, the paper expresses concern that the loss of diversity in the data commons will create biases in the next generation of AI models.
Differentiated Permissiveness
The study also finds that not all web crawlers are being blocked to the same extent. Web crawlers like Google’s bot, whose intent is to crawl and index the web for search, are not heavily blocked. At the same time, AI-specific crawlers like those of OpenAI and Anthropic are restricted by robots.txt files on a large number of sites. In fact, OpenAI and Anthropic crawlers show the highest level of blocking, while crawlers from smaller AI players are not blocked at the same level.
A Problem of Expression
The problem of permissions is further exacerbated by differences in how the permission mechanisms work. Both the ToS (Terms of Service) and the robots.txt file express permissions. The ToS allows a site to explain in detail how its content can be used and what licences apply. The robots.txt file uses patterns and identifiers to exclude or include parts of the website for crawling purposes. The study finds that in nearly 35% of cases, the ToS and the robots.txt file are at odds with each other, creating ambiguity about whether a site is available for crawling.
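To make the "patterns and identifiers" idea concrete, here is a minimal sketch using Python's standard-library robots.txt parser. The robots.txt content, the example.com URLs and the crawler name SomeOtherBot are all invented for illustration and do not come from the study; GPTBot and Googlebot are used only as familiar crawler identifiers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt (an assumption for this example, not a real site):
# the search crawler may fetch everything, one AI crawler is blocked entirely,
# and every other crawler is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory, without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask the same question for each crawler identifier and each path.
for agent in ("Googlebot", "GPTBot", "SomeOtherBot"):
    for url in ("https://example.com/news/story", "https://example.com/private/x"):
        print(f"{agent:12} {url:32} allowed={parser.can_fetch(agent, url)}")
```

Note that nothing in this file can say "search indexing is fine, AI training is not" for the same crawler, or attach a licence to the content. That kind of nuance can only live in the ToS, which is one reason the two can end up contradicting each other.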
Originally designed for web crawlers, the robots.txt file has several design issues, including that crawlers need not obey its rules unless they are explicitly identified in it. While naming every crawler may have been easy in the early days of the web, when there were only a few, it is near impossible to accurately identify all crawlers today. In addition, there is no way to enforce the robots.txt mechanism; it depends on the goodwill of the web crawlers. While some AI firms like OpenAI and Google have provided detailed instructions on preventing their bots from crawling, this is the exception and not the rule.
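The identification problem can be illustrated with a short sketch, again using Python's standard-library parser. The robots.txt text below is hypothetical, and NewAICrawler is an invented name standing in for any crawler the site owner has not heard of; GPTBot and ClaudeBot are used only as examples of named AI crawlers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks two named AI crawlers
# and has no catch-all "User-agent: *" group.
PARTIAL_ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /
"""

# Illustrative crawler names; "NewAICrawler" is made up for this example.
CRAWLERS = ["GPTBot", "ClaudeBot", "NewAICrawler"]


def allowed_crawlers(robots_txt: str, agents: list[str], url: str) -> list[str]:
    """Return the agents that this robots.txt does not forbid from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if parser.can_fetch(agent, url)]


# The unnamed crawler matches no group, so the parser treats it as allowed.
print(allowed_crawlers(PARTIAL_ROBOTS_TXT, CRAWLERS, "https://example.com/article"))
```

Even for the crawlers that are named, the check above is something a well-behaved crawler runs on itself before fetching a page. The server does not enforce it, which is the goodwill problem described above.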
What does the future hold?
A shrinking data commons is ultimately a bad thing for AI model development. It will slow progress in data scaling by reducing both the quantity and the quality of data available. Increasing restrictions on the data commons will also hurt newer AI model developers more than incumbents. We may see a two-tier market going forward: those who can afford to buy large, high-quality datasets will be able to build better models, while others may need to make do with free but lower-quality datasets.
Governments in general seem to be taking a pro-AI stance in this regard. While personal data is kept out of bounds by existing privacy laws, there are no significant limitations on scraping publicly available data for AI development, with a view to supporting the development of more powerful models.
Eventually, some kind of equilibrium is bound to emerge between AI model developers and content providers. The large players are likely to enter into agreements to use higher-quality private data, while smaller players may need to either distil knowledge from larger open-source models or generate and use synthetic data for training purposes.
Technopolitik: Unlocking India's Deep Tech Potential: The Path to R&D Competitiveness
— Avinash Shet
India’s deeptech startup ecosystem has come under a new and intense spotlight following Union Minister Piyush Goyal’s remarks about it. The ensuing conversations have drawn in entrepreneurs, investors, and policymakers alike.
During Startup Mahakumbh in Delhi early this month, the Minister of Commerce and Industry compared Indian startups with their Chinese counterparts. He emphasised that startups in India’s neighbour are venturing into deep tech to solve future problems, whereas Indian startups are focused on consumer-facing products like fast delivery, vegan ice creams, and cookies.
This remark evoked two kinds of reactions: one supporting the direction in which the minister wants to take Indian startups, and the other taking it literally and questioning the comparison with China. The latter reaction was very prominent. Entrepreneurs criticised the minister and cited statistics on how non-tech and non-deep tech startups have created jobs, boosted the Indian economy, and paid taxes. Many also criticised the comparison with Chinese startups as illogical.
While the majority reacted negatively to this remark, it also started a positive conversation. The comparison may not have been a valid way to begin, but the conversation had to start somehow. Deep tech startups in China are indeed flourishing. They are getting bigger, smarter and more advanced, and can compete with the best in the world, including their US rivals. Take, for example, electric car development. In December 2024, BYD, a Chinese company, overtook Tesla in terms of sales of pure electric cars. DeepSeek created a buzz in the AI space by building a cutting-edge AI model with a fraction of the computational resources. The same pattern can be seen in many areas, such as space technology, quantum technology, clean energy technology, semiconductors and more.
It is not that India has no deep tech startups; rather, their scale is limited. India has entrepreneurs venturing into semiconductors, space technology, quantum technology, nanomaterials and more. The journey of a deep tech startup differs from that of a tech or non-tech startup: such a startup does not use an existing technology but builds a new one from scratch. These startups need a lot of initial funding, cutting-edge infrastructure, and a long time before the product is developed, tested and introduced to the market. Given these differences, it is difficult to attract investment into ventures that combine high risk, massive capital requirements, and a long wait for an exit. The bureaucratic hurdles these startups face are also different from those that entities like Startup India are streamlining. Above all, if India wants to address its shortage of deeptech startups, the risk-taking ability of entrepreneurs has to be looked into; these founders need some cushion and support to take on such a mission.
India is known for its low investment in the research and development ecosystem. Private industry spends little on R&D compared with developed countries and other developing countries. Hence, deeptech startups are the hope for changing this scenario for a nation whose success is determined by its ability to do R&D.
What was once a subject of purely academic research, the deeptech startup ecosystem in India, has now become a nationwide discussion. With this comes a soaring hope among entrepreneurs, investors and policymakers for reforms to India’s deeptech ecosystem. A decade down the line, India's deep tech startups might compete with their US and Chinese counterparts.