MUSA Monthly Newsletter

Issue 5 | May 9, 2023

Welcome to the Mitigating Unauthorized Scraping Alliance newsletter, where we highlight topics of interest related to unauthorized data scraping. Unauthorized data scraping involves the automated collection of user data at scale that violates a platform's Terms of Service.

Join our Mailing List
Join our Industry Meetings

Featured Articles and Events

Webinar: The Rise of Generative AI and Unauthorized Scraping - Exploring the Ethical and Legal Considerations in the Age of Big Data

Join MUSA for a webinar exploring whether the growing demand for datasets to train generative AI models has contributed to the rise of unauthorized scraping, and the privacy risks and transparency questions that follow. Experts from civil society, academia, and industry will come together to discuss the benefits, risks, and challenges of generative AI and its relationship to unauthorized scraping.


Date and time: Thursday, May 18th 

9:00 AM PST / 12:00 PM EST (45-minute talk + Q&A)

Read more and register here

How to Safeguard Valuable Data from Malicious Data Scraping

Check out PwC’s Cyber & Privacy Innovation Institute’s recent thought leadership article on safeguarding data from malicious scraping. The report includes a list of practices for detecting and preventing unauthorized data scraping and highlights MUSA as a resource for data scraping mitigation practices and an emerging leader in building industry collaboration around the issue. PwC has also released an industry data protection and scraping prevention self-assessment tool, which companies can use to evaluate how their capabilities in data protection and in the detection and prevention of unwanted data scraping compare to those of industry peers.

Read full report on PwC

Industry & Scraping In the News

Chatbots Are Digesting the Internet. The Internet Wants to Get Paid.

Tech and media companies are beginning to ask artificial-intelligence companies for compensation for data scraped from their platforms and used to train language-based AI models. In addition to questions around ownership of information, experts have raised concerns about personal information being included in AI datasets and about the difficulty of verifying chatbot outputs. Further complicating the issue, AI algorithms cannot themselves be held accountable for their actions, suggesting new legislation and regulation may be on the horizon.

Read more on the WSJ 

HUMAN Releases 2023 Enterprise Bot Fraud Benchmark Report: An Inside Look at Bot Attack and Fraud Trends Impacting Enterprise Organizations Online

Human Security, Inc. released its 2023 Enterprise Bot Fraud Benchmark Report last month. The annual report provides insights into automated attack trends across enterprise use cases, including account takeover, brute forcing, carding, credential stuffing, inventory hoarding, scalping, and web scraping. The report highlights a rise in automated attacks, including scraping, and emphasizes the need for a comprehensive, collaborative approach to taking proactive measures against attackers.

Read more on Human Security 

An AI Scraping Tool Is Overwhelming Websites With Traffic

The recent popularity of AI tools raises questions about consent and ownership. The creator of img2dataset, a free tool that builds image datasets by scraping the web, has indicated that website owners must actively opt out if they want to prevent their sites from being scraped. Website owners have been overwhelmed by the increased traffic and costs stemming from bots and maintain that “datasets built on non-consensually obtained data” present risks to both the owners and the users of models trained on them.
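For site operators who want to send that opt-out signal, here is a minimal Python sketch of one possible approach. It assumes, based on img2dataset's public documentation, that the tool honors X-Robots-Tag response headers such as noai and noimageai; the Flask app and route are purely illustrative, so check the tool's current documentation before relying on this.

from flask import Flask

app = Flask(__name__)

@app.after_request
def add_noai_headers(response):
    # Assumed opt-out signal: ask dataset builders that honor these headers
    # to skip this site's content.
    response.headers["X-Robots-Tag"] = "noai, noimageai"
    return response

@app.route("/")
def index():
    return "Example page"

if __name__ == "__main__":
    app.run()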

Read more on Vice 

Major Record Label Orders Streaming Services to Stop AI Data Scraping

Universal Music Group told music streaming services, including Spotify and Apple, to block AI companies from scraping its catalogs for data, a move that could have major ramifications for AI software. UMG noted that there is a “moral and commercial responsibility” to “prevent the unauthorized use of [artists’] music” and to stop platforms from ingesting content in ways that violate the rights of artists and other creators.

Read more on Washington Examiner

Using Big Data to Reduce Leaks

This article explores the application of big data analysis in preventing and mitigating the risk of sensitive information leaks, especially in government organizations. It emphasizes the significance of monitoring and analyzing user behavior, access, and data usage patterns to identify any unusual or suspicious activity. Through the use of predictive analytics and machine learning algorithms, organizations can take proactive measures to detect and prevent potential leaks. The article also underscores the importance of robust data security policies, employee training, and continuous monitoring to minimize the risk of data breaches and leaks.
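The article describes the approach only in general terms; as a purely hypothetical illustration of the underlying idea, the Python sketch below flags unusual data-access behavior with an off-the-shelf anomaly detector. The features and numbers are invented for the example and are not drawn from the article.

import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical per-user, per-day features: records accessed, distinct
# documents downloaded, and after-hours logins.
baseline = np.array([
    [120, 4, 0],
    [95, 3, 1],
    [150, 6, 0],
    [110, 5, 0],
    [130, 4, 1],
])

# Learn what typical behavior looks like.
model = IsolationForest(contamination=0.1, random_state=0).fit(baseline)

# Score a day with a bulk download and repeated after-hours access.
today = np.array([[5000, 80, 6]])
print(model.predict(today))  # IsolationForest returns -1 for points it treats as anomalous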

Read more on CSIS 

POV: Big Tech has a glaring double standard when it comes to web scraping

Big tech companies penalize other firms for scraping data from their websites, while they themselves engage in similar practices using the vast amounts of user data at their disposal for their own commercial gain. Meanwhile, firms attempting to access publicly available data for legitimate research or analysis face legal action and penalties. This article advocates for equal and consistent regulations governing web scraping and data privacy, irrespective of a company's size or market power.

Read more on Fast Company

Turkey: Web Scraping And Protection Of Websites

This article discusses web scraping and the importance of protecting websites from unauthorized data scraping. It highlights the legal implications of web scraping and the potential copyright violations that can occur when third parties scrape data from a website without permission. The article also emphasizes the need for website owners to take proactive measures to safeguard their sites, including technical solutions such as bot management and firewalls to prevent unauthorized access. Finally, it underscores the importance of adhering to copyright laws and obtaining proper permissions when scraping data from websites for legitimate purposes.
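On that last point, the minimal Python sketch below shows one way a scraper operating with permission in mind can check a site's published crawling rules before collecting anything; the URL and user-agent string are placeholders, and robots.txt is only one of several signals (terms of service, copyright, explicit licenses) that matter.

from urllib import robotparser

# Placeholder site and crawler name, used purely for illustration.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("ExampleResearchBot", "https://example.com/articles/"):
    print("robots.txt permits crawling this path")
else:
    print("robots.txt disallows crawling this path; seek permission first")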

Read more on Mondaq 

VPN vs. Proxy: What's the Difference?

This article discusses the differences between VPNs and proxies, two tools that can be used to mask your online identity and activity. The article explains that VPNs are more comprehensive and secure, as they encrypt all of your online traffic and route it through a remote server, while proxies only mask your IP address and are more susceptible to security vulnerabilities. However, proxies may be a more cost-effective option for basic tasks such as anonymous browsing or simple data scraping. Overall, the article provides a useful comparison of the two tools and highlights the importance of understanding their differences when choosing the appropriate one for your needs.

Read more on PCMag

Podcast Episode: Generative AI and Copyright, When AI Hits the Music Business, The Social Media that Comes After Twitter

This podcast episode highlights current copyright and fair use debates regarding text-to-image generation and unauthorized content scraping, the stakes for incumbents and upstarts, and AI implications for the music business. 

Listen more on Sharp Tech with Ben Thompson

Unpicking the Rules Shaping Generative AI

Generative AI’s growth has resulted in varied global legal responses regarding the use of personal data, posing a challenge to the makers of generative AI services to ensure they remain in compliance. Notably, Italy’s data protection authority ordered OpenAI to increase transparency and data access controls and to protect minors’ data when processing Italian users’ data through ChatGPT, citing a breach of the GDPR. Beyond the EU, Canada’s privacy watchdog also recently announced a probe of ChatGPT.

Read more on TechCrunch

AI - Clari CEO Says Niche is Good When it Comes to Enterprise GPT

Andy Byrne, CEO of revenue software specialist Clari, highlights the benefits of GPT-powered tools like RevGPT that are trained on company-owned, trusted data rather than datasets scraped from the web. Models that depend on accuracy can generate far more powerful predictions and suggestions when trained this way. Byrne suggests that for many enterprise generative AI use cases, the data and workflows matter much more than the algorithms themselves. He also sees this as a good opportunity for government policy to create guardrails around the privacy and security risks of this growing technology.

Read more on Diginomica

Bloomberg to Launch AI Model Powered by OpenAI's GPT to Aid Financial Insights, Automation

Bloomberg LP is preparing to launch an innovative AI model to perform finance-specific tasks. Bloomberg’s GPT promises to perform better and provide more accurate and reliable financial insights than other language models because it has been trained on terabytes of financial documents and data the company has collected and scraped over time.

Read more on Tech Times

Stable Diffusion and DALL-E Display Bias When Prompted for Artwork of 'African Workers' Versus 'European Workers'

AI technology has raised concerns regarding intellectual property, bias, and disinformation. Some AI models like Stable Diffusion, which is trained on LAION-5B, a large open-source dataset of images scraped from the web, have been shown to produce images that reflect harmful stereotypes. Similarly, Stability AI uses a system called CLIP to help it generate images, which has been found to encode gender and racial bias. Experts have suggested that model developers should collect better training data; however, manual data collection at scale is both more time-consuming and more expensive than the efficient web scraping alternative.

Read more on Business Insider

Access to Social Media Data for Public Interest Research: Lessons Learnt & Recommendations for Strengthening Initiatives in the EU and Beyond

This paper, produced as part of the Digital Policy Lab project funded by the German Federal Foreign Office, examines access to social media data for public-interest research, reviews lessons from industry-academia partnerships, and provides targeted recommendations. Notably, the paper examines the ethical and legal problems raised by data collection efforts that circumvent access barriers, such as unauthorized scraping, as well as the implications of those barriers for research quality.

Read more on ISD

AI Should Pay for News Content: Rod Sims

The former chairman of the Australian Competition and Consumer Commission has stated that artificial intelligence models such as ChatGPT should be forced to pay for access to content. The scraping of news publications raises copyright flags under Australia’s News Media Bargaining Code, a law designed to make large technology platforms operating in Australia pay local news publishers for the news content made available or linked on their platforms. Many publishers around the world are already exploring ways to negotiate with AI firms as questions around compensation and access continue to grow.

Read more on Australian Financial Review

The Guardian Agrees Deal with Illuma to Categorise Article Pages and Protect Intellectual Property

The Guardian has partnered with Illuma, a British tech company, to categorize article pages for contextual advertising while protecting its intellectual property rights. The partnership removes the incentive for third-party companies to scrape text and data from the site without authorization, a practice that can lead to miscategorization, hurt the user experience, and result in lost revenue. The agreement will grant direct access to The Guardian’s content API, paving the way for legal licensing agreements and deterring unauthorized scraping.
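As a rough illustration of what licensed, API-based access can look like compared with scraping, the Python sketch below queries the Guardian's publicly documented Open Platform search endpoint using its developer "test" key; the Illuma arrangement described above is a private commercial agreement and may use an entirely different interface.

import requests

# Publicly documented developer endpoint; the "test" key is rate-limited and
# meant only for exploration.
resp = requests.get(
    "https://content.guardianapis.com/search",
    params={"q": "artificial intelligence", "api-key": "test"},
    timeout=10,
)
resp.raise_for_status()

for item in resp.json()["response"]["results"]:
    print(item["sectionName"], "-", item["webTitle"])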

Read more on The Guardian

Inside the Secret List of Websites that Make AI like ChatGPT Sound Smart

AI chatbots mimic human speech based on large amounts of data scraped from the internet. The Washington Post and the Allen Institute analyzed Google’s C4 dataset, which is used to train large language models. They found that the dataset drew on industries like journalism, entertainment, software, and medicine, but also included content from websites like Stormfront and 4chan as well as voter registration databases. The analysis raises legal questions, as the dataset included copyrighted material and potentially identifiable personal information, which companies may avoid documenting due to concerns about privacy and copyright infringement.

Read more on the Washington Post 


Reddit Will Charge Companies and Organizations to Access its Data—and the CEO is Blaming A.I.

Reddit is introducing a paid access tier for large companies that use its data. Reddit data has been used to train large language models to generate “natural-sounding” answers, raising concerns around authorization since user content is scraped without permission.

Read more on Fortune

Stack Overflow Will Charge AI Giants for Training Data

The programmer Q&A site Stack Overflow plans to begin charging large AI developers for use of its data to train AI algorithms and ChatGPT-style bots. This follows Reddit’s announcement that it will begin charging some AI developers to access its own content, as well as new principles from the News/Media Alliance around negotiation over any use of members’ data for training and other purposes, citing the need for fair compensation for contributed content.

Read more on Wired

EU's AI Legislation Aims to Protect Businesses from IP Theft

A new draft of EU artificial intelligence (AI) legislation could better protect business intellectual property from being scraped by AI firms, with developers facing new transparency obligations on copyrighted content. The aim is to prevent unauthorized uses of data, and the bill would give companies legal grounds to establish the degree to which the AI firms they work with are using ethically sourced, non-copyrighted data.

Read more on ITPro

Legislation, Regulation, & Court Cases In the News

The Legal Landscape of Web Scraping

This article delves into the legal implications of web scraping and provides an overview of the current legal landscape surrounding the practice. It discusses the legal challenges associated with web scraping, including copyright infringement and violation of website terms of use. The article also highlights the importance of obtaining consent and complying with applicable laws when engaging in web scraping activities. Finally, the article provides recommendations for companies seeking to engage in web scraping while minimizing legal risks, such as seeking legal advice, obtaining explicit consent, and implementing robust data security measures.

Read more on Quinn Emanuel

Ryanair Takes Loss on Booking.com Counterclaim in Scraping Suit

In 2020, Ryanair sued Booking.com and its affiliates for illegally scraping data from the airline's website to sell Ryanair flights at higher fares on its own site. A federal judge has now ruled that Ryanair must face allegations of making defamatory statements as part of its efforts to discourage customers from using Booking.com. The judge also declined to drop most of Booking.com’s counterclaims against Ryanair, which include claims of revenue loss due to customers being warned about "unauthorized" third-party travel sites. The case raises complex issues around web scraping, copyright infringement, and defamation.

Read more on Bloomberg Law

Intellectual Property Legal Issues Impacting Artificial Intelligence

This article discusses key intellectual property legal issues that companies should consider when developing artificial intelligence (AI) technology. It notes that AI technology can raise complex questions around ownership, licensing, and infringement of intellectual property rights, such as patents and copyrights. The article provides guidance on how companies can navigate these issues, including developing comprehensive IP policies, conducting IP due diligence, and engaging in proactive risk management. Additionally, the article highlights the importance of staying up-to-date with the rapidly evolving legal landscape surrounding AI technology and intellectual property law.

Read more on JD Supra

Generative AI and Intellectual ‎Property: Whether the Wild West or the Matrix, It ‎is the ‎‎‎(Latest) New Frontier‎

This article discusses how the emerging technology of generative AI raises challenges to traditional notions of intellectual property law. This technology can create copyright issues and can complicate determining ownership and liability for generated content. The article highlights the need for updated intellectual property laws and proactive measures, such as developing ethical guidelines and obtaining explicit consent from users, to ensure responsible use of generative AI.

Read more on JD Supra

Generative AI: can intellectual property infringements in training data be avoided?

Unsourced training data could create copyright or database right infringement risks in the EU and UK. Users and developers of AI systems in the UK have been exploring database right exceptions, copyright exceptions, text and data mining exceptions, and temporary copy exceptions that may apply to training AI on data, though most are narrow and unlikely to apply in a commercial context.

Read more on Lexology

About MUSA

The Mitigating Unauthorized Scraping Alliance (MUSA) brings together leading companies committed to protecting data from unauthorized scraping and misuse. In collaboration with industry members, policymakers, and the public, MUSA is generating a global dialogue around unauthorized data scraping focused on protecting user data through education, advocacy, public-private partnerships, and the sharing of reasonable practices to mitigate unauthorized scraping.

Connect with us:

LinkedIn  Web  Email  Twitter