MUSA Monthly Newsletter
Issue 6 | June 13, 2023
Welcome to the Mitigating Unauthorized Scraping Alliance newsletter, where we highlight topics of interest related to unauthorized data scraping. Unauthorized data scraping involves the automated collection of user data at scale that violates a platform's Terms of Service.
Featured Articles and Events
MUSA Educational Video Series
MUSA recently released an educational video series as part of its effort to build awareness around unauthorized scraping.
Check out the video series below to learn more about unauthorized scraping and how MUSA is working to solve this problem.
MUSA Hosts Webinar on Generative AI & Unauthorized Scraping
MUSA hosted a webinar that examined the benefits, risks, and challenges associated with Generative AI (GenAI) and its relationship with unauthorized scraping. The panel featured David Patariu, Attorney at Venable LLP; Daniel Gervais, Milton R. Underwood Chair in Law and Director of Intellectual Property Program at Vanderbilt University; and Brandi Geurkink, Senior Policy Fellow at Mozilla Foundation. Venable LLP partner A.J. Zottola moderated the discussion.
Industry & Scraping In the News
The War Against AI Web Scraping
This article discusses how Elon Musk and Reddit are leading a new wave of objections to scraping. The article highlights differences between historical scraping and AI scraping, noting that scraping for AI data training absorbs texts and images without compensation for website owners, who have to pay the server costs of being scraped. Elon Musk has threatened to sue Microsoft for using Twitter’s content to train AI models, while Reddit has suggested companies need to pay it for doing the same. The article suggests that bots scraping content for AI models break the social contract of the internet, under which tools like search engines would point users back to the original source.
Read more on ScienceDirect
DarkBERT: A Language Model for the Dark Side of the Internet
Recent research suggests that there are clear differences in the language used in the Dark Web compared to that of the Surface Web. As studies on the Dark Web commonly require textual analysis of the domain, language models specific to the Dark Web may provide valuable insights to researchers. In this work, the authors introduce DarkBERT, a language model pre-trained on data scraped from the Dark Web. The evaluations show that DarkBERT outperforms current language models and may serve as a valuable resource for combating cybercrime.
Read more on arXiv
Bad Bots are Coming for APIs
This article discusses the growing proportion of bot traffic and the disruptions caused by malicious automation, including scraping activity, that result in tangible business risks, including brand reputation issues, reduced online sales, and security risks. The author urges businesses to act now and invest in bot management and online prevention that can identify and stop sophisticated automation that targets APIs and application business logic.
Read more on Help Net Security
How to Get Your Personal Information Off the Internet
This article provides guidance on removing personal information from the internet. It discusses the potential risks associated with having personal information publicly available and susceptible to unauthorized scraping and offers practical steps to protect privacy. The article suggests conducting a thorough search for personal information online, contacting websites and online platforms to request removal, utilizing privacy tools and services, and being cautious about sharing personal information in the future. It emphasizes the importance of being proactive in safeguarding personal data and maintaining control over online presence and reputation.
Read more on HG Legal Resources
Bad Bots! Bad Bots! What Can You Do When They Come For You?
This article discusses malicious automated bots that facilitate illicit activities and provides suggestions on how to mitigate their impact. It highlights the growing threat of bad bots that engage in data scraping, account takeover, and fraud. The article suggests implementing various measures to protect against malicious bots, including deploying bot management solutions, employing CAPTCHA tests, monitoring network traffic for unusual patterns, and implementing strict access controls. The article emphasizes the importance of proactive bot mitigation strategies to safeguard online platforms, protect user data, and maintain a positive user experience.
Read more on TechWire
AI Machines Aren’t ‘Hallucinating’ but Their Makers Are
The article explores the concerns surrounding artificial intelligence (AI) and deepfakes. The author highlights the increasing autonomy of AI systems in generating deepfakes and warns about the challenges of distinguishing between authentic and manipulated content. The author emphasizes the importance of public engagement and democratic decision-making to ensure responsible AI development that aligns with societal values and safeguards against potential harms, citing relevant legal and regulatory precedents that can be enforced for illegitimately appropriated and scraped data, as in the cases of Cambridge Analytica and Everalbum.
Read more on The Guardian
Twitter Just Closed the Book on Academic Research
This article discusses the concerns raised by scientists and researchers regarding Twitter's updated API policy and its potential impact on academic research. The article highlights that Twitter's new policy restricts automated data collection and imposes limits on the sharing of collected data, which could hinder the ability of researchers to study public discourse and analyze trends on the platform. Scientists argue that such limitations impede their ability to understand social phenomena, develop insights, and contribute to public knowledge. The article emphasizes the need for a balance between user privacy and enabling valuable research, with calls for increased transparency and collaboration between social media platforms and the academic community.
Read more on The Verge
Anonymous Intelligence Company Announces “Turminal.ai” a Revolutionary Privacy Protected AI Dashboard
This article announces the launch of "Turminal AI," a privacy-protected AI dashboard, by Anonymous Intelligence Company. The dashboard is described as a revolutionary solution that enables users to harness the power of AI while preserving data privacy. It utilizes advanced algorithms and encryption techniques to ensure that scraped user data remains secure and anonymous. The Turminal AI dashboard is said to offer a range of features, including data analysis, predictive insights, and personalized recommendations. The press release highlights the potential of this platform in various industries, such as healthcare, finance, and marketing.
Read more on GlobeNewswire
A Gateway Threat: How To Stop Scraper Bots In Their Tracks
This article discusses the issue of scraper bots and offers strategies to mitigate their impact. The article defines scraper bots as automated programs that extract data from websites for various purposes, including content theft and competitive intelligence. It highlights the negative consequences of scraper bots, such as compromised user experience, stolen intellectual property, and increased server loads. The article suggests several methods to counter scraper bots, including implementing CAPTCHA tests, utilizing web application firewalls, monitoring network traffic, and employing bot management solutions. It emphasizes the importance of understanding the threat posed by scraper bots and implementing proactive measures to protect websites and valuable data from unauthorized scraping activities.
Read more on Forbes
Scribd Is Not a Fan of AI Scraping their Service for Data
This article discusses Scribd's stance against AI scraping their platform for data. Scribd, an ebook and audiobook subscription service, has expressed concerns about automated bots extracting content from their platform without permission. The article highlights the challenges that Scribd faces in protecting their content and ensuring a fair experience for their subscribers. It mentions Scribd's efforts to detect and block scraping bots, as well as their consideration of legal actions against those engaging in unauthorized data scraping. The article also emphasizes the importance of respecting intellectual property rights and the need for platforms like Scribd to take measures to prevent AI scraping activities that undermine their services and revenue streams.
Read more on Good E-reader
ByteDance Is China's 'Propaganda Tool,' Ex-Employee Says
This article reports on claims made by a former employee regarding ByteDance, the Chinese technology company behind TikTok. The ex-employee alleges that ByteDance operates as a propaganda tool for the Chinese government. The article mentions the employee's assertions that ByteDance scraped content from Instagram and Snapchat and posted the content on its own websites via fake accounts to boost popularity. The former employee also alleges that ByteDance is involved in scraping, surveillance, and data collection on behalf of the Chinese government. The article highlights the concerns raised regarding ByteDance's potential role in spreading propaganda and influencing public opinion through its popular platforms.
Read more on Law360
Bots Now Make Up Nearly Half of All Internet Traffic, and That's Very Bad News for Our Security
This article reveals that bots now constitute nearly half of all internet traffic, presenting a significant security concern. The article emphasizes the potential dangers associated with the rise in bot-driven activities, such as unauthorized scraping, cyberattacks, data breaches, and misinformation campaigns. It discusses the various types of bots, including malicious bots and those used for legitimate purposes, but notes that distinguishing between them can be challenging. The article highlights the need for robust security measures, including advanced bot detection and mitigation techniques, to protect online platforms and user data and to maintain a secure internet environment. It underscores the importance of proactive measures to address the growing threat posed by bots and safeguard against potential security risks.
Read more on TechRadar
Firms' Sites Were Scraped to Train AI Models. Legal Isn't Concerned...for Now
While web scraping of publicly available data for LLM training has received pushback from artists and content creators, the reaction from the legal community regarding scraping of firm data has been “less negative”, with some even seeing it as an opportunity to “broaden their marketing reach”. However, the lack of attribution indicates that opportunities for brand profiling are limited, and there are many other questions about GenAI that take precedence for legal professionals.
Read more on Law.com
Bright Data Accused of Scraping Minors’ Data from Instagram
This article describes a proposed class-action suit in Israel accusing data collection company Bright Data of selling personal information about minors pulled from Facebook and Instagram, in violation of privacy laws.
Read more on Bloomberg Law
Legislation, Regulation, & Court Cases In the News
US Tells Supreme Court to Turn Down Google Lyric Scraping Case
The Biden administration has said the US Supreme Court shouldn’t take up a case involving allegations that Google LLC illegally scraped millions of lyrics from the song annotation website Genius and posted them at the top of search results pages. In a brief filed last month, the US solicitor general indicated that the case would be a “poor vehicle for clarifying” breach-of-contract claims and copyright law regarding which there is disagreement among the courts of appeals. The implications of this case could undermine the business models of companies that aggregate user content and information and rely on terms of service agreements to prevent that “content from being posted elsewhere even when the companies don’t have a copyright.”
Read more on Bloomberg Law
EU AI Act Draft Approved
A new draft of the EU’s AI Act has been approved by parliamentary committees and includes new prohibitions on “intrusive and discriminatory uses of AI systems” including biometric surveillance through “indiscriminate scraping of biometric data from social media or CCTV footage” and predictive policing algorithms. Once enacted, this legislation will likely have major implications for countries around the world and is expected to set the standard for global regulation.
Read more on The Verge
France’s Privacy Watchdog Eyes Protection Against Data Scraping in AI Action Plan
The French privacy regulator, the National Commission on Informatics and Liberty (CNIL), has published an action plan for AI which gives a snapshot of where it will be focusing its attention in the coming months. CNIL has indicated that it is paying special attention to “the protection of publicly available data on the web against the use of scraping…of data for the design of tools”. Other questions of interest include fairness and transparency of data processing and protection of data transmitted by users when they use AI tools.
Read more on TechCrunch
Unique Issues To Look Out For In Generative AI Transactions
AI technology transactions raise questions around models that are trained on data scraped from the internet. Questions around the permissibility of web-scraping and fair use qualification remain unresolved. The lack of clear legal guidance means that model providers may be “reluctant to indemnify the model customer for claims arising from such data”, meaning providers will likely bear “most of the risk arising from use of such data for initial training” and activities that could give rise to claims.
Read more on Law360
The AI-generated Picture Becomes Clearer – Key Legal Considerations Emerging for Generative AI Developers and their Customers
This article covers best practices for AI corporate governance solutions, global legislative and regulatory developments, and developer considerations for compliance in Australia. With regard to scraping, the guidance suggests that developers should ensure that data collection practices for LLM training conform with the requirements under Australian state and federal privacy legislation, particularly around personal information. Australia's privacy regulator has previously penalized AI companies for unauthorized data scraping practices. LLM developers should also invest in tools that can identify copyrighted materials and seek clearance to “avoid or mitigate the risk of third-party copyright infringement cases”.
Read more on Allens
Seize the Data! : Legal and Regulatory Issues for Artificial Intelligence (AI) Training Data
This article discusses some of the legal and regulatory issues that arise when AI models are trained on scraped data. Scraping copyrighted work without the owner's permission can lead to intellectual property rights (IPR) infringement. Scraping can also “attract liability under breach of contract” if website content is protected by terms of use agreements. Additionally, particular consideration should be given to data protection legislation with regard to scraping of personal information.
Read more on Laytons ETL
The Mitigating Unauthorized Scraping Alliance (MUSA) brings together leading companies committed to protecting data from unauthorized scraping and misuse. In collaboration with industry members, policymakers, and the public, MUSA is generating a global dialogue around unauthorized data scraping focused on protecting user data through education, advocacy, public-private partnerships, and the sharing of reasonable practices to mitigate unauthorized scraping.