Methods to detect Critical Raw Materials in patents

patstat
data-science
Published

October 13, 2024

For an ongoing research project on Critical Raw Materials (CRM) and green technologies, we were trying to identify inventions that use CRM. Our first idea was to search for mentions of CRM in patent abstracts, but someone suggested that the Cooperative Patent Classification could be used to select inventions that use CRM.

The CPC (Cooperative Patent Classification) is a joint effort between the European Patent Office and the US Patent and Trademark Office to produce a detailed technological classification. It is a fine-grained hierarchical classification system with more than 250,000 categories.

We compare the two methods to decide which one is more comprehensive.

Identifying CRM in CPC

An advantage of using a well-structured classification is that word variations are not a concern: CPC titles use only the official name of each CRM, with no misspellings or erroneous forms.

First, we download the list of all CPC titles from the CPC website. We then search that list for each CRM with the following command in a Linux terminal (CRM.list.txt contains the list of CRM):

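# Prefix each matching CPC title line with the CRM name and a tab.
# Expects one CRM name per line in CRM.list.txt; \t in the replacement needs GNU sed.
while IFS= read -r crm; do
  for f in cpc-section*; do
    grep -w -- "$crm" "$f" | sed -e "s/^/$crm\t/" >> res_crm_search2.csv
  done
done < CRM.list.txt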
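The whole-word matching done by grep -w can also be sketched in R, which makes it easier to see why a classification title such as "cobaltite" is not counted as a mention of "cobalt". This is a minimal illustration, not the code used for the study: the table cpc.titles, its symbols and titles, and the helper search_crm are all made up for the example, and CRM names are assumed to contain no regular-expression metacharacters.

```r
library(data.table)

# Toy stand-ins for the CRM list and the CPC title files (made-up values).
crm.list <- c("cobalt", "lithium")
cpc.titles <- data.table(
  cpc_class_symbol = c("SYM1", "SYM2", "SYM3"),
  Level            = c(7L, 8L, 8L),
  Title.last       = c("containing cobalt", "cobaltite ores", "lithium or sodium salts")
)

# Hypothetical helper: return the titles that mention `crm` as a whole word.
# \\b is a word boundary, so "cobalt" does not match inside "cobaltite".
search_crm <- function(crm, titles) {
  pattern <- paste0("\\b", crm, "\\b")
  hits <- titles[grepl(pattern, Title.last, perl = TRUE)]
  if (nrow(hits) == 0L) return(NULL)   # rbindlist() skips NULL entries
  data.table(CRM = crm, hits)          # CRM name first, as in the shell output
}

res.toy <- rbindlist(lapply(crm.list, search_crm, titles = cpc.titles))
print(res.toy)
```

Like grep -w without the -i flag, this match is case-sensitive, so a title that capitalises the material name would be missed unless ignore.case = TRUE is passed to grepl.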

Then we retrieve the appln_id of every patent application classified under these CPC codes, using the PATSTAT table tls224, which links each appln_id to its cpc_class_symbol.

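library(data.table)

# CRM matches found in the CPC titles (the search output has no header row)
res.search <- fread("res_crm_search2.csv", header = FALSE)
names(res.search) <- c("CRM","cpc_class_symbol","Level","Title.last")

# PATSTAT table tls224, loaded from its two parts
tls224 <- fread("~/Downloads/PATSTAT/tls224_part01.csv")
tls224p2 <- fread("~/Downloads/PATSTAT/tls224_part02.csv")
tls224 <- rbindlist(list(tls224,tls224p2))
nrow(tls224)  # sanity check on the number of rows loaded

# Keep the applications classified under a CRM-related CPC code.
# This assumes both tables write CPC symbols in the same format.
res.appln_ids <- merge(res.search[,c("CRM","cpc_class_symbol")], tls224, by = "cpc_class_symbol")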

We add the inpadoc_family_id of each application to this dataset to create res.inpa. We then merge it with the list we already have of patent families detected through mentions in patent abstracts (inpa.tm contains pairs of inpadoc_family_id and detected CRM).

# Full outer join on the family id: keeps the families found by either method
res <- unique(merge(inpa.tm, res.inpa[,c("inpadoc_family_id","CRM.cpc")],
                    by = "inpadoc_family_id", all = TRUE))
# Families missed by one of the two methods are labelled explicitly
res[is.na(res)] <- "Not found"
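The construction of res.inpa is not shown above. Below is a minimal sketch of one way it could be built and combined with the text-mining results, assuming the PATSTAT table tls201 (which carries appln_id and inpadoc_family_id) is available as a data.table. All values are toy data, and the objects suffixed .toy are hypothetical stand-ins for the real tables.

```r
library(data.table)

# Toy data (made up): CPC-based matches per application, shaped like res.appln_ids
res.appln_ids.toy <- data.table(
  CRM              = c("cobalt", "cobalt", "lithium"),
  cpc_class_symbol = c("SYM1", "SYM1", "SYM3"),
  appln_id         = c(1L, 2L, 3L)
)
# Toy extract of tls201: application -> INPADOC family
tls201.toy <- data.table(
  appln_id          = 1:4,
  inpadoc_family_id = c(10L, 10L, 20L, 30L)
)
# Toy text-mining result: family -> CRM mentioned in the abstract
inpa.tm.toy <- data.table(
  inpadoc_family_id = c(10L, 30L),
  CRM.tm            = c("cobalt", "lithium")
)

# Add the family id, then keep one row per (family, CRM) pair
res.inpa.toy <- merge(res.appln_ids.toy, tls201.toy, by = "appln_id")
res.inpa.toy <- unique(res.inpa.toy[, .(inpadoc_family_id, CRM.cpc = CRM)])

# Full outer join: families found by the text search, by CPC, or by both
res.toy <- merge(inpa.tm.toy, res.inpa.toy, by = "inpadoc_family_id", all = TRUE)
res.toy[is.na(CRM.tm),  CRM.tm  := "Not found"]
res.toy[is.na(CRM.cpc), CRM.cpc := "Not found"]
print(res.toy)
```

In this toy case, family 10 is found by both methods, family 20 only through CPC, and family 30 only through its abstract. Filling the two CRM columns by name, rather than every NA in the table, leaves the numeric family id column untouched.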

To compare coverage between green and non-green technologies, we create two separate Sankey diagrams with the following code:

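# Preparing links data: number of patent families per (text-mining CRM, CPC CRM) pair
# NOTE: CRM.cpc2 is not created in the code shown here; it is assumed to be a
# CPC-side CRM column derived from CRM.cpc
dt.net <- res[,.(Nfam = uniqueN(inpadoc_family_id)), by=list(CRM.tm,CRM.cpc2)]
# Suffix the labels so that the same CRM gets a distinct node on each side
dt.net$CRM.tm <- paste0(dt.net$CRM.tm, " (tm)")
dt.net$CRM.cpc2 <- paste0(dt.net$CRM.cpc2, " (cpc)")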

# Preparing nodes data: one node per distinct label
dt.nodes <- data.frame(
  name = unique(c(dt.net$CRM.tm,dt.net$CRM.cpc2))
)
dt.nodes$id <- 1:nrow(dt.nodes)
# Replace the labels with node ids to get (source, target, value) links
dt.links <- merge(dt.net,dt.nodes, by.x=c("CRM.tm"), by.y=c("name"))
dt.links <- merge(dt.links,dt.nodes, by.x=c("CRM.cpc2"), by.y=c("name"))
dt.links <- dt.links[,c("id.x","id.y","Nfam")]
names(dt.links) <- c("source","target","value")

# networkD3 expects zero-based node indices
dt.links$source <- dt.links$source - 1
dt.links$target <- dt.links$target - 1

# Sankey diagram
library(networkD3)
sankeyNetwork(Links = dt.links, Nodes = dt.nodes, Source = "source",
              Target = "target", Value = "value", NodeID = "name",
              units = "Nfam", fontSize = 12, nodeWidth = 30)
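sankeyNetwork reads Source and Target as zero-based row positions in Nodes, which is why 1 is subtracted from the ids above. The same preparation can be sketched more compactly with match(), together with a check that every link points at an existing node. The counts and the objects suffixed .toy are made up for the illustration.

```r
library(data.table)

# Made-up link counts, with the same shape as dt.net
dt.net.toy <- data.table(
  CRM.tm   = c("cobalt (tm)", "lithium (tm)", "Not found (tm)"),
  CRM.cpc2 = c("cobalt (cpc)", "Not found (cpc)", "lithium (cpc)"),
  Nfam     = c(5L, 3L, 1L)
)

# One node per distinct label; the row order defines the node index
nodes.toy <- data.frame(name = unique(c(dt.net.toy$CRM.tm, dt.net.toy$CRM.cpc2)))

# match() gives 1-based positions in nodes.toy$name; subtract 1 for networkD3
links.toy <- data.frame(
  source = match(dt.net.toy$CRM.tm,   nodes.toy$name) - 1L,
  target = match(dt.net.toy$CRM.cpc2, nodes.toy$name) - 1L,
  value  = dt.net.toy$Nfam
)

# Every index must refer to an existing row of nodes.toy
valid.ids <- seq_len(nrow(nodes.toy)) - 1L
stopifnot(all(c(links.toy$source, links.toy$target) %in% valid.ids))
print(links.toy)
```

Because the "(tm)" and "(cpc)" suffixes keep the two sides apart, the text-search nodes take the first indices and the CPC nodes the following ones, so no link can connect a node to itself.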

Results

In both diagrams, the left-hand side shows the number of patent families per CRM detected by the text search in abstracts, and the right-hand side shows the number detected through CPC titles.

For both sets of technologies (green and all patents), the CPC method clearly under-represents the presence of CRM in patents compared with the text search.
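The visual comparison can be complemented by a count of families per detection method. The sketch below assumes a table shaped like res, with the columns inpadoc_family_id, CRM.tm and CRM.cpc, where "Not found" marks a miss. The values are toy data chosen for illustration and are not results of this study.

```r
library(data.table)

# Toy stand-in for res (made-up values)
res.toy <- data.table(
  inpadoc_family_id = c(10L, 20L, 30L, 40L),
  CRM.tm  = c("cobalt", "Not found", "lithium", "cobalt"),
  CRM.cpc = c("cobalt", "lithium", "Not found", "Not found")
)

# For each family, record which method(s) detected at least one CRM
fam <- res.toy[, .(
  found.tm  = any(CRM.tm  != "Not found"),
  found.cpc = any(CRM.cpc != "Not found")
), by = inpadoc_family_id]

fam[, method := fifelse(found.tm & found.cpc, "both",
                fifelse(found.tm, "text only", "CPC only"))]

# Number of families per detection outcome
coverage <- fam[, .(Nfam = .N), by = method]
print(coverage)
```

A large "text only" count relative to "both" and "CPC only" would be the numeric counterpart of the imbalance visible in the diagrams.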