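library(data.table)

# CPC titles that mention a CRM (output of the grep search)
res.search <- fread("res_crm_search2.csv")
names(res.search) <- c("CRM","cpc_class_symbol","Level","Title.last")

# PATSTAT table tls224 (CPC codes by application), delivered in two parts
tls224 <- fread("~/Downloads/PATSTAT/tls224_part01.csv")
tls224p2 <- fread("~/Downloads/PATSTAT/tls224_part02.csv")
tls224 <- rbindlist(list(tls224,tls224p2))
nrow(tls224)

# Applications classified under a CPC code whose title mentions a CRM
res.appln_ids <- merge(res.search[,c("CRM","cpc_class_symbol")],tls224)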
Methods to detect Critical Raw Materials in patents
For ongoing research on Critical Raw Materials (CRM) and green technologies, we were trying to identify inventions that use CRM. Our first idea was to search for mentions of CRM in patent abstracts, but someone suggested that the Cooperative Patent Classification could be useful to select inventions that use CRM.
The CPC (Cooperative Patent Classification) is a joint effort between the European Patent Office and the US Patent and Trademark Office to produce a detailed technological classification. It is a fine-grained hierarchical classification system with more than 250,000 categories.
We compare the two methodologies to decide which one is more comprehensive.
Identifying CRM in CPC
An advantage of using a well-structured classification is that we do not need to worry about word variations: CPC includes only the official name of each CRM, without misspelled or erroneous words.
First, we download the list of all CPC titles from the CPC website, and then we search for each CRM in that list with the following command in a Linux terminal (CRM.list.txt contains the list of CRM):
for crm in $(cat CRM.list.txt); do
  for f in cpc-section*; do
    grep -w "$crm" "$f" | sed -e "s/^/$crm /" >> res_crm_search2.csv
  done
done
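For readers who prefer to stay in R, the same whole-word search can be sketched as follows. This is only an illustration under assumptions: `crm_list` and `cpc_titles` are small hypothetical stand-ins for CRM.list.txt and the downloaded CPC title files, and the `\\b` word boundaries are meant to approximate `grep -w`.

```r
# Toy stand-ins for CRM.list.txt and the CPC title files (hypothetical values)
crm_list <- c("cobalt", "lithium", "indium")
cpc_titles <- data.frame(
  cpc_class_symbol = c("C22B23/00", "H01M10/052", "C01G15/00", "A01B1/00"),
  title = c("Obtaining nickel or cobalt",
            "Li-accumulators",
            "Compounds of gallium, indium or thallium",
            "Hand tools"),
  stringsAsFactors = FALSE
)

# For each CRM, keep the titles that contain it as a whole word
# (case-sensitive, like grep -w without -i)
hits <- do.call(rbind, lapply(crm_list, function(crm) {
  pattern <- paste0("\\b", crm, "\\b")
  found <- cpc_titles[grepl(pattern, cpc_titles$title), ]
  if (nrow(found) == 0) return(NULL)
  cbind(CRM = crm, found)
}))
print(hits)
```

In this toy data the title "Li-accumulators" is not matched by "lithium", which illustrates that a search on official names only finds titles that spell the name out.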
Then, we get the appln_id of every patent application identified by these CPC codes.
We add the inpadoc_family_id to the dataset to create res.inpa, and we merge it with the list we already have of patent applications detected using mentions in patent abstracts (inpa.tm contains pairs of inpadoc_family_id and CRM detected).
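The step that adds the family identifier is not shown in the post, so the following is only a sketch of how it might look. It assumes that the family ids come from an application table with appln_id and inpadoc_family_id columns (as in PATSTAT's tls201_appln), and that the CRM column is renamed to CRM.cpc; the data frame `appln_toy` and all values are hypothetical, and base R is used so the example runs without extra packages.

```r
# Hypothetical output of the CPC step: one row per (CRM, application)
appln_ids_toy <- data.frame(
  CRM = c("cobalt", "cobalt", "indium"),
  appln_id = c(101, 102, 103),
  stringsAsFactors = FALSE
)

# Toy extract of an application table carrying the family identifier
# (assumption: same columns as PATSTAT's tls201_appln)
appln_toy <- data.frame(
  appln_id = c(101, 102, 103, 104),
  inpadoc_family_id = c(9001, 9001, 9002, 9003)
)

# Attach the family id, rename the CRM column, keep one row per family and CRM
res_inpa_toy <- merge(appln_ids_toy, appln_toy, by = "appln_id")
names(res_inpa_toy)[names(res_inpa_toy) == "CRM"] <- "CRM.cpc"
res_inpa_toy <- unique(res_inpa_toy[, c("inpadoc_family_id", "CRM.cpc")])
print(res_inpa_toy)
```

Working at the family level means that two applications of the same family (101 and 102 here) count only once.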
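# Union of the inner, left and right joins between both detection methods
res <- unique(rbindlist(list(
  merge(inpa.tm,res.inpa[,c("inpadoc_family_id","CRM.cpc")]),
  merge(inpa.tm,res.inpa[,c("inpadoc_family_id","CRM.cpc")], all.x = TRUE),
  merge(inpa.tm,res.inpa[,c("inpadoc_family_id","CRM.cpc")], all.y = TRUE)
)))
# Families detected by only one method get "Not found" for the other
res[is.na(res)] <- "Not found"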
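The union of the inner, left and right joins should contain the same rows as a single full outer join. The sketch below illustrates this on toy data with base R's `merge(..., all = TRUE)`; the data frames `tm_toy` and `cpc_toy` and their values are hypothetical, and joining on inpadoc_family_id alone is an assumption about the original key.

```r
# Toy versions of inpa.tm and res.inpa (hypothetical values)
tm_toy <- data.frame(inpadoc_family_id = c(1, 2, 3),
                     CRM.tm = c("cobalt", "lithium", "indium"),
                     stringsAsFactors = FALSE)
cpc_toy <- data.frame(inpadoc_family_id = c(2, 3, 4),
                      CRM.cpc = c("lithium", "indium", "cobalt"),
                      stringsAsFactors = FALSE)

# Full outer join: keep families found by either method
res_toy <- merge(tm_toy, cpc_toy, by = "inpadoc_family_id", all = TRUE)

# Mark the method that missed a family
res_toy[is.na(res_toy)] <- "Not found"
print(res_toy)
```

Family 1 is found only by the text search and family 4 only by the CPC search, so each has "Not found" in the other column.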
To compare coverage for green and non-green technologies, we create two Sankey diagrams using this code:
# Preparing links data
dt.net <- res[,.(Nfam = uniqueN(inpadoc_family_id)), by=list(CRM.tm,CRM.cpc2)]
dt.net$CRM.tm <- paste0(dt.net$CRM.tm, " (tm)")
dt.net$CRM.cpc2 <- paste0(dt.net$CRM.cpc2, " (cpc)")

# Preparing nodes data
dt.nodes <- data.frame(
  name = unique(c(dt.net$CRM.tm,dt.net$CRM.cpc2))
)
dt.nodes$id <- 1:nrow(dt.nodes)
dt.links <- merge(dt.net,dt.nodes, by.x=c("CRM.tm"), by.y=c("name"))
dt.links <- merge(dt.links,dt.nodes, by.x=c("CRM.cpc2"), by.y=c("name"))
dt.links <- dt.links[,c("id.x","id.y","Nfam")]
names(dt.links) <- c("source","target","value")
# networkD3 expects zero-based node indices
dt.links$source <- dt.links$source - 1
dt.links$target <- dt.links$target - 1
# Sankey diagram
library(networkD3)
sankeyNetwork(Links = dt.links, Nodes = dt.nodes, Source = "source",
Target = "target", Value = "value", NodeID = "name",
units = "Nfam", fontSize = 12, nodeWidth = 30)
Results
In the two diagrams, the number of patent families per CRM detected using the text search in abstracts is shown on the left-hand side, while the families detected using CPC titles appear on the right-hand side.
For both sets of technologies (green and all patents), we can clearly see that the CPC methodology under-represents the presence of CRM in patents compared with the text search in abstracts.
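The visual comparison can also be backed by a simple count of families found by each method. The sketch below uses toy data only: `cmp_toy` and its values are hypothetical and are not the results of this study. It assumes a table shaped like res, with "Not found" marking the method that missed a family.

```r
# Toy comparison table (hypothetical values, not the study's results)
cmp_toy <- data.frame(
  inpadoc_family_id = c(1, 2, 3, 4, 5),
  CRM.tm  = c("cobalt", "lithium", "indium", "Not found", "cobalt"),
  CRM.cpc = c("Not found", "lithium", "indium", "cobalt", "Not found"),
  stringsAsFactors = FALSE
)

# Families detected by each method
found_tm  <- unique(cmp_toy$inpadoc_family_id[cmp_toy$CRM.tm  != "Not found"])
found_cpc <- unique(cmp_toy$inpadoc_family_id[cmp_toy$CRM.cpc != "Not found"])

# Number of families found by only one method, or by both
coverage <- c(
  text_search_only = length(setdiff(found_tm, found_cpc)),
  cpc_only         = length(setdiff(found_cpc, found_tm)),
  both             = length(intersect(found_tm, found_cpc))
)
print(coverage)
```

A large text_search_only count relative to cpc_only would correspond to the under-representation visible in the diagrams.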