1223 01:52:12,960 --> 01:52:17,430 Luc Boruta: Um, so hi, everyone. I'm Luc, I'm the head 1224 01:52:17,430 --> 01:52:21,450 of Thunken. We're a small data science organization, we focus 1225 01:52:21,450 --> 01:52:23,850 on research data, and we're trying to make impact 1226 01:52:23,850 --> 01:52:30,540 assessments and measurement diverse and fair. And because 1227 01:52:30,540 --> 01:52:35,850 I'm, I'll start with a disclaimer that whenever you're 1228 01:52:35,850 --> 01:52:39,090 looking into impact assessment and determine different things, 1229 01:52:39,090 --> 01:52:41,940 there's a divide between quantitative approaches and 1230 01:52:41,940 --> 01:52:46,470 qualitative approaches, what I call big quant and big qual, and the 1231 01:52:46,470 --> 01:52:50,970 community is hyper-focused on quantitative methods. I myself 1232 01:52:50,970 --> 01:52:55,650 fell down the rabbit hole about five years ago. So I've included 1233 01:52:55,680 --> 01:52:59,700 pointers to discussions and projects around more 1234 01:52:59,700 --> 01:53:02,970 qualitative approaches, including a survey and Overton.io. 1235 01:53:02,970 --> 01:53:06,360 But this short talk will focus on the issues of 1236 01:53:06,420 --> 01:53:11,910 fairness in quantitative methods. And because I'm really 1237 01:53:11,910 --> 01:53:15,420 fun at parties, I will start with five sad hard truths 1238 01:53:15,450 --> 01:53:20,280 about impact measurements. So the first is that impact cannot 1239 01:53:20,280 --> 01:53:23,670 be measured, and it cannot be measured because not everything 1240 01:53:23,670 --> 01:53:27,270 that can be counted counts, then more importantly, because not 1241 01:53:27,270 --> 01:53:31,590 everything that counts can be counted. Then, because metrics 1242 01:53:31,590 --> 01:53:35,940 will never be alt enough.
And I use alt in the sense of 1243 01:53:35,970 --> 01:53:40,980 alternative, fair, diverse, inclusive, and then because 1244 01:53:40,980 --> 01:53:46,200 metrics are bound to be abused. And I'll give you examples 1245 01:53:46,230 --> 01:53:49,350 during the talk of what I mean by these five sad hard truths 1246 01:53:49,350 --> 01:53:54,210 about impact measurement. But first, if you want to measure 1247 01:53:54,210 --> 01:53:59,460 impact or attention, I'll start with what you should not do. And 1248 01:53:59,460 --> 01:54:01,800 it's called the quantitative fallacy. It's also called the 1249 01:54:01,800 --> 01:54:06,030 McNamara fallacy. So first, you'll start by measuring what 1250 01:54:06,030 --> 01:54:10,110 can easily be measured, you know, fair enough. But then over 1251 01:54:10,110 --> 01:54:12,810 time, you'll start disregarding what cannot be 1252 01:54:12,840 --> 01:54:16,530 easily measured. And you'll start presuming that what cannot be 1253 01:54:16,530 --> 01:54:20,190 easily measured really is unimportant. And even worse, 1254 01:54:20,190 --> 01:54:22,830 you'll presume that what cannot be easily measured really doesn't 1255 01:54:22,830 --> 01:54:28,380 exist anyway. And, of course, you know, that's everything that 1256 01:54:28,380 --> 01:54:31,650 we should, as a community, avoid when we try to measure impact 1257 01:54:31,650 --> 01:54:36,330 and attention. And that's a very high-level description. I'll 1258 01:54:36,330 --> 01:54:40,590 move to a very low-level practical example that I really 1259 01:54:40,590 --> 01:54:44,430 like. So a few years ago, a researcher from Peru named 1260 01:54:44,430 --> 01:54:47,130 Roxana Quispe Collantes defended her PhD, so 1261 01:54:47,130 --> 01:54:51,390 that's, that's cool.
But she was also the first person in history 1262 01:54:51,390 --> 01:54:56,520 to defend her PhD in Quechua, Quechua being one of the, not native but, 1263 01:54:56,520 --> 01:55:02,310 indigenous languages in that part of the world. So her work 1264 01:55:02,340 --> 01:55:06,150 itself generated attention, her work itself has an impact, but 1265 01:55:06,150 --> 01:55:10,740 the context, the fact of being the first person to defend a PhD 1266 01:55:10,740 --> 01:55:13,620 dissertation in Quechua, was also important. It's important for 1267 01:55:13,620 --> 01:55:17,190 the Quechua-speaking community, for Peruvian universities, and 1268 01:55:17,190 --> 01:55:21,930 also for other linguistic minorities. So her work and her 1269 01:55:22,290 --> 01:55:25,890 Quechua defense generated attention in national Spanish- 1270 01:55:25,890 --> 01:55:30,750 language newspapers, it also generated attention from 1271 01:55:31,560 --> 01:55:34,740 international English-language newspapers, and then we can 1272 01:55:34,740 --> 01:55:38,070 assume that she'll be eventually cited in Quechua-language 1273 01:55:38,070 --> 01:55:42,720 Wikipedia. But then what happens? Can we measure the 1274 01:55:42,720 --> 01:55:46,110 attention, can we measure the impact her work and her Quechua 1275 01:55:46,140 --> 01:55:50,850 defense had? And so my project is called Cobaltmetrics. The 1276 01:55:50,850 --> 01:55:53,730 other players in the field are Event Data, Altmetric.com 1277 01:55:53,760 --> 01:55:57,690 and Plum Analytics. We all collect data in very different 1278 01:55:57,690 --> 01:56:01,380 ways, we all present data in very different ways. But in that 1279 01:56:01,380 --> 01:56:06,000 case, for different reasons, we all come to the same conclusion 1280 01:56:06,000 --> 01:56:11,700 that we do not have any trace of the attention generated by the 1281 01:56:11,700 --> 01:56:17,310 work or the defense.
It's almost nonexistent. You can, in altmetrics 1282 01:56:17,310 --> 01:56:22,080 for example, find traces of tweets regarding the papers 1283 01:56:22,080 --> 01:56:25,440 she published during her studies, but the dissertation 1284 01:56:25,440 --> 01:56:27,990 itself is absent from our systems, which of course is 1285 01:56:28,020 --> 01:56:32,580 very problematic, and it's something we should stress. Um, 1286 01:56:33,210 --> 01:56:40,440 so the main trick that impact data providers use is 1287 01:56:40,440 --> 01:56:45,780 that you want impact and we sell you attention. So citations, 1288 01:56:45,780 --> 01:56:48,570 mentions and altmetrics are proxies for impact. And there's 1289 01:56:48,570 --> 01:56:52,230 nothing wrong with proxies. It's a standard trick in 1290 01:56:52,230 --> 01:56:54,240 statistics and machine learning: when you cannot observe 1291 01:56:54,240 --> 01:56:57,720 something directly, you observe something else, for which you 1292 01:56:57,720 --> 01:57:00,180 have good reasons to think that they are strongly, tightly 1293 01:57:00,180 --> 01:57:00,870 correlated. 1294 01:57:02,130 --> 01:57:04,800 Citations, mentions and altmetrics measure attention, 1295 01:57:04,830 --> 01:57:08,520 attention does correlate with impact, but so do influence and 1296 01:57:08,520 --> 01:57:12,060 privilege. And of course I'm not the first person to talk about that. 1297 01:57:12,060 --> 01:57:15,390 If you want a longer, deeper discussion of attention versus 1298 01:57:15,390 --> 01:57:20,610 impact, Sugimoto's paper is really a resource I recommend. 1299 01:57:22,410 --> 01:57:25,710 So about Cobaltmetrics, and what we try to do to 1300 01:57:25,710 --> 01:57:28,500 change things. We've been described as the new kid on the 1301 01:57:28,500 --> 01:57:30,720 block of altmetrics, and we try to make altmetrics 1302 01:57:30,720 --> 01:57:35,640 genuinely alternative.
And the main design principle is that we 1303 01:57:35,640 --> 01:57:38,430 do not define what is citable. So we have two complementary 1304 01:57:38,430 --> 01:57:41,340 services, the Citation Index, which is very comparable to what 1305 01:57:41,340 --> 01:57:45,240 other citation indexes, whether it's Altmetric, or whether it's 1306 01:57:45,240 --> 01:57:49,470 Scopus or Web of Science, offer. And then we have the URI 1307 01:57:49,470 --> 01:57:53,520 transmutation API, which sounds scary, but with a visual 1308 01:57:53,520 --> 01:57:57,900 example, I hope will be simple enough to be interesting and 1309 01:57:57,900 --> 01:58:03,990 relevant. So to explain what we do, and then why it's important 1310 01:58:03,990 --> 01:58:07,320 in terms of fairness and inclusiveness, let's say you 1311 01:58:07,350 --> 01:58:10,890 have that small knowledge graph. You have people who are 1312 01:58:10,890 --> 01:58:13,980 contributors of any kind, they will be identified by their 1313 01:58:14,400 --> 01:58:17,310 persistent identifiers, for example, an ORCID iD, an 1314 01:58:17,310 --> 01:58:20,280 institution, which is identified by a GRID ID, and then you have a 1315 01:58:20,280 --> 01:58:24,630 research project, a dataset, a paper, maybe a service with an 1316 01:58:24,630 --> 01:58:29,040 API, and they form a research graph that you can walk to know 1317 01:58:29,610 --> 01:58:33,780 who contributed to what project, you know, what was generated by 1318 01:58:33,780 --> 01:58:37,950 that institution or that research project. And what we do 1319 01:58:37,950 --> 01:58:42,990 is that we add an extra layer on top of virtually any knowledge 1320 01:58:42,990 --> 01:58:47,970 graph with extra identifiers that are not necessarily 1321 01:58:47,970 --> 01:58:53,220 canonical, standard, or persistent.
So for example, if 1322 01:58:53,220 --> 01:58:57,120 you don't have an ORCID iD, what we do is that you can start from 1323 01:58:57,120 --> 01:59:00,270 the email address, we'll give you the ORCID iD, and 1324 01:59:00,270 --> 01:59:03,300 then you can start walking the graph. If you don't have a DOI, 1325 01:59:03,300 --> 01:59:05,700 you can start from the landing page on the publisher's website, 1326 01:59:05,730 --> 01:59:08,040 we'll give you the DOI, and then you can access the knowledge 1327 01:59:08,040 --> 01:59:11,790 graph. So we're giving you extra entry points for you to access 1328 01:59:11,790 --> 01:59:15,720 the graph. And the second thing is that when we track mentions, 1329 01:59:16,440 --> 01:59:22,110 attention around research outputs, we'll not only look for 1330 01:59:22,110 --> 01:59:26,580 the nice, fancy DOI citations, but also for anything that we 1331 01:59:26,580 --> 01:59:33,240 know identifies that resource. So we'll gather more citations and 1332 01:59:33,240 --> 01:59:37,260 more information around your content, and we'll never have a 1333 01:59:37,290 --> 01:59:40,140 wow effect. But really, what we're trying to do is to observe 1334 01:59:40,140 --> 01:59:43,950 the long tail of research. And I guess you're familiar with the 1335 01:59:43,950 --> 01:59:47,610 term. When you observe the distribution of, whether it's 1336 01:59:47,610 --> 01:59:49,440 attention, impact factor, mentions, or other metrics, 1337 01:59:49,440 --> 01:59:52,410 you'll have a heavily skewed long-tail distribution where 1338 01:59:52,530 --> 01:59:58,560 a few items will generate a lot of attention and have a 1339 01:59:58,560 --> 02:00:01,680 big impact, and most items will generate little to no 1340 02:00:01,680 --> 02:00:08,730 attention, and then maybe have little to no impact. And a 1341 02:00:08,730 --> 02:00:12,270 technical note, because there's something that we should look at.
1342 02:00:12,270 --> 02:00:14,100 So I was trained as a linguist, and there's something from 1343 02:00:14,100 --> 02:00:17,220 computational linguistics and computational biology that I 1344 02:00:17,220 --> 02:00:21,120 don't hear nearly enough in this community. It's the distinction 1345 02:00:21,120 --> 02:00:24,450 between a structural zero and a sampling zero. So very briefly, 1346 02:00:25,290 --> 02:00:28,440 when you sample data, and when you measure stuff, a sampling zero 1347 02:00:28,440 --> 02:00:31,110 will be something you don't observe because of the 1348 02:00:31,110 --> 02:00:35,010 limitations of your sampling method, or your data sample not 1349 02:00:35,010 --> 02:00:37,470 being big enough. So for example, if you have a system 1350 02:00:37,470 --> 02:00:41,520 that focuses on resources with persistent identifiers, any 1351 02:00:41,520 --> 02:00:44,610 resource that doesn't have a persistent identifier will go to 1352 02:00:44,610 --> 02:00:48,510 zero, but that's a sampling zero. Then a structural zero is 1353 02:00:48,540 --> 02:00:52,080 unobserved due to the thing being nonexistent. So, random example, 1354 02:00:52,080 --> 02:00:55,890 I've never written a book on birds or surveillance drones. So 1355 02:00:55,890 --> 02:00:59,550 if you look into any impact measurement or attention 1356 02:00:59,580 --> 02:01:02,400 tracking, citation tracking, of course, that's going to be zero. 1357 02:01:02,400 --> 02:01:06,120 And that's a good structural zero. And I think, as data 1358 02:01:06,120 --> 02:01:08,820 providers, we should be able to distinguish between structural 1359 02:01:08,820 --> 02:01:10,920 zeros and sampling zeros. And there are statistical 1360 02:01:10,920 --> 02:01:13,740 methods to fix that, or to address that.
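A minimal sketch of the distinction between the two kinds of zero, not something shown in the talk. It assumes a data provider can say two things about each resource: whether the resource exists at all, and whether the collection method could have seen it. The names `Count`, `resource_exists`, `in_coverage` and `classify` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Count:
    """One reported count for one resource (all field names are hypothetical)."""
    resource: str
    mentions: int
    resource_exists: bool  # False: there is nothing that could have been mentioned
    in_coverage: bool      # False: the collector could not have seen this resource


def classify(count: Count) -> str:
    """Label a count so that a zero is never reported without its likely cause."""
    if count.mentions > 0:
        return "observed"
    if not count.resource_exists:
        return "structural zero"
    if not count.in_coverage:
        return "sampling zero"
    # The resource exists and is in coverage, yet nothing was found: the data
    # alone cannot tell a too-small sample from a genuine absence of attention.
    return "zero, cause unknown"


counts = [
    Count("paper with a DOI", 12, True, True),
    Count("dissertation without a persistent identifier", 0, True, False),
    Count("book that was never written", 0, False, False),
]
labels = {c.resource: classify(c) for c in counts}
for resource, label in labels.items():
    print(f"{resource}: {label}")
```

The point of the sketch is only that the two zeros get different labels in the output. The statistical methods mentioned in the talk for addressing sampling zeros are a separate matter and are not implemented here.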
1361 02:01:15,660 --> 02:01:18,030 So like I said, we're trying to make altmetrics genuinely 1362 02:01:18,390 --> 02:01:22,920 alternative. We crawl all types of web resources, or we 1363 02:01:22,920 --> 02:01:25,710 monitor all types of web resources, from public 1364 02:01:25,710 --> 02:01:31,920 sources, we cover more than 300 different languages, 67, 1365 02:01:31,920 --> 02:01:34,530 something like that, different types of identifiers that are 1366 02:01:34,530 --> 02:01:37,200 not necessarily persistent. The main limitation that we have is 1367 02:01:37,200 --> 02:01:40,140 that we're most limited by storage cost, because we need to 1368 02:01:40,140 --> 02:01:43,380 index a lot of things that are never going to be relevant. For 1369 02:01:43,380 --> 02:01:49,740 example, a Wikipedia page that cites a DOI is a citation, but 1370 02:01:49,740 --> 02:01:53,250 a Wikipedia page that cites an Instagram page, for example, 1371 02:01:53,250 --> 02:01:56,190 is also a citation in our case, because we do not want to put 1372 02:01:56,190 --> 02:02:00,870 a priori filters on the data that we collect. The consequence is that we 1373 02:02:00,870 --> 02:02:03,930 currently run Cobaltmetrics at a loss. But we're also the 1374 02:02:03,930 --> 02:02:06,750 living proof that other providers, and mostly commercial 1375 02:02:06,750 --> 02:02:09,270 providers, could be more inclusive. There's no technical 1376 02:02:09,270 --> 02:02:12,090 reason not to support and handle images, there's no technical 1377 02:02:12,090 --> 02:02:16,980 reason not to support 16 types of identifiers. The main reason 1378 02:02:16,980 --> 02:02:19,680 is that smaller linguistic communities correlate with 1379 02:02:19,680 --> 02:02:22,170 smaller market segments. So it's not very interesting for 1380 02:02:22,170 --> 02:02:26,310 Altmetric, or Plum, to invest into those datasets and resources.
1381 02:02:28,740 --> 02:02:33,690 Um, so where do we go from here? I started by saying that impact 1382 02:02:33,690 --> 02:02:37,260 cannot be measured, I gave you a recipe of how not to measure 1383 02:02:37,260 --> 02:02:40,260 impact. And I've given you five sad hard truths 1384 02:02:40,260 --> 02:02:43,350 about impact measurement. So I think for the community as a 1385 02:02:43,350 --> 02:02:47,520 whole, we need to adopt design principles that can be verified, 1386 02:02:47,520 --> 02:02:51,690 so do not declare that you're here for the greater good. And 1387 02:02:51,690 --> 02:02:54,660 so like, for example, in the POSI principles, so the 1388 02:02:54,660 --> 02:02:57,660 Principles of Open Scholarly Infrastructure, one of the 1389 02:02:57,660 --> 02:03:01,110 principles is to be open source. And you can verify that 1390 02:03:01,110 --> 02:03:04,710 pretty easily. You're open source or you're not. But that 1391 02:03:04,710 --> 02:03:08,850 can be verified. And so I think that's important, because we 1392 02:03:08,850 --> 02:03:12,180 cannot evaluate what we cannot measure. And then I think we'll 1393 02:03:12,180 --> 02:03:15,630 send it of course being for funding and resourcing for open 1394 02:03:16,350 --> 02:03:19,230 slash fair infrastructure. And I think both words are important, 1395 02:03:19,230 --> 02:03:22,830 because open doesn't necessarily mean fair, and fair doesn't 1396 02:03:22,830 --> 02:03:26,790 necessarily have to be open. Then, as far as we are concerned 1397 02:03:26,790 --> 02:03:31,770 for Cobaltmetrics, we might have to stop tracking attention, because 1398 02:03:31,770 --> 02:03:35,010 we're a tiny, self-funded organization. And we might have 1399 02:03:35,010 --> 02:03:37,440 to focus on our transmutation, which is our main contribution 1400 02:03:37,440 --> 02:03:44,010 to the community.
And if we zoom out for a minute and think of 1401 02:03:44,040 --> 02:03:49,950 impact measurements, and how to better serve the community, from 1402 02:03:49,980 --> 02:03:52,230 a technical point of view, I think what's missing from the 1403 02:03:52,230 --> 02:03:56,310 conversation is that we only hear about success stories, and 1404 02:03:56,310 --> 02:04:00,450 you never hear about either nightmares, you know, stuff that 1405 02:04:00,450 --> 02:04:03,570 didn't work out and post-mortems, or you never hear about 1406 02:04:03,570 --> 02:04:06,840 the limitations of the systems. And I think we should disclose 1407 02:04:06,840 --> 02:04:10,230 the limitations of our methods. A good example, you're 1408 02:04:11,280 --> 02:04:14,340 maybe not aware of it, but there's a NISO working group 1409 02:04:14,340 --> 02:04:19,410 that developed a few years ago a self-reporting table for 1410 02:04:19,440 --> 02:04:21,570 altmetrics providers according to their code of 1411 02:04:21,570 --> 02:04:23,910 conduct. And I think it's a really good way. Altmetric has 1412 02:04:23,910 --> 02:04:27,870 done it, Plum has done it. We've done it recently. It's not 1413 02:04:27,870 --> 02:04:30,930 machine readable, but it is very standardized. And it allows you 1414 02:04:30,930 --> 02:04:32,730 to compare the methods of different altmetrics 1415 02:04:32,730 --> 02:04:36,810 providers and to see what matches your values. I also 1416 02:04:36,810 --> 02:04:39,570 think we need to provide audit trails of how and when the data 1417 02:04:39,570 --> 02:04:43,650 was collected, and those audit trails or logs must be both 1418 02:04:43,650 --> 02:04:46,890 human and machine readable. I think that's very important. And 1419 02:04:46,890 --> 02:04:51,180 then I don't really, it's an idea I had a few days ago, and I 1420 02:04:51,180 --> 02:04:54,090 don't really know how to phrase that.
But I think these audit 1421 02:04:54,090 --> 02:04:58,110 trails must be available to the end users. So if I, as a data 1422 02:04:58,110 --> 02:05:02,520 provider, want to make sure that the limitations of my system are 1423 02:05:02,580 --> 02:05:07,560 open, maybe I have to force people who use my data to be 1424 02:05:07,560 --> 02:05:10,770 open about the limitations of everything that was part of the 1425 02:05:10,770 --> 02:05:13,740 chain for them to display an indicator, that's either a number 1426 02:05:13,740 --> 02:05:18,330 or a visual indicator. And that links to conversations we had in 1427 02:05:18,330 --> 02:05:22,800 the first session about. So my idea was to use maybe the 1428 02:05:22,830 --> 02:05:25,800 viral component of viral licenses, or share-alike clauses, 1429 02:05:25,800 --> 02:05:33,210 to use copyright tools to enforce non-copyright values. So I'm 1430 02:05:33,210 --> 02:05:37,290 curious to see if people here have heard of that, or if they 1431 02:05:37,290 --> 02:05:40,350 think it's a really bad idea to re-use licenses for stuff that is 1432 02:05:40,350 --> 02:05:47,430 not really licensing. Um, and, yes, I'd like to finish by 1433 02:05:47,430 --> 02:05:50,670 recommending two books that are very, very important. They have 1434 02:05:50,700 --> 02:05:52,620 nothing to do with impact measurement. But I think they 1435 02:05:52,620 --> 02:05:55,230 could serve as cautionary tales of what could happen. One is 1436 02:05:55,230 --> 02:05:57,720 Weapons of Math Destruction by Cathy O'Neil. The other one is 1437 02:05:57,720 --> 02:06:01,350 Automating Inequality by Virginia Eubanks. And they are really 1438 02:06:01,350 --> 02:06:05,010 good, full of examples of how big data can increase 1439 02:06:05,010 --> 02:06:08,370 inequality. And I think we need to act before we become 1440 02:06:08,370 --> 02:06:12,960 examples for the next books in the series. Thank you very much.