Machine Learning and Natural Language Processing Case Studies

The digitalization of society has produced vast amounts of new data, and in new forms. From transactional data that capture events in the field, to electronic health records, to sensor geolocation, images, and text, we have developed methods and tools that make sense of this wealth of information. Machine learning (ML), which can extract regular patterns from all types of data, opens new possibilities for our researchers looking to augment traditional research techniques.

Working with subject matter experts and methodologists, our data scientists develop applications for natural language processing (NLP) using both traditional and cutting-edge deep learning models in a variety of tasks, from identifying key information in interviewer comments in traditional surveys to classifying clinical notes in electronic health records.

We embed ML models in data collection projects to identify the most cost-effective strategy to gain cooperation from survey respondents or to detect potential interview falsification. Using these new methods, we have built new tools that extract insights from image, video, or audio files and improve the efficiency of data collection, evaluation, and analysis.

Drug Abuse Warning Network

The Substance Abuse and Mental Health Services Administration’s (SAMHSA’s) Drug Abuse Warning Network (DAWN) study collects data in 50 hospitals across the United States. The goals are to (1) identify new and emerging drugs and use patterns, (2) serve as an early warning system for drug events, and (3) produce immediately usable data. Our challenge is to provide continuous review of emergency department (ED) records to identify key data elements in drug- and alcohol-related visits.

To ensure rigorous data quality while keeping costs low, we developed ML models to review and route DAWN data to expert reviewers who must decide whether a drug caused or contributed to a person’s ED visit. The models we developed assign a probability score indicating whether the ED visit is likely to be in scope for DAWN and the likely category of the visit. These models are retrained periodically to increase their efficiency. The result is that DAWN data are of very high quality without relying on human review of each case.
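
The routing idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Visit` class, the `route` function, the queue names, and both thresholds are hypothetical and are not taken from the DAWN system, which does not publish its cut-offs here.

```python
from dataclasses import dataclass

# Hypothetical cut-offs for illustration; the source does not state any.
AUTO_OUT = 0.05  # at or below this, the visit is treated as likely out of scope
AUTO_IN = 0.95   # at or above this, the visit is treated as likely in scope


@dataclass
class Visit:
    visit_id: str
    p_in_scope: float  # model's probability that the ED visit is in scope


def route(visit: Visit) -> str:
    """Return a (hypothetical) queue name for one scored ED visit."""
    if not 0.0 <= visit.p_in_scope <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if visit.p_in_scope >= AUTO_IN:
        return "likely_in_scope"
    if visit.p_in_scope <= AUTO_OUT:
        return "likely_out_of_scope"
    # Uncertain scores are the ones worth an expert reviewer's time.
    return "expert_review"


# Made-up scores, not real DAWN data.
queues = [route(Visit("v1", 0.99)), route(Visit("v2", 0.50)), route(Visit("v3", 0.01))]
```

Under this kind of design, moving the two thresholds trades reviewer workload against the risk of misrouting a visit, which is one plausible reason to revisit them whenever the models are retrained.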

National Diabetes Surveillance

As part of our work for the CDC’s national diabetes surveillance strategy, we developed and fielded a telephone survey of patients with diabetes in a large health system and acquired matching electronic health record (EHR) data for the survey sample. By linking these two data sources, we were able to validate survey-based and EHR-based algorithms for determining patients’ type of diabetes against a “gold standard” diagnosis established by manual review of patient charts. Using supervised ML models, we were able to develop a conditional inference tree that classified each adult patient as having type 1, type 2, or another type of diabetes with very high accuracy.

Medical Expenditure Panel Survey

During data collection, field interviewers often append electronic notes or “comments” to a case in open text fields to request updates to case-level data. These comments might contain actionable information that alerts data technicians to unusual responses or circumstances that can affect data quality. Trends in the topics or content of the comments may provide valuable insights into imperfect question design, training gaps, or interviewer bias.

At the same time, comments are often superfluous or do not contain enough detail to be actionable, and processing them is very time-consuming. The ability to reliably assess these comments and apply standardized data editing procedures quickly is key to improving data quality and increasing efficiency.

We developed a novel application of ML technologies to assist in the evaluation of these comments. Using thousands of comments from the Medical Expenditure Panel Survey (MEPS), we built features that were fed to an ML model to predict a grouping category for each comment. The model achieved high accuracy and was incorporated into a production tool for editing. A qualitative evaluation of the tool also provided encouraging results. This application of ML increased processing efficiency while maintaining exacting standards for data quality.
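
To make the "features fed to an ML model" step concrete, here is a minimal sketch of a comment classifier. The source does not say which features or model were used, so this example substitutes word counts and a multinomial naive Bayes classifier with add-one smoothing. The class name, the category labels, and the training comments are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the comment and keep alphabetic word tokens only."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesCommentClassifier:
    """Multinomial naive Bayes over word counts (an assumed stand-in model)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.class_counts = Counter()            # label -> number of comments
        self.vocab = set()

    def fit(self, comments, labels):
        for text, label in zip(comments, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.class_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                # Add-one smoothing so unseen words do not zero out a class.
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up toy comments and hypothetical categories, not real survey data.
train = [
    ("please change respondent date of birth to 1975", "update_needed"),
    ("correct the spelling of the provider name", "update_needed"),
    ("update household member age, entered wrong value", "update_needed"),
    ("respondent was friendly", "no_action"),
    ("interview went fine no issues", "no_action"),
    ("nothing to report", "no_action"),
]
clf = NaiveBayesCommentClassifier().fit(*zip(*train))
predicted = clf.predict("please correct the date of birth")
```

A production tool would be trained on far more data and evaluated on held-out comments; the sketch only shows how free-text comments can be turned into counts and mapped to a grouping category.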
