Development and Validity Review of GenAI-HITL-Based Constructed-Response Assessment Tasks Linked to 'Reading and Writing'

박고운

Korea National University of Education

국어교육학연구 (Korean Language Education Research), Vol. 60, No. 4, pp. 129-174 (2025)

DOI: 10.20880/kler.2025.60.4.129

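Abstract

This study developed constructed-response assessment tasks aligned with the achievement standards of 'Reading and Writing' in the 2022 revised Korean language curriculum, using a collaborative approach between generative AI and human experts (human-in-the-loop, HITL), and presented preliminary validation evidence for their educational validity through expert criterion-based ratings. A three-stage protocol integrating context engineering and Chain of Thought was designed to generate tasks in which the passage, item, rubric, and explanation are linked. The generated outputs were rated by 18 in-service Korean language teachers from 13 cities and provinces nationwide on 12 items across 3 domains. Quantitative analysis showed an overall mean of 4.32 points, with high ratings for alignment with the achievement standards and for structural coherence. The internal consistency of the rating instrument and the inter-rater agreement were also found to be at a good level. In contrast, appropriateness for learner level was rated relatively low, suggesting that AI can implement curriculum elements that are amenable to formalization but is limited in reflecting informal elements such as learners' developmental stage and classroom context. The study confirmed that AI-generated tasks can function as drafts that teachers can adjust, and it suggests that the balance between formal validity and substantive validity may be strengthened through human-AI collaboration.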

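Keywords

Automatic item generation, generative AI, human-AI collaboration, constructed-response assessment, Reading and Writing, 2022 revised curriculum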

