Hierarchical Dual-System Reinforcement Learning for Long-Horizon Autonomous Planning with Large Language Models

Nathan R. Lawrence; Kaihui Shao

Authors

Nathan R. Lawrence, School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA.
Kaihui Shao, Department of Computer Science, University of North Texas, Denton, TX, USA.

Keywords:

hierarchical reinforcement learning, dual-system theory, large language models, long-horizon planning, autonomous systems, socio-technical infrastructure, governance

Abstract

This paper introduces a hierarchical dual-system reinforcement learning framework designed to address the challenges of long-horizon autonomous planning in environments where large language models serve as both reasoning components and planning priors. The proposed architecture draws upon the cognitive distinction between fast, intuitive reasoning and slow, deliberative reasoning, adapting it to a two-tier reinforcement learning hierarchy. At the lower level, a high-frequency control system learns primitive actions and local policies through trial-and-error interaction, while the upper level employs a deliberative system that leverages pretrained large language models to generate abstract subgoals, evaluate long-term consequences, and restructure task representations. The integration of large language models into this hierarchy introduces both opportunities and structural tensions, including issues of computational cost, semantic grounding, real-time adaptability, and ethical governance. This paper examines the system-level trade-offs inherent in such an architecture, focusing on deployment robustness, fairness in planning outcomes, sustainability of large-scale inference, and the policy implications of embedding generative models within autonomous planning pipelines. Through case illustrations in domains such as robotic navigation, logistics scheduling, and automated scientific experimentation, we analyze how the dual-system hierarchy can mitigate the brittleness of purely language-driven planning while retaining the flexibility of neural reasoning. The paper concludes by outlining a research agenda for improving the transparency, reliability, and scalability of hierarchical dual-system RL systems in real-world infrastructures.
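The two-tier hierarchy described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the environment is a one-dimensional corridor, the deliberative system is a hand-written stub (`propose_subgoal`) standing in for a pretrained language model, and the lower level uses tabular Q-learning rather than whatever learner the authors employ. The names `LowLevelController`, `run_episode`, and `train` are hypothetical.

```python
import random

GRID = 8            # 1-D corridor with cells 0..GRID-1; the task goal is the last cell
ACTIONS = (-1, +1)  # primitive actions: step left or step right


def propose_subgoal(state, goal, stride=3):
    """Hypothetical stand-in for the deliberative (slow) system.

    A real system would query a pretrained language model here; this stub
    only returns a waypoint at most `stride` cells closer to the goal.
    """
    if state < goal:
        return min(state + stride, goal)
    return max(state - stride, goal)


class LowLevelController:
    """Fast system: tabular Q-learning over (state, subgoal) pairs."""

    def __init__(self, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
        self.q = {}
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def values(self, state, subgoal):
        # One value per primitive action, created lazily at zero.
        return self.q.setdefault((state, subgoal), [0.0, 0.0])

    def act(self, state, subgoal, greedy=False):
        # Epsilon-greedy during training, purely greedy during evaluation.
        if not greedy and self.rng.random() < self.epsilon:
            return self.rng.randrange(len(ACTIONS))
        vals = self.values(state, subgoal)
        return max(range(len(ACTIONS)), key=vals.__getitem__)

    def update(self, s, g, a, r, s2, done):
        # Standard one-step Q-learning target, conditioned on the subgoal g.
        target = r if done else r + self.gamma * max(self.values(s2, g))
        vals = self.values(s, g)
        vals[a] += self.alpha * (target - vals[a])


def run_episode(ctrl, goal=GRID - 1, max_steps=200, greedy=False):
    state, steps = 0, 0
    while state != goal and steps < max_steps:
        subgoal = propose_subgoal(state, goal)  # slow, infrequent call
        while state != subgoal and steps < max_steps:  # fast inner loop
            a = ctrl.act(state, subgoal, greedy)
            nxt = min(max(state + ACTIONS[a], 0), GRID - 1)
            reached = nxt == subgoal
            reward = 1.0 if reached else -0.01  # intrinsic subgoal reward
            if not greedy:
                ctrl.update(state, subgoal, a, reward, nxt, reached)
            state, steps = nxt, steps + 1
    return state == goal, steps


def train(episodes=300, seed=0):
    ctrl = LowLevelController(seed=seed)
    for _ in range(episodes):
        run_episode(ctrl)
    return ctrl
```

Calling `run_episode(train(), greedy=True)` rolls out the learned low-level policy under the stub planner. The point of the sketch is the division of labor: the upper level is consulted once per subgoal, while the lower level acts at every step and is rewarded only for reaching the current subgoal, so the cost of the deliberative call is amortized over many primitive actions.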

