2015年04月30日 17:03 GMT+02:00 Ole Laursen <address@hidden>: > I'm seeing what looks to me like quadratic behaviour on pasting long > lines in readline applications like bash, python, psql. > > To reproduce: open a new terminal with say bash, make a really long > line (> 20k characters, e.g. with python -c 'print "a" * 20000'), then > paste it into the terminal. The first characters are inserted quickly, > but it gradually slows down. With enough data it ends up being so slow > as to be unusable. I did a profile now on examples/fileman with perf: https://perf.wiki.kernel.org/index.php/Tutorial#Sampling_with_perf_record Here are the results from pasting 5000 x "aaa " with my default locale da_DK.UTF-8. As is evident most of the time is spent measuring the widths of the a's: 21,22% fileman libc-2.19.so [.] __gconv_transform_utf8_internal 20,60% fileman libc-2.19.so [.] wcwidth 19,80% fileman libc-2.19.so [.] __mbrtowc 15,45% fileman fileman [.] rl_redisplay 12,44% fileman libc-2.19.so [.] __ctype_get_mb_cur_max 2,74% fileman libc-2.19.so [.] __x86.get_pc_thunk.bx 2,63% fileman libc-2.19.so [.] _dl_mcount_wrapper_check 1,42% fileman libc-2.19.so [.] __x86.get_pc_thunk.cx 1,34% fileman fileman [.] _rl_find_prev_mbchar 1,00% fileman fileman [.] 0x000014c0 0,57% fileman fileman [.] 0x00001520 0,12% fileman fileman [.] _rl_col_width 0,08% fileman libc-2.19.so [.] __memcmp_sse4_2 0,05% fileman fileman [.] update_line 0,03% fileman [kernel.kallsyms] [k] _raw_spin_lock_irqsave 0,02% fileman [vdso] [.] __kernel_vsyscall 0,02% fileman libc-2.19.so [.] __memset_sse2_rep 0,01% fileman fileman [.] address@hidden 0,01% fileman [kernel.kallsyms] [k] enqueue_task_fair 0,01% fileman [i915] [k] gen6_read32 0,01% fileman [kernel.kallsyms] [k] mutex_unlock 0,01% fileman [kernel.kallsyms] [k] native_write_msr_safe 0,01% fileman libc-2.19.so [.] _IO_file_sync@@GLIBC_2.1 0,01% fileman libc-2.19.so [.] new_do_write 0,01% fileman [kernel.kallsyms] [k] resched_curr 0,01% fileman [kernel.kallsyms] [k] restore_nocheck 0,01% fileman [kernel.kallsyms] [k] insert_work 0,01% fileman [kernel.kallsyms] [k] timekeeping_update.constprop.7 0,01% fileman [kernel.kallsyms] [k] __queue_work 0,01% fileman [kernel.kallsyms] [k] __schedule 0,01% fileman [kernel.kallsyms] [k] n_tty_read 0,01% fileman [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore 0,01% fileman [kernel.kallsyms] [k] task_waking_fair 0,01% fileman [kernel.kallsyms] [k] tty_paranoia_check 0,01% fileman [kernel.kallsyms] [k] sysenter_past_esp 0,01% fileman [kernel.kallsyms] [k] vfs_read 0,01% fileman [kernel.kallsyms] [k] queue_work_on 0,01% fileman [kernel.kallsyms] [k] timerqueue_del 0,01% fileman libc-2.19.so [.] _IO_fwrite 0,01% fileman fileman [.] _rl_move_cursor_relative 0,01% fileman [kernel.kallsyms] [k] __copy_to_user_ll 0,01% fileman [kernel.kallsyms] [k] update_sd_lb_stats 0,01% fileman [kernel.kallsyms] [k] down_read 0,01% fileman [kernel.kallsyms] [k] wake_up_process 0,01% fileman [kernel.kallsyms] [k] native_sched_clock 0,01% fileman fileman [.] _rl_internal_char_cleanup 0,01% fileman [kernel.kallsyms] [k] security_file_permission 0,01% fileman fileman [.] _rl_dispatch_subseq 0,00% fileman [kernel.kallsyms] [k] cpuacct_charge 0,00% fileman [kernel.kallsyms] [k] memcpy 0,00% fileman [kernel.kallsyms] [k] apic_timer_interrupt 0,00% fileman libc-2.19.so [.] __GI___mempcpy 0,00% fileman fileman [.] rl_insert_text Compare this to a similar profile with LANG=C which is much faster: 61,71% fileman libc-2.19.so [.] __ctype_get_mb_cur_max 23,44% fileman fileman [.] rl_redisplay 5,01% fileman libc-2.19.so [.] __x86.get_pc_thunk.cx 4,87% fileman fileman [.] update_line 3,35% fileman fileman [.] 0x000014c0 0,07% fileman [kernel.kallsyms] [k] enqueue_task_fair 0,06% fileman [vdso] [.] __kernel_vsyscall 0,06% fileman [kernel.kallsyms] [k] _raw_spin_lock_irqsave 0,05% fileman [kernel.kallsyms] [k] insert_work 0,05% fileman [kernel.kallsyms] [k] native_write_msr_safe 0,04% fileman [kernel.kallsyms] [k] restore_nocheck 0,04% fileman [kernel.kallsyms] [k] sysenter_past_esp 0,03% fileman fileman [.] _rl_dispatch_subseq 0,03% fileman [kernel.kallsyms] [k] fsnotify 0,03% fileman [kernel.kallsyms] [k] n_tty_write 0,03% fileman [kernel.kallsyms] [k] down_read 0,03% fileman [kernel.kallsyms] [k] _raw_spin_lock 0,03% fileman [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore 0,03% fileman libc-2.19.so [.] new_do_write 0,03% fileman fileman [.] readline_internal_char 0,03% fileman [kernel.kallsyms] [k] atomic64_read_cx8 0,02% fileman [kernel.kallsyms] [k] wake_up_worker 0,02% fileman [kernel.kallsyms] [k] __queue_work 0,02% fileman fileman [.] _rl_internal_char_cleanup 0,02% fileman [kernel.kallsyms] [k] rb_erase 0,02% fileman [kernel.kallsyms] [k] cpuacct_charge 0,02% fileman [kernel.kallsyms] [k] copy_from_read_buf 0,02% fileman [kernel.kallsyms] [k] apic_timer_interrupt 0,02% fileman [kernel.kallsyms] [k] update_curr 0,02% fileman fileman [.] rl_insert_text I'm assuming that readline does a redisplay on each inserted char? In this redisplay, it goes through all characters on the line inserted so far, so the end result is O(n^2). In a multi-byte case, examining each character is obviously much more expensive. Notice that in even with LANG=C, __ctype_get_mb_cur_max accounts for 61.7% of the time. As far as I can tell, this comes from MB_CUR_MAX which despite the naming is not a constant: http://man7.org/linux/man-pages/man3/MB_CUR_MAX.3.html As a quickfix for that, I'm attaching a patch to only compute MB_CUR_MAX once in the top of the function. With this patch, the LANG=C profile is noticeable faster and looks like this 87,55% fileman fileman [.] rl_redisplay 9,54% fileman fileman [.] update_line 0,41% fileman libc-2.19.so [.] __ctype_get_mb_cur_max 0,12% fileman [vdso] [.] __kernel_vsyscall 0,09% fileman [kernel.kallsyms] [k] sysenter_past_esp 0,07% fileman [kernel.kallsyms] [k] n_tty_read 0,06% fileman [kernel.kallsyms] [k] try_to_wake_up 0,06% fileman [kernel.kallsyms] [k] restore_nocheck 0,06% fileman [kernel.kallsyms] [k] resched_curr 0,06% fileman [kernel.kallsyms] [k] _raw_spin_lock_irqsave 0,06% fileman [kernel.kallsyms] [k] atomic64_read_cx8 0,05% fileman [kernel.kallsyms] [k] tty_paranoia_check 0,05% fileman fileman [.] rl_insert_text This helps a bit with the UTF 8 case, but not nearly enough. I suppose one could try to make the loop content faster somehow, but to really fix this to not be O(n^2) one should only do the work necessary to take the single added character into account instead of starting off from scratch each time. Or perhaps try to tackle this case of copy-pasting a bucket-load of characters by batching them? What do you think? Ole
Attachment:
mb-cur-max-fix.patch
Description: Binary data