
Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly?

Category: C++

The Collatz conjecture, a seemingly simple mathematical problem, has baffled mathematicians for decades. It posits that repeatedly applying a specific rule to any positive integer will eventually lead to the number 1. While the conjecture remains unproven, testing it for vast ranges of numbers provides valuable insights. Surprisingly, C++ code often outperforms hand-written assembly for these tests. This seemingly counterintuitive result reveals a lot about modern compilers, optimization strategies, and the complexities of performance tuning.

Compiler Optimizations: The Unsung Heroes

Modern C++ compilers are extremely sophisticated. They employ a wide array of optimization techniques that often surpass the capabilities of even expert assembly programmers. These optimizations include instruction reordering, register allocation, loop unrolling, and vectorization. A compiler can analyze the entire codebase and make global optimizations that a human writing assembly might miss.

For instance, when dealing with the Collatz conjecture, a compiler can identify repetitive calculations and optimize them for speed. It can also leverage the power of SIMD (Single Instruction, Multiple Data) instructions to process multiple numbers concurrently, significantly accelerating the testing process.

Moreover, compilers are constantly being updated and improved. They benefit from years of research and development, incorporating algorithms and techniques at the forefront of computer science. This steady improvement makes them formidable tools for performance optimization.

Abstraction and Maintainability

C++ offers a higher level of abstraction than assembly language. This allows developers to focus on the algorithm's logic rather than the intricacies of machine instructions. The improved readability leads to faster development times and easier maintenance, which matters for complex projects like testing the Collatz conjecture for large numbers.

Imagine trying to debug a complex algorithm written in assembly. Tracing through registers and memory addresses can be a nightmare. C++ debuggers, on the other hand, offer a much friendlier environment, simplifying development and reducing the likelihood of errors.

This higher level of abstraction also allows for code reusability and portability. C++ code can be easily adapted to different architectures, while assembly code often requires significant rewriting.

The Human Factor: Limitations and Errors

Hand-written assembly, while potentially powerful, is prone to human error. Even experienced assembly programmers can make mistakes that hurt performance. Optimizing assembly code requires an intimate understanding of the specific CPU architecture, including its pipeline, caching mechanisms, and instruction latencies.

Consider the complexity of managing registers, memory access, and branch prediction manually. A small oversight can lead to significant performance bottlenecks. Compilers, however, are designed to handle these complexities automatically, reducing the risk of human-induced errors.

Furthermore, maintaining hand-optimized assembly code is a challenging task. As hardware evolves, the code must be re-tuned and retested, a process that is time-consuming and error-prone.

Case Study: Distributed Collatz Testing

Several projects have leveraged distributed computing to test the Collatz conjecture for extremely large numbers. These projects often use C++ for its performance and portability. By distributing the workload across multiple machines, researchers can explore vast numerical ranges, pushing the boundaries of our understanding of the conjecture.

One such project uses a modified version of the BOINC platform to harness the power of volunteers' computers. The project's codebase, written chiefly in C++, allows for efficient parallelization and communication between the distributed nodes.

  • C++ allows for efficient parallelization, crucial for large-scale computations.
  • Distributed computing drastically expands the range of numbers that can be tested for the Collatz conjecture.

The success of these distributed projects highlights the practicality and efficiency of C++ for computationally intensive tasks. While assembly might offer potential performance gains in specific scenarios, the benefits of C++'s higher-level abstractions and compiler optimizations often outweigh the effort required for hand-optimization.

Infographic Placeholder: C++ vs. Assembly for Collatz Testing

[Infographic comparing performance, development time, and maintainability of C++ and assembly for Collatz conjecture testing]

Frequently Asked Questions (FAQ)

Q: Can assembly ever outperform C++ for the Collatz conjecture?

A: In highly specialized scenarios, and with meticulous hand-optimization, assembly might achieve marginal performance gains. However, the complexity and maintenance burden often make C++ the more practical choice. If you do attempt hand-optimization:

  1. Analyze the specific algorithm.
  2. Profile the C++ code to identify bottlenecks.
  3. Carefully optimize the assembly code for the target architecture.

Considering the rapid advancements in compiler technology, the performance gap between C++ and assembly continues to shrink.

The efficiency of C++ in testing the Collatz conjecture stems from a combination of powerful compiler optimizations, high-level abstractions, and the inherent difficulty of hand-written assembly. While assembly offers potential advantages in specific situations, the practicality, maintainability, and evolving power of C++ compilers make it a compelling choice for complex computational tasks. Explore further by delving into compiler design and advanced optimization techniques. Learn more about the Collatz conjecture itself and the intricacies of compiler optimization. For insights into performance profiling, consult resources like the Wikipedia article on profiling. By understanding these underlying ideas, you can make informed decisions about choosing the right tools for your next computationally intensive project.

  • Compiler optimization is a constantly evolving field.
  • The best choice of language depends on the specific project requirements.

Question & Answer:
I wrote these two solutions for Project Euler Q14, in assembly and in C++. They implement the same brute-force approach for testing the Collatz conjecture. The assembly solution was assembled with:

nasm -felf64 p14.asm && gcc p14.o -o p14 

The C++ was compiled with:

g++ p14.cpp -o p14 

Assembly, p14.asm:

section .data
    fmt db "%d", 10, 0

global main
extern printf

section .text

main:
    mov rcx, 1000000
    xor rdi, rdi        ; max i
    xor rsi, rsi        ; i

l1:
    dec rcx
    xor r10, r10        ; count
    mov rax, rcx

l2:
    test rax, 1
    jpe even

    mov rbx, 3
    mul rbx
    inc rax
    jmp c1

even:
    mov rbx, 2
    xor rdx, rdx
    div rbx

c1:
    inc r10
    cmp rax, 1
    jne l2

    cmp rdi, r10
    cmovl rdi, r10
    cmovl rsi, rcx

    cmp rcx, 2
    jne l1

    mov rdi, fmt
    xor rax, rax
    call printf
    ret

C++, p14.cpp:

#include <iostream>

int sequence(long n) {
    int count = 1;
    while (n != 1) {
        if (n % 2 == 0)
            n /= 2;
        else
            n = 3*n + 1;
        ++count;
    }
    return count;
}

int main() {
    int max = 0, maxi;
    for (int i = 999999; i > 0; --i) {
        int s = sequence(i);
        if (s > max) {
            max = s;
            maxi = i;
        }
    }
    std::cout << maxi << std::endl;
}

I know about compiler optimizations to improve speed and all that, but I don't see many ways to further optimize my assembly solution (speaking programmatically, not mathematically).

The C++ code uses modulus every term and division every other term, while the assembly code only uses a single division every other term.

But the assembly is taking on average one second longer than the C++ solution. Why is this? I am asking mainly out of curiosity.

Execution times

My system: 64-bit Linux on a 1.4 GHz Intel Celeron 2955U (Haswell microarchitecture).

If you think a 64-bit DIV instruction is a good way to divide by two, then no wonder the compiler's asm output beat your hand-written code, even with -O0 (compile fast, no extra optimization, and store/reload to memory after/before every C statement so a debugger can modify variables).

See Agner Fog's Optimizing Assembly guide to learn how to write efficient asm. He also has instruction tables and a microarch guide for specific details on specific CPUs. See also the x86 tag wiki for more perf links.

See also this more general question about beating the compiler with hand-written asm: Is inline assembly language slower than native C++ code?. TL:DR: yes, if you do it wrong (like this question).

Usually you're fine letting the compiler do its thing, especially if you try to write C++ that can compile efficiently. Also see Is assembly faster than compiled languages?. One of the answers links to some neat slides showing how various C compilers optimize some really simple functions with cool tricks. Matt Godbolt's CppCon2017 talk "What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid" is in a similar vein.


even:
    mov rbx, 2
    xor rdx, rdx
    div rbx

On Intel Haswell, div r64 is 36 uops, with a latency of 32-96 cycles, and a throughput of one per 21-74 cycles. (Plus the 2 uops to set up RBX and zero RDX, but out-of-order execution can run those early.) High-uop-count instructions like DIV are microcoded, which can also cause front-end bottlenecks. In this case, latency is the most relevant factor because it's part of a loop-carried dependency chain.

shr rax, 1 does the same unsigned division: it's 1 uop, with 1c latency, and can run 2 per clock cycle.

For comparison, 32-bit division is faster, but still horrible vs. shifts. idiv r32 is 9 uops, 22-29c latency, and one per 8-11c throughput on Haswell.


As you can see from looking at gcc's -O0 asm output (Godbolt compiler explorer), it only uses shift instructions. clang -O0 does compile naively like you thought, even using 64-bit IDIV twice. (When optimizing, compilers do use both outputs of IDIV when the source does a division and modulus with the same operands, if they use IDIV at all.)

GCC doesn't have a totally-naive mode; it always transforms through GIMPLE, which means some "optimizations" can't be disabled. This includes recognizing division-by-constant and using shifts (power of 2) or a fixed-point multiplicative inverse (non power of 2) to avoid IDIV (see div_by_13 in the above Godbolt link).

gcc -Os (optimize for size) does use IDIV for non-power-of-2 division, unfortunately even in cases where the multiplicative-inverse code is only slightly bigger but much faster.
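To make that division-by-constant transform concrete, here is a minimal C++ sketch (the function name div_by_13 just mirrors the example mentioned in the Godbolt link above; the exact instruction sequence the compiler picks varies by version and target):

    // With optimization enabled, gcc/clang replace the divide with a multiply
    // by a fixed-point "magic" constant plus shifts of the high half of the
    // product -- no div/idiv instruction is emitted.
    unsigned div_by_13(unsigned x) {
        return x / 13;
    }

The same transform applies to any compile-time-constant divisor that isn't a power of two; power-of-two divisors become plain shifts.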


Helping the compiler

(Summary for this case: use uint64_t n.)

First of all, it's only interesting to look at optimized compiler output (-O3). -O0 speed is basically meaningless.

Look at your asm output (on Godbolt, or see How to remove "noise" from GCC/clang assembly output?). When the compiler doesn't make optimal code in the first place: writing your C/C++ source in a way that guides the compiler into making better code is usually the best approach. You have to know asm, and know what's efficient, but you apply this knowledge indirectly. Compilers are also a good source of ideas: sometimes clang will do something cool, and you can hand-hold gcc into doing the same thing: see this answer and what I did with the non-unrolled loop in @Veedrac's code below.

This approach is portable, and in 20 years some future compiler can compile it to whatever is efficient on future hardware (x86 or not), maybe using a new ISA extension or auto-vectorizing. Hand-written x86-64 asm from 15 years ago would usually not be optimally tuned for Skylake: e.g. compare-and-branch macro-fusion didn't exist back then. What's optimal now for hand-crafted asm for one microarchitecture might not be optimal for other current and future CPUs. Comments on @johnfound's answer discuss major differences between AMD Bulldozer and Intel Haswell, which have a big effect on this code. But in theory, g++ -O3 -march=bdver3 and g++ -O3 -march=skylake will do the right thing. (Or -march=native.) Or -mtune=... to just tune, without using instructions that other CPUs might not support.

My feeling is that guiding the compiler to asm that's good for a current CPU you care about shouldn't be a problem for future compilers. They're hopefully better than current compilers at finding ways to transform code, and can find a way that works for future CPUs. Regardless, future x86 probably won't be terrible at anything that's good on current x86, and the future compiler will avoid any asm-specific pitfalls while implementing something like the data movement from your C source, if it doesn't see something better.

Hand-written asm is a black box for the optimizer, so constant-propagation doesn't work when inlining makes an input a compile-time constant. Other optimizations are also affected. Read https://gcc.gnu.org/wiki/DontUseInlineAsm before using asm. (And avoid MSVC-style inline asm: inputs/outputs have to go through memory, which adds overhead.)

In this case: your n has a signed type, and gcc uses the SAR/SHR/ADD sequence that gives the correct rounding. (IDIV and arithmetic shifts "round" differently for negative inputs; see the SAR insn set ref manual entry.) (IDK if gcc tried and failed to prove that n can't be negative, or what. Signed overflow is undefined behaviour, so it should have been able to.)

You should have used uint64_t n, so it can just SHR. And then it's also portable to systems where long is only 32-bit (e.g. x86-64 Windows).
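A minimal sketch of that change, keeping the rest of the question's code the same (the signed version needs the extra fix-up instructions because C++ division truncates toward zero, e.g. -7 / 2 == -3, while an arithmetic shift floors, (-7) >> 1 == -4; with an unsigned type a single shift is exact):

    #include <cstdint>

    int sequence(uint64_t n) {   // unsigned, so n / 2 can compile to a single SHR
        int count = 1;
        while (n != 1) {
            if (n % 2 == 0)
                n /= 2;
            else
                n = 3*n + 1;
            ++count;
        }
        return count;
    }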


BTW, gcc's optimized asm output looks pretty good (using unsigned long n): the inner loop it inlines into main() does this:

 # from gcc5.4 -O3  plus my comments

 # edx= count=1
 # rax= uint64_t n

.L9:                            # do{
    lea    rcx, [rax+1+rax*2]   # rcx = 3*n + 1
    mov    rdi, rax
    shr    rdi                  # rdi = n>>1;
    test   al, 1                # set flags based on n%2 (aka n&1)
    mov    rax, rcx
    cmove  rax, rdi             # n= (n%2) ? 3*n+1 : n/2;
    add    edx, 1               # ++count;
    cmp    rax, 1
    jne   .L9                   #}while(n!=1)

  cmp/branch to update max and maxi, and then do the next n

The inner loop is branchless, and the critical path of the loop-carried dependency chain is:

  • 3-component LEA (3 cycles on Intel before Ice Lake)
  • cmov (2 cycles on Haswell, 1c on Broadwell or later).

Total: 5 cycles per iteration, latency bottleneck. Out-of-order execution takes care of everything else in parallel with this (in theory: I haven't tested with perf counters to see if it really runs at 5c/iter).

The FLAGS input of cmov (produced by TEST) is faster to produce than the RAX input (from LEA->MOV), so it's not on the critical path.

Similarly, the MOV->SHR that produces CMOV's RDI input is off the critical path, because it's also faster than the LEA. MOV on IvyBridge and later has zero latency (handled at register-rename time). (It still takes a uop, and a slot in the pipeline, so it's not free, just zero latency.) The extra MOV in the LEA dep chain is part of the bottleneck on other CPUs.

The cmp/jne is also not part of the critical path: it's not loop-carried, because control dependencies are handled with branch prediction + speculative execution, unlike data dependencies on the critical path.


Beating the compiler

GCC did a pretty good job here. It could save one code byte by using inc edx instead of add edx, 1, because nobody cares about P4 and its false dependencies for partial-flag-modifying instructions.

It could also save all the MOV instructions, and the TEST: SHR sets CF = the bit shifted out, so we can use cmovc instead of test / cmovz.

 ### Hand-optimized version of what gcc does
.L9:                           #do{
    lea     rcx, [rax+1+rax*2] # rcx = 3*n + 1
    shr     rax, 1             # n>>=1;    CF = n&1 = n%2
    cmovc   rax, rcx           # n= (n&1) ? 3*n+1 : n/2;
    inc     edx                # ++count;
    cmp     rax, 1
    jne     .L9                #}while(n!=1)

See @johnfound's answer for another clever trick: remove the CMP by branching on SHR's flag result as well as using it for CMOV: zero only if n was 1 (or 0) to start with. (SHR with count != 1 on Nehalem or earlier causes a stall if you read the flag results. That's how they made it single-uop. The shift-by-1 special encoding is fine, though.)

Avoiding MOV doesn't help with the latency at all on Haswell (Can x86's MOV really be "free"? Why can't I reproduce this at all?). It does help significantly on CPUs like Intel pre-IvB, and AMD Bulldozer-family, where MOV is not zero-latency (and Ice Lake with updated microcode). The compiler's wasted MOV instructions do affect the critical path. BD's complex-LEA and CMOV are both lower latency (2c and 1c respectively), so it's a bigger fraction of the latency. Also, throughput bottlenecks become an issue on BD because it only has two integer ALU pipes. See @johnfound's answer, where he has timing results from an AMD CPU.

Even on Haswell, this version may help a bit by avoiding some occasional delays where a non-critical uop steals an execution port from one on the critical path, delaying execution by 1 cycle. (This is called a resource conflict.) It also saves a register, which may help when doing multiple n values in parallel in an interleaved loop (see below).

LEA's latency depends on the addressing mode, on Intel SnB-family CPUs before Ice Lake: 3c for 3 components ([base+idx+const], which takes two separate adds), but only 1c with 2 or fewer components (one add). Some CPUs (like Core2) do even a 3-component LEA in a single cycle. Worse, SnB-family standardizes latencies: there are no 2c uops, otherwise a 3-component LEA would be only 2c like Bulldozer. (3-component LEA is slower on AMD too, just not by as much.)

Ice Lake improved the LEA execution units to 1c latency for all addressing modes, and 4/clock throughput except with a scaled index (then 2/clock). Alder Lake / Sapphire Rapids has 2c latency for shifted-index. (https://uops.info/). Zen 3 and later run 3-component LEAs as 2 uops.

So lea rcx, [rax + rax*2] / inc rcx is only 2c latency, faster than lea rcx, [rax + rax*2 + 1] on Intel before Ice Lake. Break-even on BD and Alder Lake, and worse on Core2 and Ice Lake. It costs an extra uop, which often isn't worth it to save 1c of latency, but latency is the major bottleneck here and HSW has a wide pipeline.

Neither GCC, ICC, nor Clang (on Godbolt) used SHR's CF output, always using an AND or TEST. Silly compilers. :P They're great pieces of complex machinery, but a clever human can often beat them on small-scale problems. (Given thousands to millions of times longer to think about it, of course! Compilers don't use exhaustive algorithms to search for every possible way to do things; that would take too long when optimizing a lot of inlined code, which is what they do best. They also don't model the pipeline of the target uarch, not in the same detail as IACA or especially https://uica.uops.info/; they just use some heuristics.)


Simple loop unrolling won't help; this loop bottlenecks on the latency of a loop-carried dependency chain, not on loop overhead / throughput. That means it would do well with hyperthreading (or any other kind of SMT), since the CPU has lots of time to interleave instructions from two threads. This would mean parallelizing the loop in main, but that's fine because each thread can just check a range of n values and produce a pair of integers as a result.
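A minimal sketch of that thread-level parallelism, assuming two threads and a fixed split point (sequence() is just the question's function; the helper names and the reduction in main() are illustrative, not code from the answer):

    #include <cstdint>
    #include <iostream>
    #include <thread>
    #include <utility>

    static int sequence(uint64_t n) {                 // same logic as the question's C++
        int count = 1;
        while (n != 1) {
            n = (n % 2 == 0) ? n / 2 : 3*n + 1;
            ++count;
        }
        return count;
    }

    // Each thread scans its own range of starting values and reports its local best.
    static std::pair<int, uint64_t> best_in_range(uint64_t lo, uint64_t hi) {
        int max = 0;
        uint64_t maxi = lo;
        for (uint64_t i = lo; i < hi; ++i) {
            int s = sequence(i);                      // independent latency-bound chain per i
            if (s > max) { max = s; maxi = i; }
        }
        return {max, maxi};
    }

    int main() {
        std::pair<int, uint64_t> a, b;
        std::thread t1([&] { a = best_in_range(1, 500000); });
        std::thread t2([&] { b = best_in_range(500000, 1000000); });
        t1.join();
        t2.join();
        std::cout << ((a.first >= b.first) ? a.second : b.second) << '\n';
    }

Because each chain is latency-bound, the two hardware threads overlap easily, and the per-thread results reduce to a single (max, maxi) pair at the end.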

Interleaving by hand within a single thread might be viable, too. Maybe compute the sequence for a pair of numbers in parallel, since each one only takes a couple of registers, and they can all update the same max / maxi. This creates more instruction-level parallelism.

The trick is deciding whether to wait until all the n values have reached 1 before getting another pair of starting n values, or whether to break out and get a new starting point for just the one that reached the end condition, without touching the registers for the other sequence. Probably it's best to keep each chain working on useful data, otherwise you'd have to conditionally increment its counter.
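A sketch of the single-thread interleaving idea, using the simpler of the two bookkeeping options described above (wait until both chains are finished before taking the next pair of starting values); this is an illustration of the concept rather than the author's code, and a tuned version would make the per-chain updates branchless:

    #include <cstdint>

    void best_pairwise(uint64_t limit, int& max, uint64_t& maxi) {
        for (uint64_t i = 1; i + 1 <= limit; i += 2) {
            uint64_t a = i, b = i + 1;                // two independent dependency chains
            int ca = 1, cb = 1;
            while (a != 1 || b != 1) {                // out-of-order exec overlaps the two chains
                if (a != 1) { a = (a % 2) ? 3*a + 1 : a / 2; ++ca; }
                if (b != 1) { b = (b % 2) ? 3*b + 1 : b / 2; ++cb; }
            }
            if (ca > max) { max = ca; maxi = i; }
            if (cb > max) { max = cb; maxi = i + 1; }
        }
    }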


You could maybe even do this with SSE packed-compare stuff to conditionally increment the counter for vector elements where n hadn't reached 1 yet. And then to hide the even longer latency of a SIMD conditional-increment implementation, you'd need to keep more vectors of n values up in the air. Maybe only worth it with a 256b vector (4x uint64_t).

I think the best strategy to make detection of a 1 "sticky" is to mask the vector of all-ones that you add to increment the counter. So after you've seen a 1 in an element, the increment vector will have a zero, and +=0 is a no-op.

Untested idea for manual vectorization

# starting with YMM0 = [ n_d, n_c, n_b, n_a ]  (64-bit elements)
# ymm4 = _mm256_set1_epi64x(1):  increment vector
# ymm5 = all-zeros:  count vector

.inner_loop:
    vpaddq    ymm1, ymm0, xmm0
    vpaddq    ymm1, ymm1, xmm0
    vpaddq    ymm1, ymm1, set1_epi64(1)     # ymm1 = 3*n + 1.  Maybe could do this more efficiently?

    vpsllq    ymm3, ymm0, 63                # shift bit 1 to the sign bit
    vpsrlq    ymm0, ymm0, 1                 # n /= 2
    # FP blend between integer insns may cost extra bypass latency, but integer blends don't have 1 bit controlling a whole qword.
    vpblendvpd ymm0, ymm0, ymm1, ymm3       # variable blend controlled by the sign bit of each 64-bit element.  I might have the source operands backwards, I always have to look this up.

    # ymm0 = updated n in each element.

    vpcmpeqq  ymm1, ymm0, set1_epi64(1)
    vpandn    ymm4, ymm1, ymm4              # zero out elements of ymm4 where the compare was true

    vpaddq    ymm5, ymm5, ymm4              # count++ in elements where n has never been == 1

    vptest    ymm4, ymm4
    jnz  .inner_loop
    # Fall through when all the n values have reached 1 at some point, and our increment vector is all-zero

    vextracti128 ymm0, ymm5, 1
    vpmaxuq .... oops, requires AVX-512
    # hold off on doing a horizontal max until the very end.  But you need some way to record max and maxi.

You can and should implement this with intrinsics instead of hand-written asm.
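A hedged AVX2 intrinsics sketch of the untested loop above (it mirrors the asm: blend between n/2 and 3n+1 on the low bit, and keep counting only in lanes that have never hit 1; like the asm sketch it leaves out the max/maxi bookkeeping, and the function name and interface are assumptions):

    #include <cstdint>
    #include <immintrin.h>

    // Run four Collatz chains in the lanes of one YMM register; store the
    // per-lane step counts once every lane has reached 1 at least once.
    void collatz_counts_avx2(__m256i n, uint64_t counts_out[4]) {
        const __m256i one = _mm256_set1_epi64x(1);
        __m256i inc   = one;                           // 1 in live lanes, 0 once a lane hits 1
        __m256i count = _mm256_setzero_si256();

        for (;;) {
            // 3*n + 1, computed as (n + n) + (n + 1)
            __m256i odd  = _mm256_add_epi64(_mm256_add_epi64(n, n), _mm256_add_epi64(n, one));
            __m256i sel  = _mm256_slli_epi64(n, 63);   // move bit 0 into the sign bit
            __m256i even = _mm256_srli_epi64(n, 1);    // n / 2
            n = _mm256_castpd_si256(_mm256_blendv_pd(  // per-lane: odd ? 3n+1 : n/2
                    _mm256_castsi256_pd(even),
                    _mm256_castsi256_pd(odd),
                    _mm256_castsi256_pd(sel)));

            __m256i done = _mm256_cmpeq_epi64(n, one); // lanes that just reached 1
            inc   = _mm256_andnot_si256(done, inc);    // sticky: zero their increment from now on
            count = _mm256_add_epi64(count, inc);      // count++ only in still-live lanes

            if (_mm256_testz_si256(inc, inc))          // all lanes finished
                break;
        }
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(counts_out), count);
    }

This needs -mavx2; as with the asm version, whether it wins depends on hiding the longer SIMD latency by keeping several such vectors in flight.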


Algorithmic / implementation improvements:

Besides just implementing the same logic with more efficient asm, look for ways to simplify the logic, or avoid redundant work. e.g. memoize to detect common endings to sequences. Or even better, look at 8 trailing bits at once (gnasher's answer).
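For the memoization idea, a small hedged sketch (the table size, recursion, and names are illustrative assumptions; the point is just that every sequence quickly falls into already-computed territory):

    #include <cstdint>
    #include <vector>

    // Length of the Collatz sequence starting at n, caching results for small n.
    int collatz_len(uint64_t n, std::vector<int>& memo) {
        if (n < memo.size() && memo[n] != 0)
            return memo[n];                            // common ending already known
        uint64_t next = (n % 2 == 0) ? n / 2 : 3*n + 1;
        int len = 1 + collatz_len(next, memo);
        if (n < memo.size())
            memo[n] = len;
        return len;
    }

    // usage: std::vector<int> memo(1000000, 0); memo[1] = 1;
    //        then call collatz_len(i, memo) for each starting value i.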

@EOF points out that tzcnt (or bsf) could be used to do multiple n/=2 iterations in one step. To vectorize that efficiently, we'd probably need AVX-512 vplzcntq after isolating the lowest set bit (v &= -v). Or just do multiple scalar n values in parallel in different integer regs.

The loop might look like this:

goto loop_entry;  // C++ structured like the asm, for illustration only
do {
    n = n*3 + 1;
  loop_entry:
    shift = _tzcnt_u64(n);
    n >>= shift;
    count += shift;
} while(n != 1);

This may do significantly fewer iterations, but variable-count shifts are slow on Intel SnB-family CPUs without BMI2: 3 uops, 2c latency for FLAGS, although only 1c for the actual data. (count=0 means the flags are unmodified. They handle this as a data dependency, and it takes multiple uops because a uop can only have 2 inputs, pre-HSW/BDW anyway.) This is the kind of thing people complaining about x86's crazy-CISC design are referring to. It makes x86 CPUs slower than they would be if the ISA had been designed from scratch today, even in a mostly similar way. (i.e. this is part of the "x86 tax" in speed / power cost.) BMI2 SHRX/SHLX/SARX are 1 uop / 1c latency.

It also puts tzcnt (3c on Haswell and later) on the critical path, so it significantly lengthens the total latency of the loop-carried dependency chain. It does remove any need for a CMOV, or for preparing a register holding n>>1, though. @Veedrac's answer overcomes all this by deferring the tzcnt/shift for multiple iterations, which is highly effective (see below).

We can safely use BSF or TZCNT interchangeably, because n can never be zero at that point. TZCNT's machine code decodes as BSF on CPUs that don't support BMI1. (Meaningless prefixes are ignored, so REP BSF runs as BSF.)

TZCNT performs much better than BSF on AMD CPUs that support it, so it can be a good idea to use REP BSF, even if you don't care about setting ZF if the input is zero rather than the output. Some compilers do this when you use __builtin_ctzll even with -mno-bmi.
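In C++ the same step can be written portably with that builtin; a minimal sketch (safe only because n is never zero at this point, as noted above):

    #include <cstdint>

    // Strip all trailing zero bits from n in one step and credit them to count.
    static inline void strip_trailing_zeros(uint64_t& n, int& count) {
        int shift = __builtin_ctzll(n);   // tzcnt with -mbmi; some compilers emit rep bsf otherwise
        n >>= shift;
        count += shift;
    }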

They perform the same on Intel CPUs, so just save the byte if that's all that matters. TZCNT on Intel (pre-Skylake) still has a false dependency on the supposedly write-only output operand, just like BSF, to support the undocumented behaviour that BSF with input = 0 leaves its destination unmodified. So you need to work around that unless optimizing only for Skylake, so there's nothing to gain from the extra REP byte. (Intel often goes above and beyond what the x86 ISA manual requires, to avoid breaking widely-used code that depends on something it shouldn't, or that is retroactively disallowed. e.g. Windows 9x assumes no speculative prefetching of TLB entries, which was safe when the code was written, before Intel updated the TLB management rules.)

Anyway, LZCNT/TZCNT on Haswell have the same false dep as POPCNT: see this Q&A. This is why in gcc's asm output for @Veedrac's code, you see it breaking the dep chain with xor-zeroing on the register it's about to use as TZCNT's destination when it doesn't use dst=src. Since TZCNT/LZCNT/POPCNT never leave their destination undefined or unmodified, this false dependency on the output on Intel CPUs is a performance bug / limitation. Presumably it's worth some transistors / power to have them behave like other uops that go to the same execution unit. The only perf upside is interaction with another uarch limitation: they can micro-fuse a memory operand with an indexed addressing mode on Haswell, but on Skylake, where Intel removed the false dep for LZCNT/TZCNT, they "un-laminate" indexed addressing modes while POPCNT can still micro-fuse any addr mode.


Improvements to ideas / code from other answers:

@hidefromkgb's answer has a nice observation that you're guaranteed to be able to do one right shift after a 3n+1. You can compute this even more efficiently than just leaving out the checks between steps. The asm implementation in that answer is broken, though (it depends on OF, which is undefined after SHRD with a count > 1), and slow: ROR rdi,2 is faster than SHRD rdi,rdi,2, and using two CMOV instructions on the critical path is slower than an extra TEST that can run in parallel.

I put tidied / improved C (which guides the compiler to produce better asm) and tested + working faster asm (in comments below the C) up on Godbolt: see the link in @hidefromkgb's answer. (This answer hit the 30k char limit from the large Godbolt URLs, but shortlinks can rot and were too long for goo.gl anyway.)

Also improved the output printing to convert to a string and make one write() instead of writing one char at a time. This minimizes the impact on timing the whole program with perf stat ./collatz (to record performance counters), and I de-obfuscated some of the non-critical asm.
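That output change might look roughly like the following sketch (a hedged illustration only; the actual Godbolt code and its exact output format are not reproduced here):

    #include <cstdint>
    #include <cstdio>
    #include <unistd.h>

    // Format the whole result into one buffer and issue a single write() call,
    // instead of writing one character at a time.
    void print_result(uint64_t maxi, int max) {
        char buf[64];
        int len = std::snprintf(buf, sizeof(buf), "%llu %d\n",
                                (unsigned long long)maxi, max);
        write(1, buf, len);
    }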


@Veedrac's code

I got a minor speedup from right-shifting as much as we know needs doing, and checking whether to continue the loop. From 7.5s for limit=1e8 down to 7.275s, on Core2Duo (Merom), with an unroll factor of 16.

Code + comments on Godbolt. Don't use this version with clang; it does something silly with the defer-loop. Using a tmp counter k and then adding it to count later changes what clang does, but that slightly hurts gcc.

See discussion in comments: Veedrac's code is excellent on CPUs with BMI1 (i.e. not Celeron/Pentium).