PENCIL

Session notes for PENCIL from the Spring 2025 deep learning reading group.


Key points for presentation:

  • Definition of the PENCIL method
  • For the experiments:
    • Explain the three problem types: SAT, QBF, and Einstein’s puzzle
    • Summarize experimental findings
    • How to train the model to follow PENCIL?
      • Data preparation? loss?
      • What’s the evaluation metric?
      • What’s the difference between training for PENCIL and training for CoT?
  • For the theory:
    • Definition of Turing machines and related background about computation theory
      • What’s Exponential Time Hypothesis?
    • Definition of autoregressive machine and state function
      • The mechanism of summarization
    • The FASP programming language
      • How to implement “simulation”, “summarization”, and “reduction trigger” with FASP?
      • Correspondence between FASP and transformer weights

Related materials:

  1. Talk by Zhiyuan: https://simons.berkeley.edu/talks/zhiyuan-li-ttic-2025-02-20
  2. Theoretical papers about expressiveness of transformers in terms of computation
    1. Attention is Turing-Complete
    2. The Expressive Power of Transformers with Chain of Thought
    3. On the Representational Capacity of Neural Language Models with Chain-of-Thought Reasoning
    4. Chain of Thought Empowers Transformers to Solve Inherently Serial Problems

Minor Issues and Questions

  1. (Zeyu) How to define “C”, “T”, “A” when multiple special tokens appear in the context? (Could ignore the confusing intermediate special tokens until a clear C [Call] T [Sep] A [Return] pattern emerges.)
    1. [Tianhao] The generation process always starts from scratch, so there won’t be such confusing cases? It’s basically a stack of such calls.
    2. [Zhi] I guess by construction they won’t be nested?
  2. (Zeyu) Do the SAT/QBF/Einstein’s problems have only one solution? (If not, does the model need to learn “reflection” to solve the entire puzzle, or does PENCIL alone suffice?)
    1. [Tianhao] I think there is only one solution.
  3. (Zhi) Not sure if trace rate is a good evaluation metric, since a crafted reasoning ground truth might not be the best thing to evaluate against. (p9)
    1. [Tianhao] +1. This makes sense when the crafted reasoning path is unavoidable, i.e. there are no shortcuts. I guess that’s why they consider these three types of problems. In the worst case, it requires traversing the whole reasoning graph.
  4. (Zeyu) How is the scaffolded CoT generated by DPLL and the mentioned procedure? It seems to me that standard DPLL just outputs the final answer. Does it mean the training dataset is human-annotated according to DPLL logic?
  5. (Binghua) I’m not sure if the proof in Section 5 is equivalent to the following statement: In a stack-based structure, the order of return operations equals the order of the total number of nodes in the hierarchical tree. Additionally, the size of the stack (or hierarchical tree) grows as order $2^T$ while the number of neighbors of the root is $S$.
  6. (Mengzhe) How to compute $d_i$ if $i \geq \text{max\_pos}(\sigma') - \text{min\_pos}(\sigma') + 2 \land p = \text{max\_pos}(\sigma') + 1$?
    • (Zeyu) Good question, actually the machine just stops at $\text{max\_pos}(\sigma') - \text{min\_pos}(\sigma') + 1$ in the case $p = \text{max\_pos}(\sigma') + 1$, since $n = \text{max\_pos}(\sigma') - \text{min\_pos}(\sigma') + 1$. Maybe the overall idea is to let the machine move right until $\text{max\_pos}(\sigma') - \text{min\_pos}(\sigma') + 1$, then move left to track $p$ if $p \leq \text{max\_pos}(\sigma') - 1$.
    • (Mengzhe) Thank you very much. I got it.
  7. (Mengzhe) This is the definition of local closed operators. How can one identify whether an operator is local or non-local? I am not sure about the meaning of “local”.
    • (Zeyu) The local operator definition is as follows:
    • I think you need to justify whether it could be expressed in the form of the local representation $\phi_\omega$. Therefore, I think maybe the operators induced by “aha” (or those in the form of a moving average) are not local.
  8. (Zeyu) Could we always find such $t_i$ to satisfy (86)?
  9. (Zeyu) Typo, should be $(t_i - t_{i-1}) + s \circ f_{\pi}^{t_{i-1}} \geq 2 s \circ f_{\pi}^{t_i} $
  10. (Zeyu) Shouldn’t current_sim_pos subtract the accumulated get_move?
    • Also, what does “exactly matches” mean?
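
As a reference point for items 1 and 10 above: a minimal Python sketch of the C [CALL] T [SEP] A [RETURN] → C A reduction, assuming exactly those three special tokens and an innermost-first (stack-like) reduction order. Token strings and function names are illustrative, not from the paper.

```python
CALL, SEP, RETURN = "[CALL]", "[SEP]", "[RETURN]"

def reduce_once(tokens):
    """Apply one PENCIL reduction: C [CALL] T [SEP] A [RETURN] -> C A.

    Taking the *last* [CALL] before the first [RETURN] picks the innermost
    call, so nested calls unwind like a stack and the C/T/A split is never
    ambiguous (cf. item 1 above). Assumes a well-formed sequence.
    """
    try:
        r = tokens.index(RETURN)          # first completed call
    except ValueError:
        return tokens, False              # nothing to reduce yet
    c = max(i for i in range(r) if tokens[i] == CALL)  # innermost [CALL]
    s = tokens.index(SEP, c + 1)          # its separator
    # keep the context C and the answer A; discard the thoughts T
    return tokens[:c] + tokens[s + 1 : r] + tokens[r + 1 :], True

def reduce_all(tokens):
    changed = True
    while changed:
        tokens, changed = reduce_once(tokens)
    return tokens

# nested example: an inner call sits inside the outer call's thoughts
seq = ["Q", CALL, "t1", CALL, "t2", SEP, "a2", RETURN, "t3", SEP, "a1", RETURN]
```

Note how the inner answer "a2" first lands inside the outer thoughts, and is then discarded together with them when the outer call returns, leaving only ["Q", "a1"].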

Main Comments and Questions

  1. (Zhi) Is PENCIL generalizable to other types of reasoning tasks?
    1. data preparation pipeline (the algorithmic generation process) only works for specific constraint satisfaction problems: Can we still manually craft such a dataset when the task doesn’t have a constraint satisfaction structure? Maybe add a structure score with [CALL] [SEP] [RETURN] in the RL process, similar to how the [reason] is introduced?
    2. meta: how big is the class of constraint satisfaction problems in the space of all reasoning tasks? Is it big enough that models tuned on these tasks generalize to other reasoning tasks?
    3. e.g. How does PENCIL perform on AIME or MATH? How to preprocess data for math problems? (I assume it’s not easily algorithmic like DPLL, maybe a lot of human effort is needed.)
    4. In general “letting the model figure it out” tends to scale better and has better generalization capability, so can we use a “reduction model” for data curation and reduction at inference time instead of reduction rules and algorithmic generation of SFT dataset?
  2. (Zhi) Taking a step back: can we incentivize the model to produce shorter CoT at some stage, in some way, to reduce the overly long and wasteful CoT? Strategic context pruning?
    1. [Tianhao] +1. I was thinking about adding a length penalty with annealing. When the length penalty is large, the model might make more mistakes, and we can gradually decay the length penalty to 0 so that the model can(?) find the correct answer with the shortest response length. This is somewhat like the regularization path in optimization problems, i.e. gradually decaying the regularizer coefficient to 0.
    2. [Zhi] Or we can penalize the model more severely when the answer is wrong, because the CoT tends to be longer when the final answer is wrong. Some papers attribute this to an inherent bias of GRPO toward longer reasoning on wrong cases.
  3. [Tianhao] What’s the meaning of “test-time scalability”? Is the model trained on problems with smaller $n$ and then tested on problems of larger size than those of the training data? What’s $n$ for the training data?
    1. (Zeyu) The $n$ for the training data is not clear from the paper; could we infer that the training and test data share the same $n$ from the sentence “We evaluate on a held-out validation set of 100 problem instances using two metrics”?
    2. [Zhi] My views:
      1. given a fixed inference-time budget, the model can achieve 95% accuracy on some class of problems; the maximum complexity or size (e.g. in SAT) among these 95%-solvable-with-budget problems measures the test-time scalability
      2. it is trained and tested on problems within the same range of sizes. So the word scalability should be with respect to inference time, instead of problem size
  4. [Tianhao] For the results in Table 1, the CoT method degrades sharply for $n > 7$. Is this because of the distribution shift between the problem sizes in the training data and those in the test data? For $n > 7$, do the results suggest that the CoT method has bad out-of-distribution generalization? And for the PENCIL method, is $n \in [7, 10]$ considered in-distribution or out-of-distribution?
  5. [Tianhao] For results on PENCIL for the SAT problem in Table 1, it seems that the accuracy is still very high even when the trace rate is only 83%. What’s happening for these instances with relatively low trace rate?
  6. [Tianhao] Why is the accuracy for QBF systematically higher than that of SAT?
  7. [Binghua] Can we develop a hierarchical tree structure to store the answer instead of using a stack structure? In a stack-based approach, we might lose information if we realize that we have over-reduced. It could be beneficial to train a scoring mechanism to estimate the degree of “weirdness” or “uncertainty” in the next-token predictor. If the predictor detects high uncertainty, we can provide more nodes containing answer-related information to the next-token generator. Conversely, if the predictor is confident in its answer, we reduce the number of nodes it receives.
  8. [Binghua] Can we further improve the reduction method? In PENCIL, it appears that we simply remove the T part during reduction. However, an alternative approach could involve compressing T instead of entirely discarding it (using text-compression techniques). Additionally, a parameter could be introduced to control the extent to which we penalize the length of T.
  9. [Tianhao] How is the training loss calculated?
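
The length-penalty-with-annealing idea from item 2 could be sketched as below. The linear decay schedule and the coefficient lam0 are illustrative choices, not anything from the paper.

```python
def length_penalized_reward(correct, n_tokens, step, total_steps, lam0=0.01):
    """Reward = correctness minus an annealed length penalty.

    The penalty coefficient decays linearly to 0 over training (cf. the
    regularization-path analogy in item 2): early on, the model is pushed
    hard toward short responses; by the end, only correctness matters.
    """
    lam = lam0 * max(0.0, 1.0 - step / total_steps)
    return (1.0 if correct else 0.0) - lam * n_tokens
```

For example, at step 0 a correct 100-token answer earns roughly nothing (its length cost cancels the correctness reward), while at the final step the same answer earns the full reward of 1.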

Ideas and Thoughts for Future Direction

  1. [Tianhao] Incentivize the model to reduce CoT length. We might need some form of curriculum learning.
    1. Also, it’s unclear if it’s beneficial to reduce CoT length or the length of the final proof in math problems.
  2. [Tianhao] Can we simultaneously train two models to automatically realize the PENCIL-style approach? One model for generation, and the other for reduction.
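
A hypothetical sketch of the two-model idea above: one model generates tokens, a second reduces the context whenever a [RETURN] appears. The stand-in functions toy_generate/toy_summarize and the [ANSWER] stop token are invented for illustration; only the control flow is meant seriously.

```python
from collections import deque

def pencil_with_reducer(generate, summarize, prompt, max_steps=100):
    """Alternate between a generator model and a reducer model.

    `generate(ctx)` returns the next token; `summarize(ctx)` returns a
    shortened context. Both are stand-ins for trained models.
    """
    ctx = list(prompt)
    for _ in range(max_steps):
        tok = generate(ctx)
        ctx.append(tok)
        if tok == "[ANSWER]":      # hypothetical stop token
            return ctx
        if tok == "[RETURN]":      # hand the trace to the reducer
            ctx = summarize(ctx)
    return ctx

# toy deterministic stand-ins, just to exercise the control flow
script = deque(["[CALL]", "t1", "[SEP]", "a1", "[RETURN]", "[ANSWER]"])

def toy_generate(ctx):
    return script.popleft()

def toy_summarize(ctx):
    # mimic the PENCIL reduction: C [CALL] T [SEP] A [RETURN] -> C A
    c, s, r = ctx.index("[CALL]"), ctx.index("[SEP]"), ctx.index("[RETURN]")
    return ctx[:c] + ctx[s + 1 : r]

result = pencil_with_reducer(toy_generate, toy_summarize, ["Q"])
```

With a learned summarizer (e.g. the GPT-4-as-reducer variant discussed in the meeting notes), toy_summarize would be replaced by a model call rather than a rule.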

Meeting Notes

1. Discussion regarding the PENCIL methodology

  • Generalization to other tasks. The current reasoning paths seem specifically crafted for the tasks.
    • PENCIL for math?
    • Difficulty in data preparation
    • For inference time, can we directly apply PENCIL?
      • Not sure how to automatically do this
    • How to train the model to have behavior such as proposing and proving a lemma
      • Probably not for current models? (need to check this)
  • Future directions
    • Incentivize the model to reduce CoT length
    • Simultaneously train two models: One model for generation, and the other for summarization.
      • Maybe directly use GPT4 or something similar, without training the second model
        • What’s the training data for the first model?
          • Start with a small amount of human-labelled data
          • Use LLM to segment and summarize the existing CoT data
          • General summarization pattern?
      • Maybe even directly combine a model trained for full CoT with GPT
        • Ablation study: Can existing model benefit from summarization?
      • Check literature to see if this has been done
  • Alternate approach: hierarchical tree structure instead of the stack-based approach.
    • How to construct dataset?
  • Some intermediate thoughts might still be useful in the future.
    • PENCIL + retrieval augmented generation?
    • Also depends on problem difficulty

2. Discussion regarding the experiments

  • How the loss is calculated
    • The loss is only calculated for generated tokens up to the [RETURN] token
  • For the three examples, how are the reasoning paths generated, and are they unique?
    • DPLL
  • Trace rate as an evaluation metric
    • Why is this a good evaluation metric?
    • What happens when the trace rate is low but accuracy is high?
  • Test-time scalability
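
The loss statement above (“only calculated for generated tokens up to the [RETURN] token”) could be sketched as a masking rule; the actual tensor layout and token names in the paper may differ.

```python
def loss_mask(tokens, prompt_len, return_token="[RETURN]"):
    """Build a 0/1 mask over a training sequence.

    Supervise only the tokens the model must generate: everything after
    the prompt, up to and including the first [RETURN]. Tokens after the
    reduction point get no loss, since the reduction is applied by the
    runtime rather than learned token-by-token.
    """
    mask = [0] * len(tokens)
    for i in range(prompt_len, len(tokens)):
        mask[i] = 1
        if tokens[i] == return_token:
            break
    return mask
```

For instance, with a 2-token prompt followed by [CALL] t [SEP] a [RETURN], only those five generated tokens carry loss.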

3. Discussion regarding the theory of PENCIL

  • A high-level understanding of the theoretical results and proof