<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>AI Archives - Blog IT</title>
	<atom:link href="https://blogit.create.pt/tag/ai/feed/" rel="self" type="application/rss+xml" />
	<link>https://blogit.create.pt/tag/ai/</link>
	<description>Create IT blogger community</description>
	<lastBuildDate>Tue, 13 Jan 2026 13:45:08 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.1</generator>
	<item>
		<title>Lessons learned improving code reviews with AI</title>
		<link>https://blogit.create.pt/davidpereira/2026/01/09/lessons-learned-improving-code-reviews-with-ai/</link>
					<comments>https://blogit.create.pt/davidpereira/2026/01/09/lessons-learned-improving-code-reviews-with-ai/#respond</comments>
		
		<dc:creator><![CDATA[David Pereira]]></dc:creator>
		<pubDate>Fri, 09 Jan 2026 12:44:41 +0000</pubDate>
				<category><![CDATA[Misc]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[GenAI]]></category>
		<guid isPermaLink="false">https://blogit.create.pt/?p=13548</guid>

					<description><![CDATA[<p>Table of Contents Introduction I have loved code reviews for years now, and still to this day, I love seeing good open source PRs! When I say good, I mean really great! We have access to tons of open source code, and the greatest PRs are the ones where you can learn a lot from [&#8230;]</p>
<p>The post <a href="https://blogit.create.pt/davidpereira/2026/01/09/lessons-learned-improving-code-reviews-with-ai/">Lessons learned improving code reviews with AI</a> appeared first on <a href="https://blogit.create.pt">Blog IT</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Table of Contents</h2>



<ul style="max-width:1005px" class="wp-block-list">
<li>Introduction</li>



<li>Why we started experimenting</li>



<li>Our AI code review journey
<ul style="max-width:960px" class="wp-block-list">
<li>Claude Code</li>



<li>Saving learnings in memory</li>



<li>GitHub Copilot</li>



<li>CodeRabbit and Qodo</li>
</ul>
</li>



<li>Tool of choice
<ul style="max-width:960px" class="wp-block-list">
<li>Improving multi-agent collaboration</li>
</ul>
</li>



<li>Resources</li>



<li>Conclusion</li>
</ul>



<h2 class="wp-block-heading">Introduction</h2>



<p>I have loved code reviews for years now, and still to this day, I love seeing good open source PRs! When I say good, I mean really great! We have access to tons of open source code, and the greatest PRs are the ones where you can learn a lot about&nbsp;<strong>how to do it right</strong>. In a sense, this blog post is about just that. It&#8217;s part of a series where I share how AI is augmenting my work, and what I&#8217;m learning from it. If you&#8217;re interested, you can read the first post here:&nbsp;<a href="https://blogit.create.pt/davidpereira/2025/09/10/becoming-augmented-by-ai/" target="_blank" rel="noreferrer noopener">Becoming augmented by AI</a>. In that post, I mention how AI has augmented me with an &#8220;initial code review&#8221;, but now I&#8217;ll go deeper into this topic. I&#8217;ll share our hands-on experience: what works, what doesn&#8217;t, and a healthy dose of my opinions along the way 😄.</p>



<p><strong>Quick disclaimer</strong>: what works for us might not work for you. Your team and coding guidelines are different, and that&#8217;s fine. These are just our honest experiences.</p>



<p>With that said, let&#8217;s dive into why we started incorporating AI tools in our code review process.</p>



<h2 class="wp-block-heading">Why we started experimenting</h2>



<p>I recently watched this amazing&nbsp;<a href="https://www.youtube.com/watch?v=glfB3KLQR7E" target="_blank" rel="noreferrer noopener">video by CodeRabbit</a>. In our team, code review isn&#8217;t really the bottleneck (yet), but it&#8217;s funny because we are also using AI heavily for feature development and trying to improve&#8230; hummm &#8220;velocity&#8221; 🤣.</p>



<p>Anyway, I understand that many teams nowadays are creating far more PRs, and that some PRs simply get a blind LGTM.</p>



<figure class="wp-block-image size-large is-resized"><img fetchpriority="high" decoding="async" width="300" height="168" src="https://blogit.create.pt/wp-content/uploads/2026/01/giphy.gif" alt="" class="wp-image-13589" style="aspect-ratio:1.785770356097909;width:464px;height:auto" /></figure>



<p>Maybe some PRs just have increasingly more AI slop&#8230; which wears down the senior engineers tasked with code review 😅. Not all professionals&nbsp;<strong>want to do it right</strong>; maybe they just want to ship because their company&#8217;s &#8220;productivity metrics&#8221; incentivize merging more and more PRs 😅. Honestly, it&#8217;s&nbsp;<a href="https://simonwillison.net/2025/Dec/18/code-proven-to-work/" target="_blank" rel="noreferrer noopener">our job to deliver code we have proven to work</a>; I fully agree with Simon Willison. Throwing slop over to the engineers who do code review is unprofessional, just as much as throwing untested features over to QA 😐. In our case, we changed to having a dedicated dev responsible for all code reviews, and we don&#8217;t have that many per day. We simply wanted to improve code quality and reduce bugs, while keeping code review an educational process for junior engineers.</p>



<p>About five months ago, our team started experimenting with AI tools (GitHub Copilot, Claude Code, Codacy, Qodo, and CodeRabbit) to see how they could help us improve our review process without adding a ton of noise. There are more tools we didn&#8217;t try, like Augment Code and Greptile (which has some cool&nbsp;<a href="https://www.greptile.com/benchmarks" target="_blank" rel="noreferrer noopener">benchmarks</a>), but hopefully the lessons we learned will be useful to you either way.</p>






<h2 class="wp-block-heading">Our AI code review journey</h2>



<p>We already talked in the last post about our&nbsp;<a href="https://blogit.create.pt/davidpereira/2025/09/10/becoming-augmented-by-ai/#custom-instructions" target="_blank" rel="noreferrer noopener">custom instructions</a>, to some extent. Specifically for code review we took a phased approach and started comparing different tools:</p>



<ol style="max-width:965px" class="wp-block-list">
<li>Started with&nbsp;<a href="https://docs.github.com/en/copilot/concepts/agents/code-review" target="_blank" rel="noreferrer noopener">GitHub Copilot Code Review</a></li>



<li>Integrated Claude Code with GitHub and started comparing code reviews from both tools</li>



<li>Added CodeRabbit, Qodo and Codacy to spot differences between them</li>



<li>Refined prompts/instructions/configs for some tools</li>
</ol>



<p>We didn&#8217;t invest equal time in all of them, though. Copilot and Claude ended up getting most of our attention, especially since we started using Copilot Code Review (CCR) when it was in public preview. Overall, we experimented with these tools in 30+ PRs, and made 20+ PRs to refine our prompts/instructions/agents.</p>



<h3 class="wp-block-heading">Claude Code</h3>



<p>Let&#8217;s go through Claude Code first. Here is a snippet of our&nbsp;<code>code-review</code>&nbsp;Claude Code custom slash command:</p>



<pre class="wp-block-code"><code>---
allowed-tools: Bash(dotnet test), Read, Glob, Grep, LS, Task, Explore, mcp.....
description: Perform a comprehensive code review of the requested PR or code changes, taking into consideration code standards
---

## Role

You are a world-class autonomous code review agent. You operate within a secure GitHub Actions environment.
Your analysis is precise, your feedback is constructive, and your adherence to instructions is absolute.
You do not deviate from your programming. You are tasked with reviewing a GitHub Pull Request.

## Primary Directive

Your sole purpose is to perform a comprehensive and constructive code review of this PR, and post all feedback and suggestions using the **GitHub review system** and provided tools.
All output must be directed through these tools. Any analysis not submitted as a review comment or summary is lost and constitutes a task failure.

## Input data
PR NUMBER: $ARGUMENTS

You MUST follow these steps to review the PR:
1. **Start a review**: Use `mcp__github__create_pending_pull_request_review` to begin a pending review
2. **Get diff information**: Use `mcp__github__get_pull_request_diff` to understand the code changes and line numbers
3. **Get list of files**: If you can't get diff information, use `mcp__github__get_pull_request_files` to get the list of files that were added, removed, and changed in the pull request
4. **Add comments**: Use `mcp__github__add_comment_to_pending_review` for each specific piece of feedback on particular lines
5. **Submit the review**: Use `mcp__github__submit_pending_pull_request_review` with event type "COMMENT" (not "REQUEST_CHANGES") to publish all comments as a non-blocking review

You can find all the code review standards and guidelines that you MUST follow here: `.github/instructions/code-review.instructions.md`

## Output format

**CRITICAL RULE** - DO NOT include compliments, positive notes, or praise in your review comments.
Be thorough but filter your comments aggressively - quality over quantity. Focus ONLY on issues, improvements, and actionable feedback.

**Output Violation Examples** (DO NOT DO THIS):
`The code follows best practices by...`
`Positive changes/notes`

**Important**: Submit as "COMMENT" type so the review doesn't block the PR.</code></pre>



<p>Yes, some of the wording might seem weird, like praising the AI with &#8220;You are a world-class&#8221; or &#8220;your adherence to instructions is absolute&#8221;. As with the uppercase &#8220;DO NOT&#8221; and &#8220;IMPORTANT&#8221; we mentioned before, I can&#8217;t explain some of this stuff or find enough research claiming it affects how the LLM pays&nbsp;<strong>attention</strong>&nbsp;to instructions. I just experiment and learn, and&nbsp;<a href="https://github.com/google-github-actions/run-gemini-cli/blob/main/examples/workflows/pr-review/gemini-review.toml" target="_blank" rel="noreferrer noopener">Gemini</a>&nbsp;likes to use this phrase for code reviews as well 😄 (as do 115 other devs on GitHub 😅).</p>



<p>To be honest, we still have too much noise in AI PR comments, or just tons of fluff. The bright side is that at least the compliments have mostly disappeared 😅. You might enjoy getting this:</p>



<figure class="wp-block-image size-full"><img decoding="async" width="831" height="182" src="https://blogit.create.pt/wp-content/uploads/2026/01/image-2.png" alt="" class="wp-image-13558" srcset="https://blogit.create.pt/wp-content/uploads/2026/01/image-2.png 831w, https://blogit.create.pt/wp-content/uploads/2026/01/image-2-300x66.png 300w, https://blogit.create.pt/wp-content/uploads/2026/01/image-2-768x168.png 768w, https://blogit.create.pt/wp-content/uploads/2026/01/image-2-696x152.png 696w" sizes="(max-width: 831px) 100vw, 831px" /></figure>



<p>I don&#8217;t 🤣, especially when one PR has 5 of these. I do leave praise comments for my team, yes, because positive comments are good&#8230; when they come from a human who knows the other person, IMO. Also, there are many comments that don&#8217;t belong in a PR; they belong in a linter or other tools. We have&nbsp;<a href="https://csharpier.com/docs/About" target="_blank" rel="noreferrer noopener">CSharpier</a>&nbsp;and&nbsp;<a href="https://learn.microsoft.com/en-us/dotnet/fundamentals/code-analysis/overview?tabs=net-10" target="_blank" rel="noreferrer noopener">.NET analyzers</a>&nbsp;for that.</p>



<p>It also doesn&#8217;t have the best GitHub integration for now; at least we&#8217;ve had some problems (<a href="https://github.com/anthropics/claude-code-action/issues/584" target="_blank" rel="noreferrer noopener">400 errors</a>,&nbsp;<a href="https://github.com/anthropics/claude-code-action/issues/589" target="_blank" rel="noreferrer noopener">branch 404 errors</a>) with the GitHub Action, like&nbsp;<a href="https://github.com/anthropics/claude-code-action/issues/548" target="_blank" rel="noreferrer noopener">not having access to GitHub MCP tools</a>&nbsp;even though we set them in the&nbsp;<code>allowed-tools</code>&nbsp;option.</p>



<figure class="wp-block-image size-full"><img decoding="async" width="782" height="72" src="https://blogit.create.pt/wp-content/uploads/2026/01/image-1.png" alt="" class="wp-image-13557" srcset="https://blogit.create.pt/wp-content/uploads/2026/01/image-1.png 782w, https://blogit.create.pt/wp-content/uploads/2026/01/image-1-300x28.png 300w, https://blogit.create.pt/wp-content/uploads/2026/01/image-1-768x71.png 768w, https://blogit.create.pt/wp-content/uploads/2026/01/image-1-696x64.png 696w" sizes="(max-width: 782px) 100vw, 782px" /></figure>
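<p>For context, we trigger our custom slash command from a GitHub Actions workflow. Here is a minimal sketch of what that wiring can look like; the input names of <code>anthropics/claude-code-action</code> change between versions, so treat the <code>prompt</code> input and the secret name below as assumptions and verify them against the action&#8217;s README:</p>

<pre class="wp-block-code"><code># Sketch only - verify input names against the claude-code-action README
name: claude-code-review
on:
  pull_request:
    types: [opened, synchronize]

permissions:
  contents: read
  pull-requests: write

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          # run our custom slash command; the PR number becomes $ARGUMENTS
          prompt: "/code-review ${{ github.event.pull_request.number }}"</code></pre>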



<p>Anyway, we iterated a lot on instructions and prompts so far, since we use them for both Claude and Copilot. Here is a quick recap of what features we use from Claude Code:</p>



<ul style="max-width:965px" class="wp-block-list">
<li>Sub-agents (custom and built-in)</li>



<li>Built-in&nbsp;<code>/review</code>&nbsp;and&nbsp;<a href="https://www.claude.com/blog/automate-security-reviews-with-claude-code" target="_blank" rel="noreferrer noopener">security review</a>&nbsp;commands</li>



<li>Custom slash commands (<code>code-review.md</code>)</li>



<li>Plugins, specifically&nbsp;<a href="https://github.com/anthropics/claude-code/blob/main/plugins/code-review/commands/code-review.md" target="_blank" rel="noreferrer noopener">code-review plugin</a>&nbsp;authored by Boris Cherny</li>
</ul>
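<p>To make these pieces concrete, here is roughly where the files live. The directory locations follow Claude Code&#8217;s documented conventions (<code>.claude/commands</code> and <code>.claude/agents</code>); the exact file names are our own:</p>

<pre class="wp-block-code"><code>.claude/
  commands/
    code-review.md                   # custom /code-review slash command
  agents/                            # optional custom sub-agents
CLAUDE.md                            # project memory, loaded every session
.github/
  instructions/
    code-review.instructions.md      # coding standards and bad smells</code></pre>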



<p>We run those two built-in commands in parallel, but it&#8217;s just to see if we get any extra good feedback. Our custom code review slash command already does a good review following our guidelines, and the &#8220;code-review&#8221; plugin from Boris works very well with parallel agents. We basically went through the famous spiral:</p>



<pre class="wp-block-code"><code>Write CLAUDE.md -&gt; Ask for code review -&gt; Find bad comments and noise we don't want -&gt; Re-write CLAUDE.md and other files -&gt; Do some meta-prompting -&gt; Repeat</code></pre>



<p>Like I said, our custom code review prompt/command has evolved over time, refined whenever we learned something new. We started with this&nbsp;<a href="https://github.com/anthropics/claude-code-action/issues/60#issuecomment-2952771401" target="_blank" rel="noreferrer noopener">incredible suggestion</a>&nbsp;to use the GitHub MCP. We also searched other GitHub repos, mostly .NET-related, to see how they set up their instructions and whether they had anything particular around code review (e.g. for GitHub Copilot). I find&nbsp;<a href="https://github.com/dotnet/aspire/blob/main/.github/copilot-instructions.md">.NET Aspire</a>&nbsp;to be a super cool real-life example 🙂 . I think a lot of their AI adoption is led by David Fowler, so I often check their PRs to see what we can learn from them, e.g.&nbsp;<a href="https://github.com/dotnet/aspire/pull/13361" target="_blank" rel="noreferrer noopener">this one</a>.</p>



<p>Anyway, our prompt was still a bit vague, so we had some chats with Claude; good old meta-prompting 🙂. After a while, Claude suggested a new file with all the coding standards and bad smells we want to avoid:&nbsp;<code>code-review.instructions.md</code>. It lives under&nbsp;<code>.github/instructions</code>, but that doesn&#8217;t matter; Claude can use it. The bad smells are specific, and we see them referenced quite often in our PRs now. Still, we don&#8217;t have a perfect solution for overly large PRs. We simply communicate more often or have more than one dev working on the PR in those cases. When a feature genuinely requires lots of new code, the best forum to debate and provide actionable feedback is talking. Sure, this isn&#8217;t always possible; people are busy or prefer async work. In our team, getting on a call, or demoing the PR, helps make large PRs way more digestible. Draft PRs also help somewhat, to get feedback early on.</p>



<h4 class="wp-block-heading">Avoiding noise comments</h4>



<p>Our biggest lesson learned here is running our custom code review slash command locally and using sub-agents. Locally, we can provide the proper context for the review; the rest is the agent using tools and reasoning. No noise gets sent to GitHub comments because all the back-and-forth happens in the chat, and right now Claude Code works better locally than on GitHub Actions. Sub-agents have been amazing, since the main reason Claude Code uses them is context management. Now that there is a built-in&nbsp;<code>Explore</code>&nbsp;sub-agent, our code review command uses it to run Explore sub-agents in parallel (with Haiku 4.5) without clogging up the main context window.</p>



<p>I&#8217;ve learned recently of&nbsp;<a href="https://blog.sshh.io/i/177742847/custom-subagents" target="_blank" rel="noreferrer noopener">other devs using a different workflow</a>, basically leveraging the&nbsp;<code>Task</code>&nbsp;tool for the main agent to spawn sub-agents. Whichever way you do it, I recommend using a sub-agent focused on exploring the codebase and the potential impacts of the PR.</p>
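<p>To illustrate, the exploration part of a code review command could read something like the paraphrased snippet below. The sub-agent name, model, and wording are illustrative, not our exact prompt:</p>

<pre class="wp-block-code"><code>## Exploration phase

Before writing any comment, spawn Explore sub-agents IN PARALLEL
(via the Task tool, on a cheaper model such as Haiku), one per question:

1. Which modules and public APIs does this PR touch?
2. Which existing callers could break with these changes?
3. Do the changed code paths have test coverage?

Only the summarized findings return to the main context window.
Use them to decide which comments are worth posting.</code></pre>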



<h3 class="wp-block-heading">Saving learnings in memory</h3>



<p>Every once in a while, once we&#8217;ve merged a few PRs, we use Claude to improve itself based on those PRs. This is our prompt:</p>



<pre class="wp-block-code"><code>Please look at the 5 most recent PRs in our GitHub repository, and check for learnings in order to improve the code review workflow. Please ultrathink on this task, so that all necessary memory files are updated taking into account these learnings, like @CLAUDE.md and @.github\instructions\ Focus on seeing code review comments that were good and made it into the codebase afterwards (e.g. coding standards violations). Ignore bad comments that were resolved with a "negative comment" or thumbs down emoji. Ask me clarifying questions before you begin. YOU MUST create a changelog file explaining why you made these edits to instruction files. Each learning must reference a PR that exists. The best is for you to link the exact comment that you used for a given learning</code></pre>



<p>At the end of the session, we usually have a few items that are good enough to add. Most are&nbsp;<strong>learnings around bugs</strong>&nbsp;we can catch earlier; some are coding standards. Honestly, a lot of suggestions aren&#8217;t what I want, or I just think they won&#8217;t be useful in future code reviews. But doing this has been important for me to take a step back and think about what we can learn from the work we&#8217;ve already merged. I reflect on it and then discuss it with my team. I&#8217;ve seen others talk about this idea too and keep a&nbsp;<code>learnings.md</code>, e.g.&nbsp;<a href="https://github.com/nibzard/awesome-agentic-patterns/blob/main/LEARNINGS.md" target="_blank" rel="noreferrer noopener">this repo</a>. At least this process seems better for us than simply using emojis to give feedback, which the&nbsp;<a href="https://www.coderabbit.ai/blog/why-emojis-suck-for-reinforcement-learning" target="_blank" rel="noreferrer noopener">CodeRabbit blog</a>&nbsp;also alludes to 😅.</p>



<h3 class="wp-block-heading">GitHub Copilot</h3>



<p>Copilot&#8217;s code review features were super basic in the beginning. We tried and experimented with it a lot when it came out. It only caught nitpicks,&nbsp;<code>console.log</code>&nbsp;statements and typos; really not helpful in any other area. Sure, catching these is good, but a human reviewer catches them in the first pass too. It didn&#8217;t support all languages, so we often got zero comments or feedback. Then, in the last few months: completely different, night and day.</p>



<p>If you have seen GitHub Universe, you know <a href="https://dev.to/bolt04/github-universe-2025-recap-9gl" target="_blank" rel="noreferrer noopener">what&#8217;s new</a>. But in case you don&#8217;t, the GitHub team has invested heavily in Copilot code review and the coding agent, and it shows. The code review agent is often right in every comment; it makes suggestions that are actually based on our instructions and memory files, meaning our PRs follow consistent code style and team conventions (with a link to these&nbsp;<a href="https://docs.github.com/en/copilot/how-tos/configure-custom-instructions/add-repository-instructions" target="_blank" rel="noreferrer noopener">docs</a>).</p>



<figure class="wp-block-image size-full is-resized"><img decoding="async" width="797" height="356" src="https://blogit.create.pt/wp-content/uploads/2026/01/image-4.png" alt="" class="wp-image-13562" style="aspect-ratio:2.2388195797239607;width:799px;height:auto" srcset="https://blogit.create.pt/wp-content/uploads/2026/01/image-4.png 797w, https://blogit.create.pt/wp-content/uploads/2026/01/image-4-300x134.png 300w, https://blogit.create.pt/wp-content/uploads/2026/01/image-4-768x343.png 768w, https://blogit.create.pt/wp-content/uploads/2026/01/image-4-696x311.png 696w" sizes="(max-width: 797px) 100vw, 797px" /></figure>



<p>And the agent session is somewhat transparent, since you can view it in GitHub actions now:</p>



<figure class="wp-block-image size-full"><img decoding="async" width="998" height="262" src="https://blogit.create.pt/wp-content/uploads/2026/01/image-3.png" alt="" class="wp-image-13559" srcset="https://blogit.create.pt/wp-content/uploads/2026/01/image-3.png 998w, https://blogit.create.pt/wp-content/uploads/2026/01/image-3-300x79.png 300w, https://blogit.create.pt/wp-content/uploads/2026/01/image-3-768x202.png 768w, https://blogit.create.pt/wp-content/uploads/2026/01/image-3-696x183.png 696w" sizes="(max-width: 998px) 100vw, 998px" /></figure>



<p>I mean &#8220;somewhat&#8221; because there are things I can&#8217;t configure, just like Claude Code and most tools, I guess 😅. In the logs I can see the option&nbsp;<code>UseGPT5Model=false</code>, and that it&#8217;s using Sonnet 4.5. There is also this &#8220;MoreSeniorReviews&#8221; flag that I couldn&#8217;t find any info on, and believe me&#8230; I wanted to because it was set to false 🤣 &#8211; the logs show <code>ccr[MoreSeniorReviews=false;EnableAgenticTools=true;EnableMemoryUsage=false...</code></p>



<p>Are you telling me there could be a hidden way to get a more senior review&#8230; sign me up! Jokes aside, I couldn&#8217;t find much info on the endpoint&nbsp;<code>api.githubcopilot.com/agents/swe</code>&nbsp;of CAPI (presumably Copilot API) the Autofind agent was calling, and the contents of the&nbsp;<code>ccr/callback</code>&nbsp;saved in&nbsp;<code>results-agent.json</code>. I can only hope some of these options are configurable in the future.</p>



<p>I checked the&nbsp;<a href="https://docs.github.com/en/copilot/how-tos/provide-context/use-mcp/extend-copilot-chat-with-mcp#remote-server-configuration-example-with-oauth" target="_blank" rel="noreferrer noopener">MCP docs</a>, hoping to find details about these options, but no luck.</p>



<p>Anyway, it also now has access to CodeQL and some linters, which is amazing because we didn&#8217;t have this before. It&#8217;s how we are able to leverage CodeQL analysis in all our PRs now; we couldn&#8217;t do this with any other AI code review tool. We also see that it calls a &#8220;store_comment&#8221; tool during its session and only submits the comments to GitHub at the end. This is useful: sometimes it stores a comment because it thought something was wrong in the implementation, then reads more code into context that invalidates the stored comment, so it no longer submits that comment in the PR. Much like the CodeRabbit validation agent, this reduces the amount of noise we get in PRs.</p>



<h3 class="wp-block-heading">CodeRabbit and Qodo</h3>



<p>Let&#8217;s start with the cool features CodeRabbit has:</p>



<ul style="max-width:1005px" class="wp-block-list">
<li>Code diagrams in Mermaid</li>



<li>Generates a poem! Yes, a poem for my PR</li>



<li>Summary of changes added to the description</li>
</ul>



<p>Now&#8230; I gotta be honest, I don&#8217;t care about any of them 😅. They are cool, but I only glance at the poem or ignore it. I never read or care about the summary; I get one from Copilot and edit it myself. All the code and sequence diagrams I saw generated in our PRs were simply not useful, though a lot of them were for front-end code. I simply don&#8217;t look at them later, and if it makes sense, we update our architecture diagrams once the code is merged. With that said, the code suggestions and feedback are obscenely good. By far the best AI code review tool when it comes to actionable and valuable feedback/suggestions (by a long shot)! Even though we didn&#8217;t configure&nbsp;<code>.coderabbit.yaml</code>&nbsp;or try to optimize it, CodeRabbit already uses&nbsp;<a href="https://docs.coderabbit.ai/integrations/knowledge-base#code-guidelines:-automatic-team-rules" target="_blank" rel="noreferrer noopener">Claude and Copilot instructions</a>, so the work we did on those was probably used by CodeRabbit. In some of our PRs it caught some nasty bugs and gave super useful feedback. Our team was impressed!</p>
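<p>If you do want to tune CodeRabbit, configuration lives in a <code>.coderabbit.yaml</code> at the repository root. A hedged sketch is below; the key names come from CodeRabbit&#8217;s docs, but validate them against the current schema before committing:</p>

<pre class="wp-block-code"><code># .coderabbit.yaml - sketch only, validate against CodeRabbit's schema
reviews:
  profile: "assertive"       # or "chill" for fewer nitpicky comments
  poem: false                # skip the PR poem
  path_instructions:
    - path: "**/*.cs"
      instructions: "Follow the standards in .github/instructions/code-review.instructions.md"</code></pre>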



<p>The insights CodeRabbit adds during code review piqued my interest. I read a few of their blog posts on context engineering, like&nbsp;<a href="https://www.coderabbit.ai/blog/context-engineering-ai-code-reviews" target="_blank" rel="noreferrer noopener">this one</a>, where I found it interesting that there is a separate validation agent before submitting comments. This is probably why they maintain a high signal-to-noise ratio. I also read the open-source version of CodeRabbit; they have some&nbsp;<a href="https://github.com/coderabbitai/ai-pr-reviewer/blob/main/src/prompts.ts" target="_blank" rel="noreferrer noopener">prompts</a>&nbsp;there. I know it&#8217;s old, but it&#8217;s what I have access to. I especially like this instruction, which we also have 😅: &#8220;Do NOT provide general feedback, summaries, explanations of changes, or praises for making good additions&#8221;.</p>



<p>We basically tried to have Claude and Copilot understand our large codebase, not focusing only on the PR diff. It&#8217;s harder, we still have a lot to improve here.&nbsp;<a href="https://www.coderabbit.ai/blog/how-coderabbit-delivers-accurate-ai-code-reviews-on-massive-codebases" target="_blank" rel="noreferrer noopener">CodeRabbit claims</a>&nbsp;it&#8217;s known to be great at understanding large codebases. I don&#8217;t see any research backing this, just opinions. But yes, we humans don&#8217;t like large PRs either:</p>



<figure class="wp-block-image size-full"><img decoding="async" width="638" height="436" src="https://blogit.create.pt/wp-content/uploads/2026/01/image-7.png" alt="" class="wp-image-13571" srcset="https://blogit.create.pt/wp-content/uploads/2026/01/image-7.png 638w, https://blogit.create.pt/wp-content/uploads/2026/01/image-7-300x205.png 300w, https://blogit.create.pt/wp-content/uploads/2026/01/image-7-615x420.png 615w, https://blogit.create.pt/wp-content/uploads/2026/01/image-7-218x150.png 218w" sizes="(max-width: 638px) 100vw, 638px" /></figure>



<p>Honestly, I couldn&#8217;t find many large PRs that were reviewed much better by CodeRabbit than by Claude Code and Copilot. But one thing we liked a lot is that it uses&nbsp;<strong>collapsed sections</strong>&nbsp;in markdown very well, for example:</p>



<figure class="wp-block-image size-full"><img decoding="async" width="897" height="457" src="https://blogit.create.pt/wp-content/uploads/2026/01/image-5.png" alt="" class="wp-image-13563" srcset="https://blogit.create.pt/wp-content/uploads/2026/01/image-5.png 897w, https://blogit.create.pt/wp-content/uploads/2026/01/image-5-300x153.png 300w, https://blogit.create.pt/wp-content/uploads/2026/01/image-5-768x391.png 768w, https://blogit.create.pt/wp-content/uploads/2026/01/image-5-824x420.png 824w, https://blogit.create.pt/wp-content/uploads/2026/01/image-5-696x355.png 696w" sizes="(max-width: 897px) 100vw, 897px" /></figure>



<p>But I mean, we did have cases where we tried to use Claude Code for code review on a PR that was already reviewed by CodeRabbit, and ~60% of the context window was comments made by CodeRabbit. All that markdown ain&#8217;t friendly for AI with limited context windows. There were times I swear I could see Claude behind every word CodeRabbit wrote, with the &#8220;You&#8217;re absolutely correct&#8221; 🤣, e.g.</p>



<figure class="wp-block-image size-full is-resized"><img decoding="async" width="782" height="164" src="https://blogit.create.pt/wp-content/uploads/2026/01/image-6.png" alt="" class="wp-image-13564" style="width:836px;height:auto" srcset="https://blogit.create.pt/wp-content/uploads/2026/01/image-6.png 782w, https://blogit.create.pt/wp-content/uploads/2026/01/image-6-300x63.png 300w, https://blogit.create.pt/wp-content/uploads/2026/01/image-6-768x161.png 768w, https://blogit.create.pt/wp-content/uploads/2026/01/image-6-696x146.png 696w" sizes="(max-width: 782px) 100vw, 782px" /></figure>



<p>But it could be GPT models or whatever, we never truly know what is behind these products 🙂.</p>



<h4 class="wp-block-heading">Qodo</h4>



<p>As for Qodo, we liked the fact that it checks for compliance and flags violations as non-compliant (no other tool had this built in). This was previously just a bullet point in our markdown file. The code review feedback was good; sometimes we ended up making the suggested changes Qodo leaves in the comment. After reading more about what compliance checks Qodo does, we improved by adding specific instructions to our&nbsp;<code>code-review.instructions.md</code>&nbsp;for ISO 9001, GDPR and others:</p>



<pre class="wp-block-code"><code>## Regulatory Compliance Checks

### Data Protection (GDPR/HIPAA/PCI-DSS)
- Does this code handle PII (Personally Identifiable Information)?
- Are sensitive fields properly encrypted at rest and in transit?
- Is data retention policy followed (deletion after X days)?
- Are audit logs created for data access?
- Is data anonymization/pseudonymization applied where required?

### Security Standards (SOC 2 / ISO 27001)
- Are all external API calls wrapped with proper error handling?
- Is input validation present for all user inputs?
- Are authentication checks present on all sensitive endpoints?
- Are secrets/credentials stored securely (no hardcoding)?
- Is sensitive data logged or exposed in error messages?</code></pre>
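<p>Some of these checklist items don&#8217;t even need an LLM; they can be pre-screened deterministically before the AI review runs. Here is a minimal Python sketch of the hardcoded-secrets check, purely illustrative (the pattern list and function name are my own, not part of Qodo; a real scanner such as gitleaks is far more thorough):</p>

<pre class="wp-block-code"><code>import re

# Naive credential patterns, for illustration only.
SECRET_PATTERNS = [
    re.compile(r"(?i)(api[_-]?key|password|secret)\s*=\s*['\"][^'\"]+['\"]"),
]

def find_hardcoded_secrets(source):
    """Return the offending lines so a review comment can quote them."""
    hits = []
    for line in source.splitlines():
        if any(p.search(line) for p in SECRET_PATTERNS):
            hits.append(line.strip())
    return hits</code></pre>

<p>Running a cheap check like this first keeps the obvious violations out of the LLM&#8217;s context window and leaves the model to reason about the actual judgment calls.</p>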



<p>We kept experimenting with Qodo for longer than CodeRabbit, but the insights and feedback never reached CodeRabbit&#8217;s level. It was still a good tool that improved our codebase and sparked good discussions.</p>



<h2 class="wp-block-heading">Tool of choice</h2>



<p>Our prompts/instructions can still be improved, of course. We&#8217;ve experimented with different prompts, memory and instruction files. We&#8217;ve also researched how other teams use AI for code review, and how tools like CodeRabbit do context engineering. All of this is because our goal is to keep improving our software development process and ensure high quality; adopting new tools is one way of achieving that goal. Given that most AI code review tools have a price tag, we decided to focus on using only one or two tools and optimizing them. Yes, it&#8217;s Claude Code and GitHub Copilot 😄. I basically burn through 100% of both my Copilot and Claude quotas every month, and I get more requests out of Claude even though I hit its weekly rate limit every time.</p>



<p>We know CodeRabbit is amazing, and these paid AI tools will continue getting better. There is actually a new tool supporting code review that we didn&#8217;t use,&nbsp;<a href="https://www.augmentcode.com/product/code-review" target="_blank" rel="noreferrer noopener">Augment Code</a>&nbsp;(these AI companies move so fast 😅). No amount of customizing our setup with Claude or Copilot will reach the same output as these dedicated paid code review tools. But for us, it makes more sense to pay for one tool and leverage it across multiple steps of our software development lifecycle.</p>



<h3 class="wp-block-heading">Improving multi-agent collaboration</h3>



<p>Claude and Copilot are working very well for our code review process. But like I&#8217;ve been saying, there is work to do. We learned a lot from using each tool, but there are more areas to improve, at least in Claude Code, since we have more flexibility there. I&#8217;m currently looking at implementing the &#8220;Debate and Consensus&#8221; multi-agent design pattern (<a href="https://arxiv.org/abs/2406.11776" target="_blank" rel="noreferrer noopener">Google DeepMind paper</a>&nbsp;and&nbsp;<a href="https://arxiv.org/abs/2509.11035" target="_blank" rel="noreferrer noopener">Free-MAD</a>), basically a&nbsp;<a href="https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/ai-agent-design-patterns#group-chat-orchestration" target="_blank" rel="noreferrer noopener">group chat orchestration</a>. I just want to try it out; I&#8217;m not sure I&#8217;ll get better code reviews by having different agents (e.g. Security, Quality and Performance) debate and review the code from different perspectives. If they run sequentially, the quality agent can have questions for the performance agent, and each can agree or disagree with the reported issues. We can try out LLM-as-a-Judge as well, to focus on reducing noise and following code quality standards.</p>
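<p>The sequential flow can be prototyped before wiring in any real model calls. A rough Python sketch, assuming each agent is a function that sees the diff plus the issues raised so far and returns the set it stands behind (all agent names, heuristics and the two-round loop are my own illustration, not a real implementation):</p>

<pre class="wp-block-code"><code>def security_agent(diff, issues):
    # Raises its own finding; never disputes the others.
    if "execute(" in diff:
        issues = issues | {"possible SQL injection"}
    return issues

def quality_agent(diff, issues):
    if "TODO" in diff:
        issues = issues | {"leftover TODO"}
    return issues

def performance_agent(diff, issues):
    # Flags its own concern, and disagrees with the injection
    # finding when it can see the query is parameterized.
    if "time.sleep(" in diff:
        issues = issues | {"blocking sleep in request path"}
    if "?" in diff:
        issues = issues - {"possible SQL injection"}
    return issues

def debate(diff, agents, rounds=2):
    """Each round, every agent sees the current issue set and may add
    or drop items; what survives all rounds is the consensus review."""
    issues = set()
    for _ in range(rounds):
        for agent in agents:
            issues = agent(diff, issues)
    return issues</code></pre>

<p>In a real setup, each agent would be an LLM call with its own persona prompt, and an LLM-as-a-Judge step could replace this simple agree/drop dynamic with an explicit vote.</p>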



<p>Anyway, we&#8217;ll continue learning, optimizing, and improving the way we work 🙂.</p>



<h2 class="wp-block-heading">Resources</h2>



<ul style="max-width:1005px" class="wp-block-list">
<li><a href="https://graphite.com/blog/ai-wont-replace-human-code-review" target="_blank" rel="noreferrer noopener">Why AI will never replace human code review</a></li>



<li><a href="https://www.youtube.com/watch?v=-GIiTfKZx6M" target="_blank" rel="noreferrer noopener">AI Code Reviews with CodeRabbit&#8217;s Howon Lee</a></li>



<li><a href="https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report" target="_blank" rel="noreferrer noopener">CodeRabbit report: AI code creates 1.7x more problems</a></li>



<li><a href="https://awesomereviewers.com/reviewers/" target="_blank" rel="noreferrer noopener">Awesome reviewers GH repo</a></li>



<li><a href="https://www.youtube.com/watch?v=nItsfXwujjg" target="_blank" rel="noreferrer noopener">Anthropic’s NEW Claude Code Review Agent (Full Open Source Workflow)</a></li>



<li><a href="https://blog.sshh.io/p/how-i-use-every-claude-code-feature" target="_blank" rel="noreferrer noopener">How I Use Every Claude Code Feature</a></li>
</ul>



<h2 class="wp-block-heading">Conclusion</h2>



<p>The number one thing we learned is:&nbsp;<strong>experimentation is king</strong>. Like we talked about before, the Jagged Frontier changes with every model release. Claude Opus 4.5 behaves a bit differently, for example, on&nbsp;<a href="https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/claude-4-best-practices#tool-usage-and-triggering" target="_blank" rel="noreferrer noopener">tool triggering</a>&#8230; maybe we can stop shouting and being aggressive 🤣. We must keep experimenting and learning; we can&#8217;t calibrate the prompt once and expect the best result.</p>



<p>For now we are quite happy: the human reviewer has more time to focus on design decisions and discuss trade-offs with the PR&#8217;s author. I don&#8217;t envision a future where AI does 100% of the code review.</p>



<p>If you&#8217;re considering AI for code reviews, my advice is simple: just try it. Pick one tool, run a one-month pilot, and see what happens. The worst case is you turn it off. The best case is that your team becomes augmented and probably improves code quality.</p>



<p>My next blog post in this series will be about how we are using agentic coding tools! Are you using AI code review tools? I&#8217;d love to hear what your experience has been. Leave a comment and let&#8217;s chat 🙂.</p>



<p>The post <a href="https://blogit.create.pt/davidpereira/2026/01/09/lessons-learned-improving-code-reviews-with-ai/">Lessons learned improving code reviews with AI</a> appeared first on <a href="https://blogit.create.pt">Blog IT</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blogit.create.pt/davidpereira/2026/01/09/lessons-learned-improving-code-reviews-with-ai/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Becoming augmented by AI</title>
		<link>https://blogit.create.pt/davidpereira/2025/09/10/becoming-augmented-by-ai/</link>
					<comments>https://blogit.create.pt/davidpereira/2025/09/10/becoming-augmented-by-ai/#respond</comments>
		
		<dc:creator><![CDATA[David Pereira]]></dc:creator>
		<pubDate>Wed, 10 Sep 2025 17:24:13 +0000</pubDate>
				<category><![CDATA[Misc]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[GenAI]]></category>
		<guid isPermaLink="false">https://blogit.create.pt/?p=13531</guid>

					<description><![CDATA[<p>Table of Contents Introduction We&#8217;re deep into Co-Intelligence in Create IT&#8217;s book club — definitely worth your time! Between that and the endless stream of LLM content online, I&#8217;ve been in full research mode. Still, I can&#8217;t just watch and hear others talk about these tools, I must experiment myself and learn how to use [&#8230;]</p>
<p>The post <a href="https://blogit.create.pt/davidpereira/2025/09/10/becoming-augmented-by-ai/">Becoming augmented by AI</a> appeared first on <a href="https://blogit.create.pt">Blog IT</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Table of Contents</h2>



<ul style="max-width:1005px" class="wp-block-list">
<li>Introduction</li>



<li>The &#8220;Jagged Frontier&#8221; concept</li>



<li>Becoming augmented by AI
<ul style="max-width:960px" class="wp-block-list">
<li>AI as a co-worker</li>



<li>AI as a co-teacher</li>
</ul>
</li>



<li>My augmentation list
<ul style="max-width:960px" class="wp-block-list">
<li>Custom instructions</li>



<li>Meta-prompting</li>
</ul>
</li>



<li>Resources</li>



<li>Conclusion</li>
</ul>



<h2 class="wp-block-heading">Introduction</h2>



<p>We&#8217;re deep into <a href="https://www.amazon.com/-/pt/dp/059371671X/ref=sr_1_1">Co-Intelligence</a> in Create IT&#8217;s book club — definitely worth your time! Between that and the endless stream of LLM content online, I&#8217;ve been in full research mode. Still, I can&#8217;t just watch and listen to others talk about these tools; I have to experiment myself and learn how to use them for my own use cases.</p>



<p>Software development is complex, and my job isn&#8217;t just churning out code. Still, there are many concepts in this book that we&#8217;ve internalized and started adopting. In this post, I&#8217;ll share my opinions and some of the practical guidelines our team has been following to become augmented by AI.</p>



<h2 class="wp-block-heading">The &#8220;Jagged Frontier&#8221; concept</h2>



<p>The Jagged Frontier described by the author Ethan Mollick is an amazing concept in my opinion. It describes how tasks that appear to be of similar difficulty may be performed either better or worse by humans using AI. Due to the &#8220;jagged&#8221; nature of the frontier, the same knowledge workflow can have tasks on both sides of the frontier, according to a <a href="https://www.hbs.edu/ris/Publication%20Files/24-013_d9b45b68-9e74-42d6-a1c6-c72fb70c7282.pdf">publication the author took part in</a>.</p>



<p>This leads to the&nbsp;<strong>Centaur vs. Cyborg</strong>&nbsp;distinction, which is really interesting. Using both approaches (deeply integrated collaboration and separation of tasks) seems to be the way to achieve co-intelligence. One very important Cyborg practice seen in that publication is &#8220;push-back&#8221; and &#8220;demanding logic explanation&#8221;: we disagree with the AI output, give it feedback, and ask it to reconsider and explain better. Or, as I often do, ask it to double-check against official documentation that what it&#8217;s telling me is correct. It&#8217;s also important to understand that this frontier can change as these models improve; hence the focus on experimentation to understand where the Jagged Frontier lies for each LLM. It&#8217;s definitely knowledge that everyone in the industry right now wants to acquire (and maybe share afterwards 😅).</p>



<h2 class="wp-block-heading">Becoming augmented by AI</h2>



<p>I&#8217;m aware of the marketed productivity gains, where&nbsp;<a href="https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/">GitHub Copilot usage makes devs 55% faster</a>, and other studies that have been posted about GenAI increasing productivity. I&#8217;m also aware of the studies claiming the opposite 😄 like the&nbsp;<a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/">METR study</a>&nbsp;showing AI makes devs&nbsp;<strong>19% slower</strong>. However, I don&#8217;t see 55% productivity gains for myself, and I don&#8217;t think it makes me slower either.</p>



<p>In my opinion, productivity gains aren&#8217;t measured by producing more code. Number of PRs? Nope. Acceptance rate for AI suggestions? Definitely not! I firmly believe the less code, the better. The less slop, the better too 😄. I&#8217;m currently focused on assessing&nbsp;<strong>DORA metrics</strong>&nbsp;and others for my team, because we want to measure whether AI-assisted coding, and the other ways we use AI as an augmentation tool, actually improve those metrics or make them worse. The rest of the marketing and hype doesn&#8217;t matter.</p>



<p>Ethan Mollick provides numerous examples and research on how professionals across industries are already leveraging AI tools, like the Cyborg approach. But if we focus on our software industry, what does it mean for a tech lead to be augmented by AI? What tasks would be good to involve an AI in without compromising quality?</p>



<h3 class="wp-block-heading">AI as a co-worker</h3>



<p>For a tech lead that works with Azure services, an important skill is to know how to leverage the correct Azure services to build, deploy, and manage a scalable solution. So it becomes very useful to have an AI partner that can have a conversation about this, for example about Azure Durable Functions. This conversation can be shallow, and not get all the implementation details 100% correct. That&#8217;s okay, because the tech lead (and any dev 😅) also needs to exhibit&nbsp;<strong>critical thinking</strong>&nbsp;and evaluate the AI responses.&nbsp;<strong>This is not a skill we want to delegate</strong>&nbsp;to these models, at least in my opinion and in the&nbsp;<a href="https://www.oneusefulthing.org/p/against-brain-damage">author&#8217;s opinion</a>. There is a relevant&nbsp;<a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2025/01/lee_2025_ai_critical_thinking_survey.pdf">research paper</a>&nbsp;about this by Microsoft as well.</p>



<p>The goal can simply be to have a conversation with a co-worker to spark new ideas or possible solutions we haven&#8217;t thought of. Using AI for ideation is a great use case, not just for engineering, but for product features too, like UI/UX, important metrics to capture, etc. If it generates 20 ideas, there is a higher chance you spot the bad ones, filter them out, and clear your mind or steer it toward better ideas. Here is an example of getting ideas for fixing a recurring exception:</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img decoding="async" width="1024" height="810" src="https://blogit.create.pt/wp-content/uploads/2025/09/image-1024x810.png" alt="" class="wp-image-13535" style="width:761px;height:auto" srcset="https://blogit.create.pt/wp-content/uploads/2025/09/image-1024x810.png 1024w, https://blogit.create.pt/wp-content/uploads/2025/09/image-300x237.png 300w, https://blogit.create.pt/wp-content/uploads/2025/09/image-768x607.png 768w, https://blogit.create.pt/wp-content/uploads/2025/09/image-531x420.png 531w, https://blogit.create.pt/wp-content/uploads/2025/09/image-696x550.png 696w, https://blogit.create.pt/wp-content/uploads/2025/09/image-1068x844.png 1068w, https://blogit.create.pt/wp-content/uploads/2025/09/image.png 1185w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Example of using AI to get multiple options</figcaption></figure>
</div>


<p></p>



<p>It asks clarifying questions so that I can give it more useful context. Then I can see the response, iterate, or ask for more ideas, etc. I almost always set these instructions for any LLM:</p>



<pre class="wp-block-code"><code>Ask clarifying questions before giving an answer. Keep explanations not too long. Try to be as insightful as possible, and remember to verify if a solution can be implemented when answering about Azure and architecture in general.
It's also very important for you to verify if there is official documentation that supports your claims and statements. Please find official documentation supporting your claims, before responding to a user. If there isn't documentation confirming your statement, don't include it in the response.</code></pre>



<p>That is also why it searches for docs. I&#8217;ve gotten way too many statements in LLM responses that, when I follow up on them, the model realizes were an error or an assumption. When I ask it further about the sentence it just gave me, I just get &#8220;You&#8217;re right &#8211; I was wrong about that&#8221;&#8230; Don&#8217;t become too over-reliant on these tools 😅.</p>



<h3 class="wp-block-heading">AI as a co-teacher</h3>



<p>With that said, the tech lead and senior devs are also responsible for upskilling their team by sharing knowledge and best practices, challenging juniors with more complex tasks, etc. And this part of the job isn&#8217;t that simple; it&#8217;s hard to be a force multiplier that improves everyone around you. So, what if the tech lead could use AI in this way, by creating&nbsp;<a href="https://code.visualstudio.com/docs/copilot/customization/prompt-files">reusable prompts</a>, documentation, and custom agents? How about the tech lead uses AI as a co-teacher, and then shares how to do it with the rest of the team? All of these can then help onboard juniors and help them understand our codebase and our domain. The&nbsp;<a href="https://www.anthropic.com/engineering/claude-code-best-practices">Claude Code best practices post</a>&nbsp;also references onboarding as a good use case that helps Anthropic engineers:</p>



<p><em>&#8220;At Anthropic, using Claude Code in this way has become our core onboarding workflow, significantly improving ramp-up time and reducing load on other engineers.&#8221;</em></p>



<p>A lot of onboarding time is spent on understanding the business logic and then how it&#8217;s implemented. For juniors, it&#8217;s also about the design patterns or codebase structure. So I really think this is a net-positive for the whole team.</p>



<h2 class="wp-block-heading">My augmentation list</h2>



<p>It might not be much, but these are essentially the tasks where I&#8217;m augmented by AI:</p>



<p><strong>Technical</strong>:</p>



<ul style="max-width:1005px" class="wp-block-list">
<li><strong>Initial</strong>&nbsp;code review (e.g. nitpicks, typos), some stuff I should really just automate 😅</li>



<li>Generate summaries for the PR description</li>



<li>Architectural discussions, including trade-off and risk analysis
<ul style="max-width:960px" class="wp-block-list">
<li>Draft an ADR (Architecture decision record) based on my analysis and arguments</li>
</ul>
</li>



<li>Co-Teacher and Co-Worker
<ul style="max-width:960px" class="wp-block-list">
<li>&#8220;Deep Research&#8221; and discussion about possible solutions</li>



<li>Learn new tech with analogies or specific Azure features</li>



<li>Find new sources of information (e.g. blog posts, official docs, conference talks)</li>
</ul>
</li>



<li>Troubleshooting for specific infrastructure problems
<ul style="max-width:960px" class="wp-block-list">
<li>Generating KQL queries (e.g. rendering charts, analyzing traces &amp; exceptions &amp; dependencies)</li>
</ul>
</li>



<li>Refactoring and documentation suggestions</li>



<li>Generation of new unit tests given X scenarios</li>
</ul>



<p><strong>Non-technical</strong>:</p>



<ul style="max-width:1005px" class="wp-block-list">
<li>Summarizing book chapters/blog posts or videos (e.g. NotebookLM)</li>



<li>Role play in various scenarios (e.g. book discussions)</li>
</ul>



<p>Of course, we also need to talk about the tasks that fall outside the Jagged Frontier. Again, these can vary from person to person. From my usage and experiments so far, these are the tasks that currently fall outside the frontier:</p>



<ul style="max-width:1005px" class="wp-block-list">
<li>Being responsible for technical support tickets, where a customer encountered an error or has a question about our product. This involves answering the ticket, asking clarifying questions when necessary, opening up tickets on a 3rd party that are related to this issue, and then resolving the issue.</li>



<li>Deep valuable code review. This includes good insights, suggestions, and knowledge sharing to improve the PR author&#8217;s skills. <a href="https://www.coderabbit.ai/">CodeRabbit</a> does often give valuable code reviews, way better than any other solution. Still not the same as human review 🙂</li>



<li>Development of a v0 (or draft) for new complex features</li>



<li>Fixing bugs that require business domain knowledge</li>
</ul>



<p>Delegating some of those tasks would be cool, at least 50% 😄, while our engineering team focuses on other tasks. But oh well, maybe that day will come.</p>



<h2 class="wp-block-heading">AI-assisted coding</h2>



<p>AI-assisted coding can be very helpful on some tasks, and lately my goal is to increase the number of tasks AI can assist me with. In our team, we&#8217;ve read&nbsp;<a href="https://www.anthropic.com/engineering/claude-code-best-practices">Claude Code Best practices</a>&nbsp;in order to learn and see what fits our use case best. Then we dove deeper into some topics that post references; for example,&nbsp;<a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/extended-thinking-tips">these docs</a>&nbsp;were very useful for learning about Claude&#8217;s extended thinking feature, complementing the usage of &#8220;think&#8221; &lt; &#8220;think hard&#8221; &lt; &#8220;think harder&#8221; &lt; &#8220;ultrathink&#8221;. We also found&nbsp;<a href="https://simonwillison.net/2025/Apr/19/claude-code-best-practices/">this post by Simon Willison</a>&nbsp;about the feature interesting. In most tasks, an iterative approach, just like normal software development, is indeed way better than one-shotting with the perfect prompt. Still, when it takes too many iterations (some bugfixes were too complex because it&#8217;s hard to pinpoint the bug&#8217;s location), performance degrades and the result gets bad overall (infinite loading spinner of death 🤣).</p>



<p>Before we can use AI-assisted coding on more complex tasks, we need to improve the output quality. So we&#8217;ve invested a lot of time in fine-tuning custom instructions and meta-prompting. Let&#8217;s talk about these two.</p>



<h3 class="wp-block-heading" id="custom-instructions">Custom Instructions</h3>



<p>According to the Copilot docs, instructions should be short, self-contained statements. Most principles in&nbsp;<a href="https://learn.microsoft.com/en-us/training/modules/introduction-prompt-engineering-with-github-copilot/2-prompt-engineering-foundations-best-practices">prompt engineering</a>&nbsp;are about being short and specific, and making sure our critical instructions are something the model pays special attention to. As everyone keeps saying, the context window is very important, so it&#8217;s really good if we can keep the instruction file to around 200 lines. The longer our instructions are, the greater the risk that the LLM won&#8217;t follow them, since it can pay more attention to other tokens or forget relevant instructions. With that said, keeping instructions short is also a challenge when we use the few-shot prompting technique and add more examples.</p>



<p>To build our custom instructions, we used the C# and Blazor files from&nbsp;<a href="https://github.com/github/awesome-copilot/tree/main">the awesome-copilot repo</a>&nbsp;and other sources of inspiration like&nbsp;<a href="https://parahelp.com/blog/prompt-design">parahelp prompt design</a>&nbsp;to get a first version. We wanted to know what techniques other teams use. Then we made specific edits to follow our own guidelines and removed rules specific to explaining concepts, etc. We also added some&nbsp;<strong>capitalized words</strong>&nbsp;that are common in system prompts or commands, like IMPORTANT, NEVER, ALWAYS, MUST. The IMPORTANT line also sits at the end of the file, to try to&nbsp;<strong>refocus</strong>&nbsp;the attention on coding standards:</p>



<pre class="wp-block-code"><code>IMPORTANT: Follow our coding standards when implementing features or fixing bugs. If you are unsure about a specific coding standard, ask for clarification.</code></pre>



<p>I&#8217;m not 100% sure how this capitalization works, or why it works&#8230; and I have not found docs/evidence/research on this. All I know is that capitalized words have different tokens than their lowercase versions. It&#8217;s probably something the model pays more attention to, since in the training data these words signal importance. I do wish Microsoft, OpenAI, and Anthropic covered capitalization in their prompt engineering docs/tutorials.</p>



<p>It&#8217;s at the end of our file since it&#8217;s also&nbsp;<a href="https://huggingface.co/papers/2307.03172">being researched that the beginning and end of a prompt</a>&nbsp;are what the LLM pays more attention to and finds more relevant; some middle parts are &#8220;meh&#8221; and can be forgotten.&nbsp;<a href="https://learn.microsoft.com/en-us/azure/ai-foundry/openai/concepts/prompt-engineering?tabs=chat#repeat-instructions-at-the-end">Microsoft docs</a>&nbsp;essentially say the same; it&#8217;s known as &#8220;<strong>recency bias</strong>&#8221;. In most prompts we see, this section exists at the end to refocus the LLM&#8217;s attention.</p>
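<p>The same recency-bias trick can be baked into any prompt-assembly step. A tiny Python sketch, assuming a hypothetical helper (the function name and structure are my own illustration) that repeats the critical rule at the very end of the assembled prompt:</p>

<pre class="wp-block-code"><code>def assemble_prompt(critical_rule, body_sections):
    """Place the critical rule first, then the body, then repeat the rule
    at the very end, where recency bias makes the model attend to it."""
    parts = [critical_rule] + list(body_sections) + ["IMPORTANT: " + critical_rule]
    return "\n\n".join(parts)</code></pre>

<p>This way, no matter how long the middle sections grow, the rule we care about most stays in the two positions the model weighs most heavily.</p>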



<h3 class="wp-block-heading">Meta-prompting</h3>



<p>Our goal also isn&#8217;t to have the perfect custom instructions and prompts, since refining them later with an iterative/conversational approach works well. But we came across the concept of&nbsp;<a href="https://cookbook.openai.com/examples/enhance_your_prompts_with_meta_prompting">meta-prompting</a>, a term that is becoming more popular. Basically, we asked Claude how to improve our prompt, and it gave us some cool ideas for improving our instructions/reusable prompts.</p>
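<p>In practice, the simplest form of meta-prompting is wrapping the existing instruction file in a critique request. A small Python sketch of how such a meta-prompt can be framed (the wording and function name are illustrative assumptions, not our exact setup):</p>

<pre class="wp-block-code"><code>def build_meta_prompt(instructions):
    """Wrap an existing instruction file in a critique-and-rewrite request."""
    return (
        "You are a prompt engineer. Review the instructions below.\n"
        "1. List ambiguities, contradictions, and missing edge cases.\n"
        "2. Propose a rewritten version that is shorter and more specific.\n"
        "Justify each change you propose.\n\n"
        "--- INSTRUCTIONS ---\n" + instructions
    )</code></pre>

<p>The model&#8217;s rewrite then goes through normal human review; we only keep the changes we actually agree with.</p>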



<p>But don&#8217;t forget to use LLMs with caution&#8230; I keep getting &#8220;You&#8217;re absolutely right&#8230;&#8221; and it&#8217;s annoying how sycophantic it is oftentimes 😅</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" width="696" height="398" src="https://blogit.create.pt/wp-content/uploads/2025/09/image-3.png" alt="" class="wp-image-13542" srcset="https://blogit.create.pt/wp-content/uploads/2025/09/image-3.png 696w, https://blogit.create.pt/wp-content/uploads/2025/09/image-3-300x172.png 300w" sizes="(max-width: 696px) 100vw, 696px" /></figure>
</div>


<p>The quality of the output is most likely affected by the complexity of the task I&#8217;m working on too. Prompting skills only go so far; from what I&#8217;ve researched and learned, I can say there is a learning curve to understanding LLMs. So we need to continue experimenting and learning about the layers between our prompt and the output we see.</p>



<h2 class="wp-block-heading">Resources</h2>



<p>This is not an exhaustive list by any means, just some resources I find very useful:</p>



<ul style="max-width:1005px" class="wp-block-list">
<li><a href="https://www.youtube.com/watch?v=EWvNQjAaOHw&amp;t=7238s">Andrej Karpathy &#8211; How I use LLMs</a></li>



<li><a href="https://www.youtube.com/watch?v=LCEmiRjPEtQ">Andrej Karpathy: Software Is Changing (Again)</a>
<ul style="max-width:960px" class="wp-block-list">
<li>Related to this is&nbsp;<a href="https://natesnewsletter.substack.com/p/software-30-vs-ai-agentic-mesh-why">this post from Nate Jones</a></li>
</ul>
</li>



<li><a href="https://www.youtube.com/watch?v=tbDDYKRFjhk">Does AI Actually Boost Developer Productivity? (100k Devs Study) &#8211; Yegor Denisov-Blanch, Stanford</a></li>



<li><a href="https://www.anthropic.com/engineering/claude-code-best-practices">Claude Code: Best practices for agentic coding</a></li>



<li><a href="https://zed.dev/blog/why-llms-cant-build-software">Why LLMs Can&#8217;t Really Build Software</a></li>



<li><a href="https://www.youtube.com/watch?v=-1yH_BTKgXs">Is AI the Future of Software Development, or Just a new Abstraction? Insights from Kelsey Hightower</a></li>



<li><a href="https://cookbook.openai.com/examples/gpt-5/gpt-5_prompting_guide">GPT-5 prompting guide</a></li>
</ul>



<h2 class="wp-block-heading">Conclusion</h2>



<p>I&#8217;ve enjoyed learning and improving myself over the years. But with GenAI I now feel like I can learn a lot more and improve even further, since I&#8217;m choosing these as&nbsp;<strong>augmentation tools</strong>. Hopefully, this article motivates you to pursue AI augmentation for yourself. It&#8217;s okay to be skeptical about all the hype you see and hear around these tools; it&#8217;s a good mechanism for not falling for the sales pitches and fluff that CEOs and others in the industry push. Just don&#8217;t let your skepticism prevent you from learning, experimenting, building your own opinion, and finding ways to improve your work 🙂.</p>



<p>Still&#8230; I can&#8217;t deny my curiosity to know more about how these systems work underneath. How is fine-tuning done exactly? How does post-training work? Can these models emit telemetry (logs, traces, metrics) that we can observe? Why does capitalization (e.g. IMPORTANT, MUST) or setting a&nbsp;<a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/system-prompts">role/persona</a>&nbsp;improve prompts? Can we really not have access to a high-level tree with the weights the LLM uses to correlate tokens, and use it to justify why a given output was produced? Or why an instruction given as input was not followed? It&#8217;s okay to just have a basic understanding and know about the new abstractions we have with these LLMs. But knowing how that abstraction works leads to knowing how to transition to automation.</p>



<p>I will keep searching and learning more in order to answer these questions or find engineers in the industry who have answered them. Especially around&nbsp;<strong>interpretability research</strong>, which is amazing!!! I recommend reading this research, for example &#8211;&nbsp;<a href="https://www.anthropic.com/research/tracing-thoughts-language-model">Tracing the thoughts of a large language model</a>. Hope you enjoyed reading, feel free to share in the comments below how you use AI to augment yourself 🙂.</p>
<p>The post <a href="https://blogit.create.pt/davidpereira/2025/09/10/becoming-augmented-by-ai/">Becoming augmented by AI</a> appeared first on <a href="https://blogit.create.pt">Blog IT</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blogit.create.pt/davidpereira/2025/09/10/becoming-augmented-by-ai/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
