50 test cases ready

Click "Run Evaluation" to process all cases through Layer 1 (extraction) and Layer 2 (intent detection).
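The two-layer run described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `EvalCase`, `EvalResult`, `run_evaluation`, `toy_extract`, and `toy_detect_intents` are hypothetical, the keyword rules are placeholders rather than the real layers, and passing the Layer 1 output into Layer 2 is an assumption about how the layers are wired.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    case_id: str   # e.g. "eval_007"
    message: str   # raw message text
    source: str    # "gmail" or "chat"


@dataclass
class EvalResult:
    case_id: str
    artifacts: List[str]  # Layer 1 output (extraction)
    intents: List[str]    # Layer 2 output (intent detection)


def run_evaluation(
    cases: List[EvalCase],
    extract: Callable[[str], List[str]],
    detect_intents: Callable[[str, List[str]], List[str]],
) -> List[EvalResult]:
    """Run every case through Layer 1, then Layer 2, and collect the outputs."""
    results = []
    for case in cases:
        artifacts = extract(case.message)                   # Layer 1
        intents = detect_intents(case.message, artifacts)   # Layer 2 (assumed to see Layer 1 output)
        results.append(EvalResult(case.case_id, artifacts, intents))
    return results


# Hypothetical stand-ins for the real layers: simple keyword rules for demonstration only.
def toy_extract(message: str) -> List[str]:
    text = message.lower()
    found = []
    if "meet" in text:
        found.append("meeting suggestion")
    if "please" in text:
        found.append("action item")
    return found


def toy_detect_intents(message: str, artifacts: List[str]) -> List[str]:
    mapping = {"meeting suggestion": "schedule meeting", "action item": "create task"}
    intents = [mapping[a] for a in artifacts if a in mapping]
    return intents or ["no action"]
```

Swapping the toy functions for real extraction and intent-detection calls would leave the loop itself unchanged.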

Eval Dataset (50 cases)

multi-artifact (10) | single-artifact (7) | no-action (7) | approval (6) | travel (2) | edge-case (8) | question (4) | long-message (4) | short (2)
eval_001 | Meeting + task + deadline + project | multi-artifact | gmail

Can we meet Thursday at 2pm to review the Q3 roadmap? Also, please send me the updated budget spreadsheet by EOD tomorrow. This is for the Project Atlas team.

meeting suggestion, action item, deadline, project reference, schedule meeting, create task, set reminder, label thread
eval_002 | Complex thread with mentions | multi-artifact | gmail

Quick update: the API migration is behind schedule. @Sarah needs to review the auth changes by Wednesday. Can we set up a sync with the backend team? Also, please label this thread under Project Phoenix.

action item, deadline, meeting suggestion, project reference, create task, set reminder, schedule meeting, label thread
eval_003 | Travel with logistics | multi-artifact | gmail

I'll be in NYC next week for the client meeting at Google's Chelsea office. Can you book a conference room for Tuesday 10am? Also need to update the travel expense report.

travel info, meeting suggestion, action item, schedule meeting, create task
eval_004 | Onboarding tasks with multiple deadlines | multi-artifact | gmail

Welcome to the team! Please complete the security training by Friday, set up your dev environment by next Monday, and schedule a 1:1 with your manager this week. The onboarding doc is in the Project Onboarding folder.

action item, action item, meeting suggestion, deadline, project reference, create task, schedule meeting, set reminder
eval_005 | Sprint planning follow-up | multi-artifact | chat

After today's sprint planning, here are the action items: @David needs to finalize the API spec by Thursday, @Emma should update the test suite, and we need to schedule the design review for next Tuesday. Tag this under Project Horizon.

action item, action item, deadline, meeting suggestion, project reference, create task, schedule meeting, set reminder, label thread
eval_006 | Clear meeting request | single-artifact | chat

Can we do a quick 30-minute call tomorrow at 3pm to discuss the launch plan?

meeting suggestion, schedule meeting
eval_007 | Single task assignment | single-artifact | chat

Please update the README with the new deployment instructions.

action item, create task
eval_008 | Direct question | single-artifact | chat

What is the current status of the database migration? Are we on track for the Tuesday deadline?

question, send reply
eval_009 | Explicit deadline only | single-artifact | gmail

Reminder: the compliance report is due this Friday at 5pm sharp. No extensions.

deadline, set reminder
eval_010 | Simple FYI | no-action | chat

Just wanted to let you know the design review went well yesterday. The team loved the new navigation flow. No action needed from your side.

fyi, no action
eval_011 | Status update, no action | no-action | chat

FYI: The deployment completed successfully at 2am. All services are green. No issues detected.

fyi, no action
eval_012 | Thank you message | no-action | chat

Thanks for handling the client demo yesterday. Great job! The feedback was very positive.

fyi, no action
eval_013 | Approval request with deadline | approval | gmail

I need your approval on the Project Atlas design doc before we proceed to engineering review. The deadline is Friday.

approval request, project reference, deadline, send reply, label thread, set reminder
eval_014 | Budget approval request | approval | gmail

The Q4 marketing budget proposal is ready for your sign-off. We need approval by end of week to proceed with vendor contracts.

approval request, deadline, send reply, set reminder
eval_015 | Question needing reply | approval | gmail

Hi, can you confirm whether we should use the v2 or v3 API for the integration? The partner team is waiting on our decision.

question, send reply
eval_016 | Conference travel | travel | gmail

I will be attending the GCP Next conference in San Francisco from April 9-11. Please book flights and a hotel near Moscone Center.

travel info, action item, create task
eval_017 | Project status with filing | travel | gmail

The Project Mercury beta launch metrics are in. DAU is up 23% week over week. Filing this under the Mercury project folder.

project reference, fyi, label thread, no action
eval_018 | Vague possible meeting | edge-case | chat

We should probably catch up sometime next week about the project. Let me know what works.

meeting suggestion, schedule meeting
eval_019 | Implicit task in question form | edge-case | chat

Could you take a look at the failing CI pipeline when you get a chance? It has been red since this morning.

action item, create task
eval_020 | Mixed FYI with subtle action | edge-case | chat

The new office wifi password is WorkSmart2026. Also, remember to submit your expense reports by month end.

fyi, action item, deadline, create task, set reminder
eval_021 | Technical question | question | chat

Which authentication provider should we use for the new microservice? We are deciding between Auth0 and Firebase Auth.

question, send reply
eval_022 | Multiple questions in one message | question | chat

A few questions: When is the next release window? Who is the oncall this week? And do we have budget for a new staging environment?

question, question, question, send reply
eval_023 | Long meeting follow-up | long-message | gmail

Hi team, following up on our discussion from the product review meeting. Here are the next steps we agreed on: 1) @Alex will finalize the PRD for the notification feature by next Wednesday. 2) Design team needs to provide mockups by the following Monday. 3) We should schedule a follow-up review in two weeks. The project codename is Starlight. Let me know if I missed anything from the notes.

action item, action item, deadline, meeting suggestion, project reference, create task, set reminder, schedule meeting, label thread
eval_024 | Incident post-mortem | long-message | gmail

Post-mortem summary for the March 15 outage: Root cause was a misconfigured load balancer. Impact: 45 minutes of degraded service. Action items: @ops team to add health check alerts by Friday, @SRE to update the runbook, and we need to schedule a review with the VP of Engineering.

action item, action item, deadline, meeting suggestion, create task, set reminder, schedule meeting
eval_025 | Very short informal message | short | chat

Sounds good, thanks!

fyi, no action
eval_026 | Quarterly review prep | multi-artifact | gmail

For the Q4 business review on December 15: @Lisa needs to prepare the revenue slides, @Tom should finalize partner metrics, and we need to book the large conference room. This is Project Titan.

meeting suggestion, action item, action item, deadline, project reference, schedule meeting, create task, set reminder, label thread
eval_027 | Client escalation | multi-artifact | gmail

Urgent: Client Acme Corp reported a data sync issue. @DevOps needs to investigate by end of day. Schedule a war room for 4pm today. Also loop in the VP of Customer Success.

action item, deadline, meeting suggestion, create task, set reminder, schedule meeting
eval_028 | Product launch checklist | multi-artifact | gmail

Launch checklist for Project Aurora: 1) Marketing blog post needs final review by Tuesday. 2) Support docs must be live by Wednesday. 3) Schedule the all-hands demo for Thursday 11am. 4) File all launch materials under Aurora.

action item, action item, meeting suggestion, deadline, project reference, create task, schedule meeting, set reminder, label thread
eval_029 | Hiring pipeline update | multi-artifact | gmail

Hiring update: We have 3 finalists for the senior engineer role. @HR needs to send offer letters by Friday. Schedule a debrief with the hiring panel for Wednesday morning. This is for the Platform Team hiring initiative.

action item, deadline, meeting suggestion, project reference, create task, set reminder, schedule meeting, label thread
eval_030 | Cross-team coordination | multi-artifact | chat

Need to align with the infrastructure team on the database migration timeline. @Jake from Infra can meet Tuesday or Wednesday. Also, update the Project Nexus tracker with the new timeline and remind the team about the code freeze on Friday.

meeting suggestion, action item, deadline, project reference, schedule meeting, create task, set reminder, label thread
eval_031 | Simple approval | single-artifact | chat

Can you approve my PTO request for next week?

approval request, send reply
eval_032 | Travel notification | single-artifact | chat

FYI I will be working from the London office next month from March 10-14.

travel info, auto file
eval_033 | Clear project mention | single-artifact | gmail

Tagging this thread under Project Lighthouse for tracking purposes.

project reference, label thread
eval_034 | Celebration message | no-action | chat

Congrats to the team on shipping the v2.0 release! Great work everyone.

fyi, no action
eval_035 | Out of office notice | no-action | gmail

I am out of office today. Back tomorrow. For urgent issues, contact @Mike.

fyi, no action
eval_036 | Acknowledgment | no-action | chat

Got it, will review when I am back at my desk.

fyi, no action
eval_037 | Meeting recap, no action | no-action | chat

Quick recap: the meeting went well. We are aligned on the approach. Nothing else needed from anyone right now.

fyi, no action
eval_038 | Contract approval | approval | gmail

The vendor contract for CloudSync needs your signature before we can proceed. Legal has reviewed and approved. Please sign off by Thursday.

approval request, deadline, send reply, set reminder
eval_039 | Reply to client question | approval | gmail

The client is asking if we can support SSO integration in the next release. Can you respond to them with our timeline? They need an answer by tomorrow.

question, deadline, send reply, set reminder
eval_040 | Expense report approval | approval | gmail

Your expense report for the NYC trip ($2,340) is pending approval. Please review and approve or reject by end of week.

approval request, deadline, send reply, set reminder
eval_041 | Sarcastic message | edge-case | chat

Oh great, another meeting. Sure, let us meet at 3pm Friday to discuss why we have too many meetings.

meeting suggestion, schedule meeting
eval_042 | Message with only a link | edge-case | chat

Here is the doc: https://docs.google.com/document/d/abc123

fyi, no action
eval_043 | Indirect deadline | edge-case | gmail

The board presentation is next Thursday. We absolutely cannot miss this one. Make sure all financials are finalized.

deadline, action item, set reminder, create task
eval_044 | Multiple people mentioned | edge-case | chat

@Anna, @Ben, and @Carlos: please each submit your section of the design doc by Monday. We will merge and review Tuesday.

action item, deadline, meeting suggestion, create task, set reminder, schedule meeting
eval_045 | Forwarded message context | edge-case | gmail

Forwarding this from the sales team. They need engineering capacity for the Acme integration. Can someone take ownership? Target completion is end of Q1.

action item, deadline, create task, set reminder
eval_046 | Decision needed | question | chat

Should we go with AWS or GCP for the new data pipeline? Need a decision before we start the sprint on Monday.

question, deadline, send reply, set reminder
eval_047 | Information request | question | chat

Does anyone have the latest conversion rate numbers for the signup funnel? I need them for the investor update.

question, send reply
eval_048 | Weekly status email | long-message | gmail

Weekly status for Project Cosmos: Backend API is 80% complete, targeting beta by March 28. Frontend has a blocker on the auth flow that @Rachel is investigating. We need to schedule a cross-team sync with the mobile team before the end of next week. Also, please update the JIRA board and file this update under Project Cosmos.

action item, meeting suggestion, deadline, action item, project reference, create task, schedule meeting, set reminder, label thread
eval_049 | Executive strategy memo | long-message | gmail

Strategy memo: After reviewing the competitive landscape, we should accelerate the AI features roadmap. Priority 1: ship smart compose by June. Priority 2: launch predictive scheduling by Q3. I need the engineering leads to provide effort estimates by next Friday. Schedule a strategy offsite for the leadership team in April. File under Project Intelligence.

action item, deadline, meeting suggestion, project reference, create task, set reminder, schedule meeting, label thread
eval_050 | Single emoji response | short | chat

Noted.

fyi, no action
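Several cases list the same label more than once (eval_004 lists "action item" twice, eval_022 lists "question" three times), so a scorer that compares labels as plain sets would lose those counts. The sketch below compares labels as multisets instead. It assumes the labels shown under each case are the expected outputs and that precision and recall are the metrics of interest; neither is stated on this page, and `multiset_scores` is a hypothetical helper, not the tool's actual scoring code.

```python
from collections import Counter
from typing import List, Tuple


def multiset_scores(expected: List[str], predicted: List[str]) -> Tuple[float, float]:
    """Return (precision, recall), counting repeated labels separately."""
    exp, pred = Counter(expected), Counter(predicted)
    # Counter intersection keeps the minimum count of each shared label.
    overlap = sum((exp & pred).values())
    if pred:
        precision = overlap / sum(pred.values())
    else:
        # Nothing predicted: treated here as perfect only if nothing was expected.
        precision = 1.0 if not exp else 0.0
    recall = overlap / sum(exp.values()) if exp else 1.0
    return precision, recall


# The first five labels listed for eval_004, versus a prediction that finds only one action item.
expected = ["action item", "action item", "meeting suggestion", "deadline", "project reference"]
predicted = ["action item", "meeting suggestion", "deadline", "project reference"]
precision, recall = multiset_scores(expected, predicted)
print(precision, recall)
```

With set-based comparison this prediction would look perfect; the multiset version reports a recall of 4 out of 5 because the second action item was missed.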