"Guid has already been taken"

And you are sure you’re seeing the same error:

for the diasp.org user? Is it exactly the same message, or something different? Because what you’re describing doesn’t make sense at all from an implementation perspective. Our code is “find the person by its diaspora_id, check if the person is there, check if it has a profile, and only if something is missing, run a discovery cycle”, so your pod should never even attempt to discover urbanroman@diasp.org, because it knows the person and the profile…
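In SQL terms the check boils down to roughly the following — just a sketch, the real lookup lives in Ruby in find_or_fetch_by_identifier, and the table/column names are what diaspora’s schema usually looks like:

-- discovery should only run if this returns no person row or no profile row
SELECT p.id, pr.id AS profile_id
  FROM people p
  LEFT JOIN profiles pr ON pr.person_id = p.id
 WHERE p.diaspora_handle = 'urbanroman@diasp.org';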

I’m seeing a variety of error messages that share the string “Guid has already been taken”. For this particular user I just grepped the log again and saw something interesting (I snipped out a few large blocks of encoded data):

[2020-06-15T15:36:16] INFO  PID-339 TID-47235927321940 DiasporaFederation::Federation::Receiver::Public: successfully received Reshare:539737a090800138eb6e047d7b62795e:c625ad6090750138eb74047d7b62795e from person xxxx@diasp.org
[2020-06-15T15:36:17] INFO  PID-339 TID-47235927321940 DiasporaFederation::Federation::Receiver::Public: successfully received Comment:e05a83a090890138eb82047d7b62795e:539737a090800138eb6e047d7b62795e from person xxxx@diasp.org
[2020-06-15T15:36:32] WARN  PID-339 TID-47235962459140 Sidekiq: {"context":"Job raised exception","job":{"class":"Workers::ReceivePublic","args":["<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<me:env xmlns:me=\"http://salmon-protocol.org/ns/magic-env\">\n  <me:data type=\"application/xml\">...</me:sig>\n</me:env>\n",false],"retry":10,"queue":"urgent","backtrace":15,"jid":"cc00609a47afc17eefe4c57e","created_at":1592149758.3205445,"locale":"en","enqueued_at":1592234835.9817057,"error_message":"Failed discovery for xxxx@diasp.org: ActiveRecord::RecordInvalid: Validation failed: Guid has already been taken","error_class":"DiasporaFederation::Discovery::DiscoveryError","failed_at":1592149758.5757997,"retry_count":9,"error_backtrace":["app/models/person.rb:325:in `find_or_fetch_by_identifier'","config/initializers/diaspora_federation.rb:85:in `block (3 levels) in <top (required)>'","app/workers/receive_public.rb:11:in `block in perform'","app/workers/receive_base.rb:11:in `filter_errors_for_retry'","app/workers/receive_public.rb:10:in `perform'"],"retried_at":1592165954.635385},"jobstr":"{\"class\":\"Workers::ReceivePublic\",\"args\":[\"<?xml version=\\\"1.0\\\" encoding=\\\"UTF-8\\\"?>\\n<me:env xmlns:me=\\\"http://salmon-protocol.org/ns/magic-env\\\">\\n  <me:data type=\\\"application/xml\\\">...</me:data>\\n  <me:encoding>base64url</me:encoding>\\n  <me:alg>RSA-SHA256</me:alg>\\n  <me:sig key_id=\\\"...\\\">...</me:sig>\\n</me:env>\\n\",false],\"retry\":10,\"queue\":\"urgent\",\"backtrace\":15,\"jid\":\"cc00609a47afc17eefe4c57e\",\"created_at\":1592149758.3205445,\"locale\":\"en\",\"enqueued_at\":1592234835.9817057,\"error_message\":\"Failed discovery for xxxx@diasp.org: ActiveRecord::RecordInvalid: Validation failed: Guid has already been taken\",\"error_class\":\"DiasporaFederation::Discovery::DiscoveryError\",\"failed_at\":1592149758.5757997,\"retry_count\":9,\"error_backtrace\":[\"app/models/person.rb:325:in `find_or_fetch_by_identifier'\",\"config/initializers/diaspora_federation.rb:85:in `block (3 levels) in <top (required)>'\",\"app/workers/receive_public.rb:11:in `block in perform'\",\"app/workers/receive_base.rb:11:in `filter_errors_for_retry'\",\"app/workers/receive_public.rb:10:in `perform'\"],\"retried_at\":1592165954.635385}"}

Does that help at all?

Interestingly I’m seeing some posts from the user I’ve been investigating that popped up just today. It looks like whatever this is, it’s intermittent.

That makes even less sense. The log shows that a post was successfully received, while the error message shows that the failing job is on its 10th retry. We retry things that fail, and that’s fine since the profile is clearly there. It’s still unknown why we tried to discover the profile in the first place, as it clearly exists, but whatever.

You shared GUID 9c54ab737368bed2, which is https://diasp.org/people/9c54ab737368bed2. This profile is known to your server, so there is no issue there. Also, you claimed there is a federation issue, but I don’t see any indication of that. Their latest post is https://diasp.org/posts/50ee848091440138ec3a047d7b62795e, and that post successfully arrived on your pod: https://social.gibberfish.org/posts/50ee848091440138ec3a047d7b62795e

Do you have any example of a user that a) is not a carelessly renamed friendica instance, and b) actually has trouble federating content to your server? It honestly feels like we’re debugging an issue that isn’t even there.

Also, you mentioned

did you do this while the server was running?

did you do this while the server was running?

No.

As far as federation goes, compare:

https://nerdpol.ch/people/9c54ab737368bed2
https://social.gibberfish.org/people/9c54ab737368bed2

You’ll notice that there’s a gap.

I can’t; remote profile pages are not publicly visible and I do not have an account on your pod. Gaps can have a wide range of reasons, including pod downtime (during a migration, for example), or solar flares causing posts to take an extra roundtrip through the retry queue. Since new posts from that user appear to arrive, I don’t see anything we can do to investigate this case; it isn’t broken.

Either way, unless there is a profile that consistently fails to fetch or has an inconsistent state, there is nothing we can debug here, as … well, I don’t know how to debug an issue that’s not there. I’ve chatted with other folks including @supertux88, and we’re honestly out of ideas.

I’d suggest you turn off diaspora and run a REINDEX DATABASE on your database. If that fails, something might be horribly wrong in the database layer, and the error message might point to what’s wrong. If it passes, then… no idea. Wait 7 days until the backlog, scheduled tasks, and retries are done, and check if you still see these errors. For things like receiving posts, we do retry if something goes wrong, up to 15 times with increasing delays, so those retries will all be done within a week.
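Roughly (with diaspora and Sidekiq stopped first; the database name below assumes the standard production setup):

-- run in psql against the diaspora database while the application is stopped
REINDEX DATABASE diaspora_production;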

If there is still an issue, and there is a profile that should be on the server but isn’t, we can try manually fetching posts and profiles and see where it fails.

Ok, I’ll try that. Thank you for the assistance.

Let us know if things keep failing in a week.

Looks like there’s trouble… :frowning:

diaspora_production=# REINDEX DATABASE diaspora_production;
ERROR: could not create unique index "index_comments_on_guid"
DETAIL: Key (guid)=(e18176ef-765e-dcb4-333b-039331201028) is duplicated.
CONTEXT: parallel worker

The comment is indeed in there twice, with different updated_at timestamps.

2020-06-07 09:55:30.933571
2020-06-07 21:22:10.0791
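For reference, a query along these lines lists both rows (the columns beyond guid and updated_at are just guesses at what’s useful to compare):

select id, guid, author_id, created_at, updated_at
  from comments
 where guid = 'e18176ef-765e-dcb4-333b-039331201028'
 order by created_at;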

I’m not seeing any of the “Guid has already been taken” errors for the associated user.

Looks like the comments table has 9 duplicate records:

diaspora_production=# select guid, count(*) from comments group by guid HAVING count(*) > 1;
guid | count
--------------------------------------+-------
0b6b25a8-205e-db60-5850-c96501507991 | 2
0b6b25a8-205e-dc88-cd60-707188197746 | 2
e18176ef-455e-dbfa-8b90-4e0340069256 | 2
e18176ef-585e-dbfd-1cc0-c64822629284 | 2
e18176ef-595e-dc1f-9a7b-812551559434 | 2
e18176ef-765e-dcb4-333b-039331201028 | 2
e18176ef-865e-dc16-43ae-0d6239894691 | 2
e18176ef-875e-e73d-e84d-e0b136848932 | 2
e18176ef-945e-dbfa-9b55-405043886660 | 2
(9 rows)

The posts table has 71 duplicates. people has 103. Ugh.
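The same kind of check against the other tables looks like this (both also have a guid column):

select guid, count(*) from posts  group by guid having count(*) > 1;
select guid, count(*) from people group by guid having count(*) > 1;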

I’ve been having some intermittent issues with stale file handles, which should be fixed now. I wonder if this was somehow causing postgres to unwittingly violate key constraints?

I know you guys don’t like manually modifying the database. Any ideas how to safely clean this up? Can I nuke the rows created since the migration and leave the originals?

Thanks.

(PS - Discourse made me wait 4 hours to post this because of some kind of new user limits. I may not be able to respond for a while)

Well, that kinda sucks. At least we now know why nothing is making sense, so that’s something.

We basically have two options here. The first is restoring to a known-good backup state (look at all duplicates, take the oldest timestamp?) and just eating the data loss; looking at the timestamps you posted, that’s already more than 7 days ago, so that’s probably not a good option. The other is to attempt to manually fix the database.

Deleting those records alone is not a solution. The GUID is not the table’s primary key, and since the primary keys look fine, there’s a good chance other rows still reference the duplicates, so one would need to update those references as well, and for that you kinda need to know the database structure. Ideally, we’d hop on some synchronous communication channel for that (IRC or so), but unfortunately my week is fully booked, so I can’t help. However, @supertux88 said he might be able to help, and he’ll send you a private message on Discourse soon, so you can coordinate the best channel and time to work on this together.
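To sketch what that cleanup looks like for a single duplicated post (everything below is illustrative: the referencing tables and columns, the placeholder ids, and the guid all have to be replaced after actually reading the schema):

-- run inside a transaction so it can be rolled back if something looks wrong
BEGIN;

-- 1. find both rows for one duplicated guid; keep the older one
SELECT id, guid, created_at FROM posts
 WHERE guid = '<duplicated guid>' ORDER BY created_at;
-- suppose this returns keeper id 111 (older) and duplicate id 222 (newer)

-- 2. repoint everything that references the duplicate at the keeper
--    (the comments/likes columns here are placeholders; check the real foreign keys)
UPDATE comments SET commentable_id = 111
 WHERE commentable_type = 'Post' AND commentable_id = 222;
UPDATE likes SET target_id = 111
 WHERE target_type = 'Post' AND target_id = 222;
-- ...repeat for every other table that references posts.id...

-- 3. only then delete the duplicate row
DELETE FROM posts WHERE id = 222;

COMMIT;

The same dance is needed for the duplicate comments and people rows, each with their own set of referencing tables.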

Well, yes, but postgres should complain loudly if there is a filesystem issue. I’ve seen a similarly broken diaspora* database before, but that was caused by a power failure paired with a somewhat broken hard drive, so it was unsurprising. (But that pod is back online now and only lost a couple of posts!) I did ask about the rsync during runtime because that’s something postgres absolutely does not like, but who knows what happened here. These things usually don’t happen.

Sorry about that, that’s a limit for young accounts. It should no longer be an issue as you have enough posts and your account is older than 24 hours, but I manually promoted you out of the “new user” state just to make sure.

Great. I appreciate all the help.

It looks like @supertux88 and I were able to get all the duplication resolved, and things look good now. “Missing” content is starting to filter in. Thanks to you both!
